Energy-Efficient ASIC Accelerators for Machine/Deep Learning Algorithms by Kim, Minkyu (Author) et al.
Energy-Efficient ASIC Accelerators for Machine/Deep Learning Algorithms
by
Minkyu Kim
A Dissertation Presented in Partial Fulfillment
of the Requirements for the Degree
Doctor of Philosophy
Approved November 2019 by the
Graduate Supervisory Committee:
Jae-sun Seo, Chair
Yu (Kevin) Cao
Sarma Vrudhula
Umit Ogras
ARIZONA STATE UNIVERSITY
December 2019
ABSTRACT
While machine/deep learning algorithms have been successfully used in many
practical applications, including object detection and image/video classification,
accurate, fast, and low-power hardware implementations of such algorithms remain
challenging, especially for mobile systems such as Internet of Things (IoT) devices,
autonomous vehicles, and smart drones.
This work presents an energy-efficient programmable application-specific
integrated circuit (ASIC) accelerator for object detection. The proposed ASIC supports
multiple object classes (face, traffic sign, car license plate, pedestrian), many
objects (up to 50) in one image at different sizes (6 down-scaling and 11 up-scaling
steps), and high accuracy (87% on face detection datasets). The accelerator is
composed of an integral channel detector with 2,000 classifiers for five rigid boosted
templates, which together form a strong object detector. By jointly optimizing the
algorithm and the hardware architecture, the prototype chip implemented in 65nm
demonstrates real-time object detection at 20-50 frames/s while consuming
22.5-181.7mW (0.54-1.75nJ/pixel) at a 0.58-1.1V supply.
To reduce computation without accuracy degradation, this work also proposes an
energy-efficient deep convolutional neural network (DCNN) accelerator based on a
novel conditional computing scheme that integrates convolution with the subsequent
max-pooling operations. This way, the total number of bit-wise convolutions can be
reduced by ∼2× without affecting the output feature values. This work also develops
an optimized dataflow that exploits sparsity, maximizes data re-use, and minimizes
off-chip memory access, improving upon existing hardware works. The total off-chip
memory access is reduced by 2.12×. Preliminary post-layout simulation results in
40nm show that the proposed DCNN accelerator achieves a peak of 7.35 TOPS/W for
VGG-16.
A number of recent efforts have attempted to design custom inference engines based
on various approaches, including the systolic architecture, near memory processing,
and the in-memory computing concept. This work presents a comprehensive comparison
of these approaches in a unified framework. This work also presents an
energy-efficient in-memory computing accelerator for deep neural networks (DNNs)
that integrates many instances of in-memory computing macros with an ensemble of
peripheral digital circuits. The accelerator seamlessly supports configurable multibit
activations and large-scale DNNs while substantially improving the chip-level energy
efficiency. The proposed accelerator is fully designed in 65nm, demonstrating ultralow
energy consumption for DNNs.
ACKNOWLEDGMENTS
I would sincerely like to thank my advisor, Dr. Jae-sun Seo, for the opportunity,
motivation, and guidance throughout my doctoral studies. His insightful
recommendations helped me navigate the challenges that I faced during my study.
Without his mentoring and persistent help, this dissertation would not have been
possible. I am also thankful to Dr. Yu (Kevin) Cao, Dr. Sarma Vrudhula, and Dr. Umit
Ogras for taking the time to serve on my Ph.D. defense committee.
I am also especially thankful to my master's degree advisor, Dr. Young Hwan Kim
(Professor, Department of Electrical Engineering at POSTECH), and my colleague and
mentor, Dr. Suk-Ju Kang (Associate Professor, Department of Electronic Engineering
at Sogang University), for their persistent encouragement and guidance on my Ph.D.
journey.
I would also like to thank a set of colleagues and friends for their encouragement,
frequent interactions, and help: Dr. Deepak Kadetotad, Xiaoyang Mi, Shihui Yin,
Ussama Awais, Shreyas Venkataramanaiah, Sai Kiran Cherupally, Jyotishman Saika,
Dr. Abinash Mohanty, Dr. Yufei Ma, Dr. Naveen Suda, Dr. Rui Liu, Xiaochen Peng,
Xiaoyu Sun, Dr. Doohwang Chang, and Seunghyun Lee.
Special thanks to my family. My parents, Seoungdoo Kim and Heejung Moon,
my parents-in-law, Jongseob Kim and Kisoon Park, my sister, Nayoung Kim, my
brothers-in-law, Yong Tae An and Jeongmin Kim, my lovely wife, Jeong A Kim, and
my precious sons, Jeeyule Kim and William Doha Kim, deserve most of the credit for
this work. Without their selfless love, sacrifice, endless support, and encouragement,
none of this would have been possible. I dedicate this thesis to my amazing family.
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
CHAPTER
1 INTRODUCTION
2 AN ENERGY-EFFICIENT HARDWARE IMPLEMENTATION OF OBJECT DETECTION ACCELERATOR
2.1 Introduction
2.2 Overview of the Object Detection Algorithm (HeadHunter Model)
2.3 Energy Efficient Hardware Architecture Based on Rigid Boosted Templates
2.3.1 Hardware Architecture and Operation
2.3.2 Hardware Optimization Techniques
2.4 Algorithm Adaptations for Hardware Efficiency
2.5 65nm Implementation Results
2.6 Conclusion
3 PRECISION-CASCADING BASED HARDWARE ACCELERATOR FOR DEEP CONVOLUTIONAL NEURAL NETWORK
3.1 Introduction
3.2 Overview of the Proposed Conditional Computing Scheme
3.2.1 Precision-Cascading (PC) Scheme
3.2.2 Fully Zero Skipping (ZS) Scheme Integrating Precision-Cascading
3.3 Proposed Convolution Loop Acceleration Strategy
3.4 Energy-Efficient DCNN Architecture Based on PC and ZS
3.5 Implementation Results in 40nm CMOS
3.6 Conclusion
4 ENERGY-EFFICIENT IN-MEMORY COMPUTING ACCELERATOR FOR DEEP NEURAL NETWORKS
4.1 Introduction
4.2 Recent In-Memory Computing Hardware Designs for DNN
4.3 XNOR-SRAM: Scalable SRAM Macro for In-Memory Computing
4.4 Practical Challenges of In-Memory Computing-Based Accelerators
4.5 Microarchitecture of the Proposed Accelerator
4.5.1 Microarchitecture Overview
4.5.2 Multibit Activation Support
4.5.3 Activation Memory
4.5.4 Mapping of Convolution, Fully Connected, and Other Layers
4.6 Experimental Results
4.6.1 Experiment Setup
4.6.2 Area, Energy, and Throughput
4.7 Conclusion
5 ARCHITECTURE BENCHMARK OF NEURO-INSPIRED COMPUTING SYSTEM
5.1 Introduction
5.2 Systolic and Near Memory Processing Architecture Design
5.2.1 Systolic Architecture Design
5.2.2 Near Memory Processing (NMP) Architecture Design with SRAM
5.3 Processing-In-Memory Architecture Design Based on RRAM
5.3.1 Pseudo-Crossbar Array Structure
5.3.2 Mapping Kernels in Crossbar Arrays
5.3.3 Chip Architecture Design
5.4 Benchmarking Across Architectures
5.4.1 Experiment Setup
5.4.2 Benchmark Results and Discussion
5.5 Conclusion
6 SUMMARY
REFERENCES
LIST OF TABLES
Table
2.1 Chip Specifications
2.2 Power Breakdown with Various Configurations
2.3 Delay Time, Power, Energy Versus Different Number of Stages in Adaptive Cascading
2.4 Comparison to Prior ASIC Works on Object Detection
3.1 Analysis Results of Zero Percentage at VGG-16 for ILSVRC2012 Valid Set
3.2 Analysis Results of DRAM Access on Different Cases of Inter-tiling Strategy
3.3 Chip Specifications
3.4 Performance Breakdown of the VGG-16 in ILSVRC2012
4.1 Comparison with Prior Works
5.1 Architecture Properties
5.2 VGG-like CNN Layer Configuration
5.3 Benchmark Results
LIST OF FIGURES
Figure
2.1 Illustration of Multi-class Object Detection (e.g., Face, Traffic Sign, Car License Plate, Pedestrian) with 10 Channels, 17 Scales, 2000 Weak Classifiers, and Non-maximum Suppression.
2.2 Multiple Channel Features with Six HOGs, a Gradient Magnitude, and LUV Color Space.
2.3 The Concept of Fast Computation for an Area Sum Using Integral Image.
2.4 Conceptual Operation of 2,000 Weak Classifiers.
2.5 High-level Pseudo-code of the Overall Object Detection Operation (Left) and Corresponding Modular Nested Structure on Hardware (Right).
2.6 Top-level Block Diagram and the End-to-end Data Flow of Proposed Object Detection Accelerator.
2.7 Block Diagram and Data Flow of Classifier Operation.
2.8 Illustration of the Down-sampling and Storage of Generated Channel Data in (a) the Baseline Scheme and (b) the Proposed Scheme.
2.9 Illustration of Obtaining Correct Integral Data over 20x20 Window with 12x10 Window of Integration.
2.10 Pre-processing Step for the NMS Function Is Illustrated.
2.11 Data Re-use and Parallel Computing Scheme for Multiple Adjacent Search Windows.
2.12 Weight Reordering and Adaptive Classification. If the Intermittent Sum Is Larger than Upper Threshold (Left) or Smaller than Lower Threshold (Right), the Remaining Classifier Operations Are Skipped. Otherwise, 2000 Classifiers Are Computed (Middle).
2.13 65nm Prototype Chip Micrograph.
2.14 Chip Measurement Results of Multi-scale Multi-object Detection on Face and Traffic Sign Images.
2.15 System Test Environment.
2.16 Measured Frame Rate and Total/Leakage Power with Voltage Scaling.
2.17 Measured Precision Versus Recall Curve with Multiple Object Classes.
2.18 Precision-recall Curves on the FDDB Datasets for the Different Number of Weak Classifiers in Our Proposed Adaptive Classifier Cascading.
2.19 Precision-recall Curves on the FDDB Datasets for the Various Number of Scale Factors.
2.20 (a) Design Comparison Using Up-/Down-scaling. (b) Measurement Results Show That Smaller Faces Can Be Detected Through Up-scaling.
2.21 (a) Area and (b) Power Breakdown of the Overall System.
3.1 Precision-cascading Multiplication of Input Feature by Kernel Feature.
3.2 The Conceptual Operation of Precision-cascading Scheme.
3.3 The Conceptual Operation of Fully Zero Skipping Scheme.
3.4 Illustration of Integrating PC and ZS.
3.5 Four Levels of Convolution Loops, Where L Denotes the Index of Convolution Layer and S Denotes the Sliding Stride.
3.6 Analysis of Unroll Loop-2 Vs. Unroll Loop-4.
3.7 Illustration of Inter-tiling Loop Order.
3.8 The Concept of Special and Efficient Architecture for PC and ZS.
3.9 Top-level Block Diagram and the End-to-end Data Flow of Proposed DCNN Accelerator.
3.10 Supports Various Kernel Size for (a) 3 X 3, (b) 5 X 5, (c) 7 X 7.
3.11 Micrograph of the Proposed Accelerator Chip.
3.12 Example of Loading Sparsity Map/Input Features from External Memory.
3.13 DRAM Access Comparison to Eyeriss in VGG-16 Convolution Layers.
3.14 Measured Frame Rate per Second and Total/Leakage Power with Voltage Scaling.
3.15 (a) Power and (b) Area Breakdown of the Overall System.
3.16 Latency Comparison Between with PC + ZS and without PC + ZS.
3.17 Energy-Efficiency Comparison Between with PC + ZS and without PC + ZS.
4.1 Left: Conventionally SRAM Data Are Read Out Row-by-row to Perform Computation at the Periphery. Right: In-memory Computing Schemes Embed Logic Computation Inside SRAM by Turning on All Rows Simultaneously.
4.2 Comparison of Recent In-SRAM Computing Hardware Demonstrations.
4.3 XNOR-SRAM Design Proposed in (Jiang et al. (2018)).
4.4 (a) Overall Microarchitecture of the Proposed In-memory Computing Accelerator. (b) Computations for the Thermometer-to-binary Conversion, LUT, Batch Normalization, and so On. (c) Block Diagram of the Activation Memory Buffer.
4.5 Timing Diagram of the Accelerator Operation for Two Adjacent Layers of a CNN, Including In-memory Computing, Double-buffering, and Peripheral Computations.
4.6 Classification Accuracy for CIFAR-10 Data Set Is Shown Across Different Activation Precision Values for Four Different DNN Sizes. For All Data Points, the Weight Precision Is Binary (Only Two Values of +1 or -1).
4.7 Illustration of Convolution Layer Feature Map Storage and Access Scheme in Nine Independent SRAM Arrays. (a) 3x3 Window Starting from (0,0). (b) 3x3 Window Starting from (0,1). (c) 3x3 Window Starting from (2,2).
4.8 Mapping Convolution Layers (Left) and Fully Connected Layers (Right) of Deep CNNs onto the Proposed Accelerator Employing XNOR-SRAM Macros with In-memory Computing.
4.9 Including the XNOR-SRAM Prototype Chip Layout, the Layout of Activation Memory Buffer/Controller, Accumulation, and Batch Normalization Modules Are Shown.
4.10 Energy Breakdown of the Entire MLP Designed for MNIST Data Set.
4.11 Energy Breakdown of the Entire CNN Designed for CIFAR-10 Data Set. Two Different Sizes of CNNs (1x and 0.5x) and Three Different Activation Precision Schemes (1-3 Bit) Are Shown.
5.1 The Diagram of Conventional Systolic Architecture.
5.2 The Diagram of Near Memory Processing (NMP) Architecture, Where the SRAM Banks Are Used to Save Weight Data.
5.3 The Diagram of Pseudo-crossbar Array, Which Performs Analog Matrix-vector Multiplication by Accumulating Currents Through Sourcelines (SLs) Naturally.
5.4 A Mapping Method of Input Data and Kernels in Convolutional Layers to the Crossbar Array.
5.5 The Diagram of Processing-in-memory (PIM) Architecture Based on RRAM Technology.
5.6 Classification Accuracy of CIFAR-10 for an 8-bit CNN as a Function of the ADC Precision for Partial Sums.
5.7 Example of Pipeline in RRAM Architecture.
5.8 Energy Breakdown of Systolic Architecture, NMP Architecture and Pipelined Parallel RRAM Architectures.
5.9 Sequential and Parallel RRAM Architectures with and Without Pipelining, for Cell-precision of 1-bit, 2-bit and 4-bit.
5.10 Area Breakdown of (a) Systolic Architecture; (b) NMP Architecture, and (c) Pipelined Parallel RRAM Architecture (4-bit/Cell).
Chapter 1
INTRODUCTION
Machine learning has become ubiquitous in applications including object detec-
tion, image/video classification, speech recognition, and natural language processing.
Deep neural networks (DNNs), including convolutional neural networks (CNNs) and
recurrent neural networks (RNNs), have achieved unprecedented accuracy for the
aforementioned tasks, but the network models involve an increasingly high amount
of computation, memory, and communication.
While machine/deep learning algorithms have been successfully used in many
practical applications, accurate, fast, and low-power hardware implementations of
such algorithms remain a challenging task, especially for mobile systems such as
Internet of Things (IoT) devices, autonomous vehicles, and smart drones. Hardware
designs using general-purpose processors such as CPUs and GPUs do not provide
satisfactory energy efficiency. This is due to (1) high computational complexity that
varies with algorithms and (2) the large memory/communication requirement independent
of input, which generates significant data movement that can be as energy consuming
as computation. Therefore, many prior works have focused on flexible
application-specific integrated circuits (ASICs) to address these challenges. It is crucial
to design a computing scheme that can support high parallelism and optimize the data
flow and communication. The underlying hardware needs to be reconfigurable to support
various models and should reduce the amount of computation and data movement
depending on the input data. To further improve energy efficiency, an optimized memory
hierarchy and data sparsity statistics can also be exploited.
High-performance and low-power object detection is an essential task for vision
processors. While significant improvement has recently been made in algorithms
(Mathias et al. (2014), Li et al. (2015), Yang et al. (2017), Ranjan et al. (2017)),
hardware practice on conventional CPUs, GPUs (Benenson et al. (2012)), and FPGAs
(Advani et al. (2015)) still lacks sufficient energy efficiency and speed to make
real-time decisions within the power envelope of an embedded system. Previous
works have proposed special-purpose ASICs for object detection, using hand-crafted
features such as HOGs (Takagi et al. (2014), Suleiman et al. (2017b)) and learned
features such as CNNs (Yin et al. (2017), Chen et al. (2018c)). In general, CNN
learned features outperform hand-crafted features in object detection accuracy,
whereas hand-crafted features are more energy-efficient than learned features
in hardware implementations. Employing the HeadHunter model based on rigid
templates (Mathias et al. (2014)), this work proposed an energy-efficient and accurate
ASIC accelerator for object detection that overcomes the limitations of ASICs based
on either type of feature and offers the following:
1. Multiple classes (e.g., face, traffic sign, car license plate, pedestrian) that are
programmable in the accelerator
2. Many objects (up to 50) in one image with multiple scales (17-scale support
with 6 down-scaling and 11 up-scaling)
3. High accuracy (average precision of 0.87/0.81/0.72/0.76/0.53 on the FDDB/AFW/
BTSD/Caltech plate/INRIA Person datasets) comparable to state-of-the-art
algorithms
4. Energy-efficient hardware architecture based on rigid boosted templates for low
power of 22.5mW and low energy per pixel of 0.54 nJ/pixel
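The power and energy-per-pixel figures quoted for the prototype can be related by a short calculation. The sketch below assumes that energy per pixel equals average power divided by the pixel rate (frames/s times pixels per frame), and that the low and high ends of the quoted power and frame-rate ranges belong together. The 1920x1080 frame size is an assumption made here for illustration and is not stated in this chapter; the function name `energy_per_pixel_nj` is hypothetical.

```python
def energy_per_pixel_nj(power_w, frames_per_s, pixels_per_frame):
    """Energy per processed pixel in nanojoules.

    Assumes energy/pixel = power / (frame rate * pixels per frame).
    """
    return power_w / (frames_per_s * pixels_per_frame) * 1e9


# Assumed frame size (not given in the text): 1920 x 1080 pixels.
PIXELS = 1920 * 1080

low = energy_per_pixel_nj(22.5e-3, 20, PIXELS)    # 22.5 mW at 20 frames/s
high = energy_per_pixel_nj(181.7e-3, 50, PIXELS)  # 181.7 mW at 50 frames/s
print(f"{low:.2f} nJ/pixel, {high:.2f} nJ/pixel")
```

Under these assumptions the two operating points come out close to the 0.54 and 1.75 nJ/pixel quoted for the chip, which suggests, but does not confirm, that the figures refer to full-HD frames.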
Special-purpose ASIC accelerators for deep convolutional neural networks (DCNNs)
have been previously proposed (Chen et al. (2016), Moons et al. (2017), Lee
et al. (2019)). Eyeriss (Chen et al. (2016)) proposed a spatial architecture with a
row-stationary data flow to minimize the data movement energy cost for any CNN shape
and employed run-length compression that exploits the statistics of zero data.
However, the accelerator in (Chen et al. (2016)) achieved 245.6 GOPS/W on VGG-16,
which is much lower than the state-of-the-art architectures (Moons et al. (2017), Lee
et al. (2019)). Envision (Moons et al. (2017)) proposed a
dynamic-voltage-accuracy-frequency scaling scheme and a dynamic precision technique
with body-bias modulation. LNPU (Lee et al. (2019)) proposed fine-grained mixed
precision and zero skipping with sparse encoding based on run-length compression. The
architectures in (Moons et al. (2017), Lee et al. (2019)) achieved high energy
efficiency of 2 TOPS/W and 5.84 TFLOPS/W on VGG-16, respectively. However, neither
work (Moons et al. (2017), Lee et al. (2019)) evaluated the off-chip memory energy
cost of its implementation.
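To give a sense of scale for the VGG-16 workload used in these comparisons, the following sketch counts parameters and multiply-accumulate (MAC) operations for one inference. It assumes the standard published VGG-16 configuration (thirteen 3x3 same-padded convolutions, five 2x2 max-pooling layers, three fully connected layers) and a 224x224 RGB input; the function name `vgg16_workload` is hypothetical, and the accelerators discussed here may count operations differently.

```python
# Standard VGG-16 layer configuration: numbers are output channels of 3x3
# convolutions, "M" is a 2x2 max-pooling layer that halves the resolution.
CONV_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
            512, 512, 512, "M", 512, 512, 512, "M"]
FC_CFG = [4096, 4096, 1000]


def vgg16_workload(size=224, in_ch=3):
    """Return (parameters, MACs) for one inference on a size x size image."""
    params = macs = 0
    for item in CONV_CFG:
        if item == "M":
            size //= 2
            continue
        weights = 3 * 3 * in_ch * item
        params += weights + item           # weights plus biases
        macs += weights * size * size      # same padding, stride 1
        in_ch = item
    features = in_ch * size * size         # flattened input to the FC layers
    for out in FC_CFG:
        params += features * out + out
        macs += features * out
        features = out
    return params, macs


params, macs = vgg16_workload()
print(f"{params / 1e6:.1f} M parameters, {macs / 1e9:.1f} G MACs")
```

With 32-bit weights, the roughly 138 million parameters occupy about 550 MB, and a single image requires roughly 15.5 billion MACs, almost all of them in the convolution layers.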
This work investigated efficient custom hardware chip design for state-of-the-art
CNN algorithms, which are very accurate but require up to hundreds of megabytes
of data storage and billions of operations for a single inference pass. To reduce
computation without accuracy degradation, we proposed an energy-efficient CNN
accelerator based on a novel conditional computing scheme, called precision-cascading
(PC), which integrates convolution with the subsequent max-pooling operations. In
particular, we divide the input features into groups of precision values and first
perform approximate convolution computations with only the most significant bits
(MSBs) of the feature data. Based on this approximate computation, we find the maximum
value for the pooling output; if the maximum value cannot yet be found, further
approximate convolutions are computed in a cascaded manner. The full-precision
convolution is then performed only on the maximum pooling output that is found. This
way, the total number of bit-wise convolutions can be reduced by ∼2×, without
affecting the output feature values and with < 0.8% degradation in final classification
accuracy. In addition, we developed an optimized dataflow that exploits
sparsity, maximizes data re-use, and minimizes off-chip DRAM access, which
improves upon existing hardware works (Chen et al. (2016), Albericio et al. (2016)).
The total DRAM access is reduced by 2.12× when applying our proposed
energy-efficient data flow. Preliminary post-layout simulation results in 40nm CMOS
technology show that the proposed DCNN accelerator achieves a peak of 7.35 TOPS/W
for VGG-16, excluding external DRAM access.
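The idea of deciding the max-pooling winner from MSB-only convolutions can be illustrated with a small software sketch. It is not the dissertation's hardware algorithm: the pruning rule used here (bounding the contribution of the bits not yet processed, so that the final result is exact) is an assumption chosen for illustration, as are the 8-bit unsigned activations, the 2-bit precision groups, and the name `pc_max_pool`.

```python
import numpy as np


def pc_max_pool(patches, weights, group_bits=2, total_bits=8):
    """Max-pooled convolution output computed MSB-first.

    patches: (P, K) unsigned integer activations, one row per pooling candidate.
    weights: (K,) signed integer kernel weights.
    Returns (max_output, evaluations), where evaluations counts the
    reduced-precision dot products computed before the winner was known.
    """
    neg = int(weights[weights < 0].sum())  # sum of negative weights
    pos = int(weights[weights > 0].sum())  # sum of positive weights
    alive = np.arange(len(patches))        # candidates that may still be the max
    evaluations = 0
    for kept in range(group_bits, total_bits + 1, group_bits):
        dropped = total_bits - kept
        # Software shortcut: recompute with the top `kept` bits; hardware
        # would instead accumulate one precision group at a time.
        msb_only = (patches[alive] >> dropped) << dropped
        approx = msb_only @ weights
        evaluations += len(alive)
        # Each dropped field lies in [0, 2**dropped - 1], so the exact output
        # lies in [approx + rem * neg, approx + rem * pos].
        rem = (1 << dropped) - 1
        lower, upper = approx + rem * neg, approx + rem * pos
        alive = alive[upper >= lower.max()]  # prune candidates that cannot win
        if len(alive) == 1:
            break
    # Full-precision convolution only for the surviving candidate(s).
    return int((patches[alive] @ weights).max()), evaluations


rng = np.random.default_rng(0)
patches = rng.integers(0, 256, size=(4, 27))  # 2x2 pooling window, 3x3x3 kernel
weights = rng.integers(-8, 8, size=27)
value, evaluations = pc_max_pool(patches, weights)
assert value == int((patches @ weights).max())  # matches the full-precision result
```

Because a candidate is discarded only when its best possible output is below another candidate's worst possible output, this variant never changes the pooled value; how much work it saves depends on how far apart the candidates are.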
As a next step in energy-efficient hardware development, a number of recent efforts
have attempted to design custom inference engines across various technological
platforms, including emerging technologies, employing various data-processing
approaches. Among the emerging technologies, resistive random access memory (RRAM)
can naturally and efficiently support matrix-vector multiplication in a crossbar
structure by using its multiple conductance states as analog synapses. Recent designs
such as ISAAC (Shafiee et al. (2016)), PRIME (Chi et al. (2016)), and PipeLayer
(Song et al. (2017)) demonstrate that RRAM-based processing-in-memory (PIM) is a
promising solution for high energy efficiency with limited on-chip area. However, a
comprehensive comparison among different approaches such as the digital systolic
array, digital near-memory computing, and analog in-memory computing under the same
design assumptions and constraints is still missing in the literature. Therefore, the
trade-offs between inference accuracy, latency, and energy across different
technological platforms remain elusive.
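How a crossbar performs matrix-vector multiplication in the analog domain can be sketched with an idealized model: each cell contributes a current equal to its conductance times the input voltage (Ohm's law), and the currents on a shared column line add up (Kirchhoff's current law). The sketch below assumes ideal devices, non-negative integer weights mapped linearly onto evenly spaced conductance levels, and no ADC quantization; the conductance range and the name `crossbar_mvm` are hypothetical.

```python
import numpy as np


def crossbar_mvm(weights, inputs, cell_bits=2, g_min=1e-6, g_max=1e-4):
    """Idealized crossbar matrix-vector product.

    weights: (rows, cols) integers in [0, 2**cell_bits - 1], one per cell.
    inputs:  (rows,) input voltages applied to the rows.
    Returns the weighted sums recovered from the column currents.
    """
    levels = (1 << cell_bits) - 1
    assert weights.min() >= 0 and weights.max() <= levels
    step = (g_max - g_min) / levels          # conductance per weight level
    conductance = g_min + weights * step     # siemens, one value per cell
    currents = inputs @ conductance          # column currents summed in analog
    # Remove the offset caused by g_min and rescale to weight units.
    return (currents - g_min * inputs.sum()) / step


rng = np.random.default_rng(0)
W = rng.integers(0, 4, size=(128, 16))  # 2-bit weights
v = rng.random(128)                     # input voltages
print(np.allclose(crossbar_mvm(W, v), v @ W))
```

All rows are driven at once in this model, so one read step yields every column's dot product; real arrays add device variation, wire resistance, and limited ADC precision on top of this ideal behavior.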
This work performed a holistic comparison of three representative architectures:
a TPU-like systolic array (Kung (1980)), near memory processing with SRAM, and
processing-in-memory with binary or multi-bit RRAM. The comparison was carried out
by modifying a circuit-level simulator named NeuroSim (Chen et al. (2018a)) and
includes the overhead of off-chip DRAM access. This work focused
on the implementation of near memory processing with SRAM in 40nm CMOS
technology for various cases such as XNOR/8-bit SRAM sequential/parallel architectures.
Finally, this work compared the energy efficiency and frame rate of the three
representative architectures using a VGG-like CNN for inference on the
CIFAR-10 dataset.
In recent years, CMOS ASIC designs (Merolla et al. (2014), Chen et al. (2016),
Moons et al. (2017), Shin et al. (2017)) have shown that memory access is the biggest
bottleneck for energy-efficient real-time cognitive computing, in terms of storing
millions of parameters, loading them from embedded SRAM, and moving them
to where computing actually occurs. Although SRAM technology has been following
the CMOS scaling trend well, computing multiply-and-accumulate (MAC) operations
in DNNs with conventional SRAMs still requires millions to billions of row-by-row
accesses, which limits the parallelism and dissipates a large amount of read/write
energy. To overcome this limitation, in the last couple of years, several works
proposed the in-SRAM computing concept, which performs computation inside the SRAM
without reading out each row to a separate computing unit. However, most prior
works only demonstrated small DNNs for the MNIST data set with relatively low
accuracy (less than 96%) (Biswas and Chandrakasan (2018), Chen et al. (2018b), Khwa
et al. (2018)).
To implement a complete DNN accelerator using in-memory computing SRAMs,
there are several important challenges and missing pieces, including the integration
of many in-memory computing SRAM arrays, activation storage/communication,
additional digital logic for non-MAC operations, and row-by-row write energy
consideration. This work substantially expanded the prior single-array-level XNOR-SRAM
design (Jiang et al. (2018)) toward a configurable DNN accelerator architecture that
integrates 72 XNOR-SRAM arrays with inter-array communication and supports a
wide range of DNN/CNN algorithms with configurable activation precision. The
proposed accelerator supports 3×3 and 1×1 convolutional kernels and up to 256 feature
maps in a convolutional layer. The main contributions of this work are as follows:
1. The work constructed the chip-level DNN/CNN accelerator architecture that
employs many instances of in-memory computing SRAM (e.g., XNOR-SRAM)
macros with a methodology to efficiently load/map weights onto such XNOR-
SRAM arrays for convolution layers and fully connected layers of DNNs.
2. By reusing the XNOR-SRAM macro originally designed for binary activations
and weights, this work added peripheral digital logic that can support multibit
activations, which becomes an effective knob to trade off energy against
accuracy.
3. This work employed double-buffering technique with two groups of in-memory
computing SRAMs, which effectively hides the latencies of reprogramming in-
memory computing SRAM arrays with new DNN weights.
4. This work evaluated the chip-level energy benefits and remaining bottlenecks
of in-memory computing-based DNN accelerators.
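The binary MAC operation that an XNOR-based in-memory macro evaluates can be illustrated in software. For activations and weights restricted to +1 and -1, the dot product equals the number of positions where the signs agree minus the number where they differ, which is what an XNOR followed by a population count yields. The sketch below only illustrates this arithmetic identity, not the analog circuit; the bit encoding (1 for +1, 0 for -1) and the name `xnor_popcount_dot` are assumptions.

```python
import numpy as np


def xnor_popcount_dot(act_bits, weight_bits):
    """Dot product of two +/-1 vectors stored as bits (1 -> +1, 0 -> -1)."""
    agree = ~(act_bits ^ weight_bits) & 1  # XNOR: 1 where the signs match
    matches = int(agree.sum())             # population count
    return 2 * matches - act_bits.size     # matches minus mismatches


rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=256)  # one row of binary activations
w = rng.integers(0, 2, size=256)  # one column of binary weights
reference = int(((2 * a - 1) * (2 * w - 1)).sum())
print(xnor_popcount_dot(a, w) == reference)
```

In an in-memory design the agreement count is produced for all rows of an array in a single access, which is the source of the parallelism discussed above.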
The rest of this work is organized as follows. Chapter 2 demonstrates the proposed
energy-efficient programmable ASIC accelerator for object detection, which supports
multi-class, many-object in one image with different sizes, and high accuracy. This
chapter explains the employed object detection algorithm (HeadHunter model) in
detail and introduces the system architecture of the proposed hardware accelerator
including detailed features of main modules and hardware optimization techniques.
It also describes the proposed algorithm adaptations that were employed to improve
the hardware efficiency. This chapter finally presents the evaluation results of the
prototype chip implemented in 65nm CMOS.
In chapter 3, this work describes the proposed novel conditional computing scheme,
which we call precision-cascading (PC) and fully zero-skipping (ZS). It also presents
the proposed convolution loop acceleration strategy for better energy-efficiency. This
chapter describes the proposed energy-efficient DCNN accelerator architecture and
reports the implementation results in 40nm CMOS on speed, energy dissipation, and
DRAM access for an inference.
In chapter 4, this work reviews the architecture benchmark of neuro-inspired computing
systems. This chapter introduces three representative architectures: systolic,
near-memory processing with SRAM, and processing-in-memory based on
RRAM. It also describes the implementation of various near memory processing ar-
chitectures with SRAM in 40nm CMOS technology. This chapter finally reports the
comprehensive comparison results on the architecture benchmarks.
Chapter 5 reviews the XNOR-SRAM macro and discusses practical design chal-
lenges when we designed a chip-level DNN accelerator using many in-SRAM macros.
It also describes the microarchitecture of the proposed accelerator and the method-
ology to efficiently map optimized DNNs/CNNs onto XNOR-SRAM arrays. This
chapter reports experimental results on speed, energy dissipation, and classification
accuracy across several workloads.
Chapter 6 concludes this work.
Chapter 2
AN ENERGY-EFFICIENT HARDWARE IMPLEMENTATION OF OBJECT
DETECTION ACCELERATOR
This chapter proposes an energy-efficient hardware implementation of an object
detection accelerator. Machine learning has become ubiquitous in applications including
object detection, image/video classification, and natural language processing. While
machine learning algorithms have been successfully used in many practical applica-
tions, accurate, fast, and low-power hardware implementations of such algorithms are
still a challenging task, especially for mobile systems such as Internet of Things (IoT),
autonomous vehicles, and smart drones.
This work presents an energy-efficient programmable ASIC accelerator for object
detection. Our ASIC accelerator supports programmable multi-class detection (e.g.,
face, traffic sign, car license plate, and pedestrian), many objects (up to 50) in one
image with different sizes (17-scale support with 6 down-/11 up-scaling), and high
accuracy (AP of 0.87/0.81/0.72/0.76 for FDDB/AFW/BTSD/Caltech datasets).
We designed an integral channel detector with 2,000 classifiers for five rigid boosted
templates, where the number of stages used for classification can be adaptively
controlled depending on the content of the search window, which yields a strong object
detector. This can be implemented with more modular hardware than support
vector machine (SVM) and deformable parts model (DPM) designs. By jointly
optimizing the algorithm and efficient hardware architecture, the prototype chip im-
plemented in 65nm demonstrates real-time object detection of 20-50 frames/s with
low power consumption of 22.5-181.7mW (0.54-1.75 nJ/pixel) at 0.58-1.1V supply.
2.1 Introduction
Object detection is essential for intelligent computer vision applications such as
augmented reality (AR), advanced driver assistance systems (ADAS), autonomous control
in unmanned aerial vehicles (UAV), smart drones, surveillance systems, and Internet
of Things (IoT). These applications require real-time, highly accurate, and
energy-efficient object detection. While significant improvement has recently
been made in algorithms (Viola et al. (2001), Felzenszwalb et al. (2009), Mathias
et al. (2014), Li et al. (2015), Yang et al. (2017), Ranjan et al. (2017)), hardware de-
signs using general-purpose processors such as CPUs, GPUs (Benenson et al. (2012)),
and FPGAs (Advani et al. (2015)) do not provide satisfactory energy efficiency and
speed in order to make real-time decisions within the power envelope of embedded
systems. This is due to high computational complexity that varies with algorithms
and the large memory/communication requirement independent of input, which gen-
erates significant data movement that can be as energy consuming as computation.
Special-purpose ASICs for object detection have been previously proposed (Takagi
et al. (2014), Jeon et al. (2015), Suleiman and Sze (2016), Suleiman et al. (2017b)).
A real-time object detection engine using a Histogram of Oriented Gradients (HOG)
feature extraction in Support Vector Machine (SVM) was presented in (Takagi et al.
(2014)). However, the implementation only supported one scale factor, limiting the
detection accuracy and robustness. The authors of (Jeon et al. (2015)) designed a
specialized engine for face detection and recognition with low power consumption of
23mW, but it was not able to support multi-scale factors or multiple faces. Multi-scale
pedestrian detection was achieved in (Suleiman and Sze (2016)) with 12 scale factors,
but only down-scaling was used, limiting the detection of objects with a small number
of pixels. A multi-object detection accelerator with Deformable Parts Model (DPM)
was implemented in (Suleiman et al. (2017b)) with two programmable object
classification engines for 58.6mW power consumption, but still only supported down-scaling.
In this chapter, we propose an energy-efficient programmable ASIC accelerator (Kim
et al. (2019)) for object detection that overcomes the above limitations:
• Multiple classes (e.g., face, traffic sign, car license plate, pedestrian) that are
programmable in the accelerator
• Many objects (up to 50) in one image with multiple scales (17-scale support
with 6 down-scaling and 11 up-scaling)
• High accuracy (average precision of 0.87/0.81/0.72/0.76/0.53 on the FDDB/AFW/
BTSD/Caltech plate/INRIA Person datasets) comparable to state-of-the-art
algorithms
• Energy-efficient hardware architecture based on rigid boosted templates for low
power of 22.5mW and low energy per pixel of 0.54 nJ/pixel
Many object detection algorithms use classification models that
are trained on features instead of pixels (Viola et al. (2001), Felzenszwalb et al.
(2009), Mathias et al. (2014), Li et al. (2015), Yang et al. (2017), Ranjan et al.
(2017)). Hand-crafted features such as the well-known HOG have been traditionally
used in object detection including the Viola-Jones algorithm (Viola et al. (2001)),
DPM (Felzenszwalb et al. (2009)), and HeadHunter model (Mathias et al. (2014)).
Recently, learned features such as convolutional neural networks (CNNs) have been
widely used (Li et al. (2015), Yang et al. (2017), Ranjan et al. (2017)). In general,
CNN-learned features outperform hand-crafted features in object detection
accuracy, while hand-crafted features are more energy-efficient in hardware
implementations. Reference (Suleiman et al. (2017a)) compares
two chips: (Suleiman et al. (2016)) implements the
hand-crafted feature using HOG, and (Chen et al. (2016)) implements the learned
feature using CNN. Although learned features can reportedly achieve more than 2×
higher average precision (∼30 vs. ∼65), the accompanying energy consumption per pixel
becomes four orders of magnitude higher than that using HOG features. In this
work, we employ the HeadHunter model based on rigid templates (Mathias et al.
(2014)), which achieves state-of-the-art face detection accuracies on AFW (Zhu and
Ramanan (2012)), FDDB (Jain and Learned-Miller (2010)), and Pascal VOC (Ever-
ingham et al. (2011)) datasets compared to other works (Felzenszwalb et al. (2009),
Li et al. (2015)). Our ASIC accelerator is based on a strong multi-channel (six HOG
orientations, one gradient magnitude, and three LUV color channels), multi-scale model
with rigid boosted templates (Mathias et al. (2014)), which detects objects by
computing integrals over random rectangular
regions based on the trained models. We designed a 2,000-stage classifier, where the
number of stages used for classification can be adaptively controlled depending on
the content of the search window, and can be implemented with a more modular
hardware, compared to classification with SVM and DPM (Takagi et al. (2014), Jeon
et al. (2015), Suleiman and Sze (2016), Suleiman et al. (2017b)). Embodying these
unique features for comprehensive object detection, an integrated accelerator chip
was fabricated in 65nm CMOS to demonstrate real-time programmable object detec-
tion. Multi-class object detection is illustrated in Fig. 2.1, including the measurement
results (localized objects) from the prototype chip. Power consumption is further op-
timized through configurable search stride and re-use of integral computation results
for overlapping search windows.
This chapter is organized as follows. Chapter 2.2 explains the HeadHunter al-
gorithm in detail. We introduce the system architecture of the proposed hardware
accelerator including detailed features of main modules and hardware optimization
techniques in Chapter 2.3. Chapter 2.4 presents the proposed algorithm adaptations
that were employed to improve the hardware efficiency. The chip implementation and
evaluation results are described in Chapter 2.5. We conclude this chapter in Chapter
2.6.
Figure 2.1: Illustration of multi-class object detection (e.g., face, traffic sign, car
license plate, pedestrian) with 10 channels, 17 scales, 2000 weak classifiers, and
non-maximum suppression.
2.2 Overview of the Object Detection Algorithm (HeadHunter Model)
A HeadHunter model is proposed in (Mathias et al. (2014)) using a small set of
rigid templates (i.e., without deformable parts), which reported state-of-the-art face
detection accuracies on AFW (Zhu and Ramanan (2012)), FDDB (Jain and Learned-
Miller (2010)), and Pascal VOC (Everingham et al. (2011)) datasets. This model has
four main features: (1) multiple channels including 7 HOG channels and LUV
color channels, (2) an integral channel detector for fast feature computation,
(3) 2,000 Adaboost weak classifiers containing shallow boosted trees of depth two
(three stumps per tree), and (4) a combined set of rigid templates instead of a
single template per object category.
Figure 2.2: Multiple channel features with six HOGs, a gradient magnitude, and
LUV color space.
Figure 2.3: The concept of fast computation for an area sum using integral image.
1. Multi-Channel features: Fig. 2.2 shows the multiple channel features employed
in the HeadHunter algorithm, including LUV color channels and 7 HOG features
(1 gradient magnitude and 6 quantized orientations). Features are extracted
from the input image using integral pixel computation, as shown in Fig. 2.3.
(Mathias et al. (2014)) reported that the color channel information improves
detection accuracy compared to the case of only using HOG channels, since
certain objects (especially faces) have a discriminative color distribution. In
addition, (Dollár et al. (2009)) showed that LUV color channels yield better
accuracy than other color channels such as grayscale, RGB, and YUV.
2. Integral channel detector: The use of an integral image as summed area table
was first proposed in Viola-Jones algorithm (Viola et al. (2001)). This idea is
examined by the integral channel feature framework in (Dollár et al. (2009)).
Integral data at (x,y) represents the sum of all the pixels above and to the
left of (x,y). Any rectangular feature can then be computed very rapidly from this
intermediate representation of the image, as shown in Fig. 2.3.
3. Adaboost weak classifiers: A number of weak classifiers can be boosted to build
a strong classifier. In this work, we employ 2,000 Adaboost weak classifiers
for a robust system inspired by (Mathias et al. (2014)). Fig. 2.4 shows the
concept of the classifier operation. The 2,000 weak classifiers use pooling over
rectangular regions as features. Each weak classifier computes this pooling
operation and the 1st node compares with a given threshold to decide which
of the two 2nd nodes should be computed. Depending on the 2nd node result,
the weight corresponding to the classifier is either added or subtracted from the
final score. After computing 2,000 weak classifiers, the final score is compared
with a configurable threshold to determine if the search window has an object.
4. Rigid boosted templates: A rigid template approach can achieve high-speed
object detection, but lower detection accuracy, compared to DPM, which has a
high computational cost (Suleiman et al. (2016)). The HeadHunter model combines
a small set of rigid templates that are separately used to capture intra-class
diversity of objects, which can be boosted to build a strong detector. In our
proposed hardware accelerator, we can use up to five different templates due to
a limit on the on-chip memory size.
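The integral channel computation in item 2 can be illustrated with a minimal Python sketch. It is not the hardware implementation; the function names and the zero-padded table layout are our own assumptions for illustration. Once the summed-area table is built, the sum over any rectangle needs only four table lookups.

```python
def integral_image(channel):
    """Build a summed-area table with an extra leading row and column of zeros.

    table[y][x] holds the sum of channel[j][i] for all j < y and i < x.
    """
    height, width = len(channel), len(channel[0])
    table = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += channel[y][x]
            table[y + 1][x + 1] = table[y][x + 1] + row_sum
    return table


def rect_sum(table, x0, y0, x1, y1):
    """Sum of channel[y][x] for x0 <= x < x1 and y0 <= y < y1 (four lookups)."""
    return table[y1][x1] - table[y0][x1] - table[y1][x0] + table[y0][x0]


channel = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
table = integral_image(channel)
print(rect_sum(table, 1, 1, 3, 3))  # bottom-right 2x2 block: 5 + 6 + 8 + 9
```

The cost of `rect_sum` is independent of the rectangle size, which is why pooling over random rectangular regions becomes inexpensive once the table is available.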
Figure 2.4: Conceptual operation of 2,000 weak classifiers.
The training dataset employed for the face detector is the AFLW dataset (Koestinger
et al. (2011)), from which cropped faces are used as positive samples. For negative
samples, random images from the Pascal VOC dataset (Everingham et al. (2011))
that do not contain any person were used. The other training datasets, for traffic
signs, car license plates, and pedestrians, were collected and labeled by the authors
in a custom manner. During the training procedure, the object detection model
first randomly generates a large feature pool and selects the best weak classifier on
samples, and then increases the weight for difficult samples in each round. After
all the stages of the detector are generated, it further collects the difficult negative
samples to perform bootstrap training.
Each weak classifier contains a two-level decision tree for each of the five trained
models: one frontal object model, two side views and two mirrored models. The input
image is first scaled with scaling factors ranging from 0.2 to 3 to enable detection of
various sizes of objects. All five trained models are evaluated separately for a sliding
window that sweeps the entire image. The outputs of all weak classifiers are combined
and compared with a threshold to allocate the bounding box for an object along with
a score. The bounding boxes from all the scales are passed through a non-maximum
suppression (NMS) stage, which selects the box with the highest score and removes
other redundant overlapping ones. High-level pseudo-code of the object detection
algorithm that we implement and the modular hardware structure are shown in Fig.
2.5.
Figure 2.5: High-level pseudo-code of the overall object detection operation (left)
and corresponding modular nested structure on hardware (right).
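The depth-two weak classifier and the boosted score described above can be sketched in Python as follows. This is a simplified illustration under our own assumptions: the `Stump` and `WeakClassifier` types, the "greater than or equal" comparison direction, and the use of a precomputed feature list (standing in for rectangle sums from the integral channels) are hypothetical and are not taken from the trained model format.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Stump:
    feature: int       # index into the feature list (hypothetical encoding)
    threshold: float   # decision threshold for this node


@dataclass
class WeakClassifier:
    root: Stump    # 1st node: selects which 2nd node to evaluate
    left: Stump    # 2nd node used when the root test fails
    right: Stump   # 2nd node used when the root test passes
    weight: float  # contribution to the final score


def weak_vote(wc: WeakClassifier, features: Sequence[float]) -> float:
    """Return +weight or -weight depending on the selected 2nd node."""
    root_pass = features[wc.root.feature] >= wc.root.threshold
    node = wc.right if root_pass else wc.left
    leaf_pass = features[node.feature] >= node.threshold
    return wc.weight if leaf_pass else -wc.weight


def window_has_object(classifiers: List[WeakClassifier],
                      features: Sequence[float],
                      object_threshold: float) -> Tuple[bool, float]:
    """Sum all weak votes and compare with a configurable threshold."""
    score = sum(weak_vote(wc, features) for wc in classifiers)
    return score >= object_threshold, score
```

Only two of the three stumps are evaluated per weak classifier, since the 1st node decides which 2nd node is needed.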
2.3 Energy Efficient Hardware Architecture Based on Rigid Boosted Templates
2.3.1 Hardware Architecture and Operation
Fig. 2.6 shows the top-level block diagram and data flow for the model architecture
in Fig. 2.5. To achieve high accuracy, the classifier has five trained models, each with
2,000 weak classifiers, which can consume significant time and energy in the model
evaluation.
1. Scale function
We use the search window size of 80×80 pixels, and detect objects of various
sizes by scaling the input image from 0.4× to 2.0×, with a step size of 0.1×.
Bilinear interpolation is used to cover such a wide range of scales. Each
pixel in the scaled image is computed from four pixel values in the input image,
which are stored in the on-chip frame buffer. A 3×3 Gaussian smoothing filter
is applied on the scaled image using three line buffers. Note that we support
up-scaling up to 2.0× for robust detection, which increases the SRAM size to
186.5KB, a 3.7× increase compared to the case of only supporting down-scaling.
2. Channel generation
This method uses 10 feature maps consisting of seven HOG channels (1 gradient
magnitude and 6 quantized orientations) and LUV color space channels. The
quantized orientation of HOG is a weighted histogram where the gradient angle
and magnitude determine the bin index and the weight, respectively, as shown
in the following equation:
\[ Q_\theta(x, y) = G(x, y) \cdot \mathbf{1}[\Theta(x, y) = \theta], \tag{2.1} \]
where G(x, y) and Θ(x, y) are the gradient magnitude and quantized gradient
angle, respectively, at I(x, y) (Dollár et al. (2009)).
Figure 2.6: Top-level block diagram and the end-to-end data flow of the proposed object
detection accelerator.
Figure 2.7: Block diagram and data flow of classifier operation.
Piecewise linear approximation is used for complex non-linear computations
such as square and cube root. 7-bit precision is used for channel data. Channels
are then down-sampled by 4 and stored in SRAM blocks. To reduce the on-chip
memory size, we propose a compression method for six HOG features, such that
we reduce the number of SRAMs from 10 to 5 (details in Chapter 2.3.2).
Note that all processes such as generating, down-sampling, storing, and loading
for 10-channel feature data are executed in parallel.
3. Integral function
Integral images defined over the 10 channels are used for fast summation over
random rectangular pooling regions. A key concern of the integral function
scheme in terms of hardware implementation is that a huge memory is needed
to store integral data. For example, we need an SRAM size of 234.4KB for 8-bit
precision data of a QVGA (320×240) image to store the entire integral data. To
reduce the memory size, we propose that integration is performed over 12×10
windows and integral data are stored within a 160 (whole horizontal pixel)×32
region, instead of the entire 160×120 region (details in Chapter 2.3.2).
4. Classifier operation
Fig. 2.7 shows the block diagram and data flow of the classifier operation.
The trained data of five different templates, each with 2,000 weak classifiers
are stored in SRAM. One of 10 SRAMs that store 10-channel integral data is
selected by the channel information given by the trained data, which means
that the 10-channel integral data should be ready altogether and be accessible
from the 10 SRAMs. The two row data in the selected SRAM are loaded
according to the coordinate information from the trained data to use pooling
over rectangular regions as the feature. A Classifier Engine (CE) computes
the area of the rectangular region, and adds or subtracts weights according to
the results by comparing the area with a threshold value given by the trained
model. One hundred forty-one CE modules compute the weak classifier for all
horizontal search windows in parallel. After computing five rigid templates,
the classifier operation is iterated over different vertical locations. During the
detection process, all five templates are evaluated over each search window and
their results are combined using NMS.
5. NMS function
Multiple scales, sliding windows and five different templates result in a cluster
of detections around a single object. The NMS method is used to select the best
detection and remove the redundant ones. In this work, we set the maximum number
of detectable objects per image to 50, balancing the NMS computation time.
All detection results are sorted based on their scores. If the overlap between two
detections is greater than 0.3 (adopted from (Mathias et al. (2014))), the
lower-scoring detection is suppressed. After sorting the values from all scales and
templates, the post-NMS result is used as the final bounding box of the detected
object in the image.
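The NMS step in item 5 can be sketched as a greedy procedure in Python. The overlap threshold of 0.3 and the limit of 50 objects come from the text; the use of intersection-over-union as the overlap measure, the (x, y, width, height) box format, and the function names are assumptions made for illustration only.

```python
def overlap_ratio(a, b):
    """Intersection over union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = min(ax + aw, bx + bw) - max(ax, bx)
    inter_h = min(ay + ah, by + bh) - max(ay, by)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    return inter / (aw * ah + bw * bh - inter)


def non_max_suppression(detections, overlap_threshold=0.3, max_objects=50):
    """Greedy NMS over (score, box) pairs collected from all scales and templates.

    Detections are visited in order of decreasing score; a detection is kept
    only if it does not overlap any already kept box by more than the threshold.
    """
    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        if len(kept) == max_objects:
            break
        if all(overlap_ratio(box, k[1]) <= overlap_threshold for k in kept):
            kept.append((score, box))
    return kept
```

Capping the number of kept boxes bounds the number of pairwise overlap tests, which is consistent with the stated goal of balancing the NMS computation time.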
Figure 2.8: Illustration of the down-sampling and storage of generated channel data
in (a) the baseline scheme and (b) the proposed scheme.
2.3.2 Hardware Optimization Techniques
We propose an adaptive pooling scheme for the down-sampling by 4 after
channel generation in order to reduce SRAM size. The baseline algorithm (Mathias
et al. (2014)) adopted 4×4 average pooling for the down-sampling, and 10 channels are
stored into SRAM as shown in Fig. 2.8(a). As illustrated in Fig. 2.8(b), we propose
a compression technique for six HOG values for the accelerator. Based on (2.1), the
six HOG channel values are the gradient magnitude value or zero according to the
quantized gradient angle. In other words, one of six HOGs is non-zero while the
other five HOGs are zero at the same pixel location. Based on this HOG property,
the six HOG values can be replaced by the index value indicating the non-zero HOG
channel after down-sampling, following the computation in (2.2):
\[ \mathrm{Index}(x, y) = \underset{j}{\arg\max} \left[ \sum_{x, y = 1}^{4} \mathrm{HOG}_j(x, y), \; j = 1 \ldots 6 \right], \tag{2.2} \]
The other four channels are down-sampled by 4 with average pooling. The index
value and the data of four channels are then stored in SRAM. This reduces the SRAM
size for storing channel data by ∼2× without any degradation of accuracy. The data
of 6 HOG channels can be reproduced through the decoder with the index value from
SRAM, as described in (2.3):
\[ \mathrm{HOG}_j(x, y) = G(x, y) \cdot \mathbf{1}[\mathrm{Index}(x, y) = j], \tag{2.3} \]
where G(x, y) and Index(x, y) are the gradient magnitude and the index value,
respectively, at I(x, y).
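The index-based compression of the six HOG channels can be illustrated with a short Python sketch. It mirrors the block-wise argmax and the decoder described above, but the function names, the list-of-lists channel layout, and the tie-breaking rule (first maximum wins) are our own assumptions rather than details of the hardware.

```python
def block_sum(channel, bx, by, size=4):
    """Sum of one size x size block of a channel; (bx, by) is the block index."""
    return sum(channel[y][x]
               for y in range(by * size, (by + 1) * size)
               for x in range(bx * size, (bx + 1) * size))


def encode_block(hog_channels, bx, by, size=4):
    """Return the 1-based index of the orientation channel with the largest block sum."""
    sums = [block_sum(ch, bx, by, size) for ch in hog_channels]
    return max(range(len(sums)), key=sums.__getitem__) + 1


def decode_block(index, magnitude, num_orientations=6):
    """Rebuild six HOG values: only the indexed channel carries the magnitude."""
    return [magnitude if j == index else 0
            for j in range(1, num_orientations + 1)]
```

With this encoding, one index plus the gradient magnitude replaces six stored HOG values per down-sampled location, which is the source of the reduction in the number of channel SRAMs.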
In addition, to reduce the number of bits in the integral data, integration is
performed over 12×10 windows. When pooling over a 20×20 window, the offset
from the previous integral window is added to get the correct result. An example is
illustrated in Fig. 2.9. We can obtain the correct integral data at location 4 with
three appropriate offset values at location 1, 2, and 3. The values of the integral
image at location 1, 2, 3, and 4 are the sum of the pixels in rectangle A, B, C, and
D, respectively. The correct integral data at location 4 for 20×20 window can be
computed as A+B+C+D. By using a window size of 12×10 for generating integral
channel data, the number of bits used for integral data is reduced to 14 bits (22 bits
are required when integrating over the entire image), reducing the SRAM size by
36%.
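The quoted bit widths can be checked with a short worst-case calculation in Python. This is a sketch under assumptions that are ours: 7-bit channel data (stated earlier for channel precision), a 160×120 channel map as the full integration region, and the 12×10 window expressed in the same units. Under these assumptions the arithmetic reproduces the 22-bit and 14-bit figures and the 36% saving.

```python
def bits_needed(max_value):
    """Number of bits required to represent unsigned integers up to max_value."""
    return max(1, max_value.bit_length())


CHANNEL_BITS = 7                      # channel precision stated in the text
MAX_PIXEL = (1 << CHANNEL_BITS) - 1   # largest 7-bit value, 127

# Worst case: every pixel in the integration region holds the maximum value.
full_map_bits = bits_needed(MAX_PIXEL * 160 * 120)  # assumed full channel map
windowed_bits = bits_needed(MAX_PIXEL * 12 * 10)    # 12x10 integration window

saving = 1 - windowed_bits / full_map_bits
print(full_map_bits, windowed_bits, round(saving * 100))
```

Because the word width of the integral SRAM scales with the number of bits per entry, the relative saving is simply 1 - 14/22.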
Figure 2.9: Illustration of obtaining correct integral data over 20×20 window with
12×10 window of integration.
Furthermore, a pre-processing step for the NMS function was introduced. There
are 17 scales to process and each scale has a very large number of search windows
that produce object detection results. To alleviate the large memory requirement
to store so many results, while sliding the search window in each scaled image,
we directly remove redundant boxes of the detected object within specific ranges as
shown in Fig. 2.10. This reduces the computation time and SRAM size for NMS
function by 14-89× depending on the pixel stride (1-3). To simplify the computation,
we fixed the overlap ratio threshold within each scaled image, called the intra-scale
overlap threshold, to a value (0.25) that minimally degrades AP based on our
experimental results. On the other hand, after completing the pre-processing of NMS
for the entire 17 scales, we perform NMS function to remove overlapping detection
boxes with a configurable inter-scale overlap threshold parameter.
Figure 2.10: Pre-processing step for the NMS function. The largest
detection result is stored in local registers while sliding the search window within
30×30 pixels, such that 120 detection results that have overlap greater than 0.25
are suppressed.
Figure 2.11: Data re-use and parallel computing scheme for multiple adjacent search
windows.
Finally, instead of computing different weak classifiers in parallel, we compute a
single weak classifier across multiple windows in parallel. As shown in Fig. 2.11,
this re-uses data that overlap among adjacent search windows, reducing the
number of memory accesses by 77× on average for 17 scales.
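The window-parallel evaluation can be sketched in Python as follows. The sketch assumes, for illustration only, that the 80×80 search window corresponds to 20 columns after down-sampling by 4 and that an integral row spans 160 columns, which gives 160 - 20 + 1 = 141 horizontal window positions and matches the number of CE modules. The function name and the zero-padded row layout are hypothetical.

```python
def feature_across_windows(top_row, bottom_row, rx0, rx1, window_width):
    """Evaluate one rectangle feature for every horizontal window position.

    top_row and bottom_row are the two integral-image rows at the rectangle's
    top and bottom edges, each with a leading zero column. rx0 and rx1 are the
    rectangle's left and right edges relative to the window origin. The same
    two rows are shared by all window positions, so they are loaded only once.
    """
    row_width = len(top_row) - 1
    positions = row_width - window_width + 1
    return [
        bottom_row[wx + rx1] - bottom_row[wx + rx0]
        - top_row[wx + rx1] + top_row[wx + rx0]
        for wx in range(positions)
    ]


# Example: an all-ones channel, rectangle covering rows 0..2 and columns 2..6.
top = [0] * 161
bottom = [3 * x for x in range(161)]
sums = feature_across_windows(top, bottom, rx0=2, rx1=7, window_width=20)
print(len(sums), sums[0])
```

Each list element corresponds to the work of one CE module; in hardware these evaluations proceed in parallel, whereas the sketch computes them sequentially.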
2.4 Algorithm Adaptations for Hardware Efficiency
As described in Chapter 2.2, the HeadHunter model based on a set of rigid templates
with Adaboost weak classifiers can be implemented with more modular hardware.
Figure 2.12: Weight reordering and adaptive classification. If the intermediate sum
is larger than the upper threshold (left) or smaller than the lower threshold (right), the
remaining classifier operations are skipped. Otherwise, 2000 classifiers are computed
(middle).
We employ five rigid templates in our hardware accelerator and have five trained
models for face detection. On the other hand, we only have one trained model for
other object classes, such as traffic sign, car license plate, and pedestrian. We pro-
pose a multi-class object detection method using five rigid templates. When using
five different types of trained models for different object classes through five rigid
templates, we can detect up to five different object classes simultaneously. Since we
can use five different rigid templates for different types of object classes instead of
using a set of rigid templates for single object class, the proposed method can detect
multiple object classes at the same time without any hardware redundancy, in con-
trast to (Takagi et al. (2014), Suleiman et al. (2017b)). The architectures in (Takagi
et al. (2014), Suleiman et al. (2017b)) have two classifier engines to detect two object
classes. In this work, since we have only four types of trained models for face, traffic
sign, car license plate, and pedestrian, we are capable of detecting four object classes
at the same time.
In addition, we employ 2,000 Adaboost weak classifiers to build a strong classifier
similar to (Mathias et al. (2014)). The experimental results in (Mathias et al. (2014))
showed that 83.35% and 85.57% average precision were obtained with 200 weak
classifiers and 2,000 weak classifiers, respectively. To reduce the computation load
from the large number of weak classifiers with little degradation in the detection
accuracy, we propose two efficient techniques: adaptive cascading and weight re-ordering,
as shown in Fig. 2.12. Adaptive classifier cascading is proposed to dynamically scale
the amount of classifier computation based on input images. After a configurable
subset of the 2,000 classifiers (e.g., 400 as shown in Fig. 2.12), we check
whether the intermediate sum is higher than a conservative upper threshold or lower
than a lower threshold value, in which case the true or false object detection result is
determined without going through all 2,000 classifiers. After going through a subset of
classifiers, if the intermediate result in a search window is strongly positive or negative
compared to the object threshold, the remaining classifier operations are skipped. In
weight re-ordering, based on our proposed adaptive classifier cascading scheme, the
weak classifiers with higher weight values are computed first. This helps the inter-
mediate result to reach a strongly positive or negative value earlier, and therefore
we can expedite the detection of an object. The proposed techniques achieved 5.5×
speed-up while having less than 1% degradation in the average precision.
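Adaptive cascading with weight re-ordering can be sketched in Python as follows. The sketch is illustrative only: representing a weak classifier as a (weight, vote function) pair, checking the intermediate sum at every multiple of `check_every`, and all names are our own assumptions, not the implemented control logic.

```python
def reorder_by_weight(classifiers):
    """Place weak classifiers with larger weight magnitudes first.

    classifiers: list of (weight, vote_fn) pairs, where vote_fn(window)
    returns +1 or -1.
    """
    return sorted(classifiers, key=lambda c: abs(c[0]), reverse=True)


def cascaded_score(classifiers, window, check_every, upper, lower):
    """Accumulate weighted votes and stop early at a checkpoint if decisive.

    Returns (score, number_of_classifiers_evaluated).
    """
    score = 0.0
    for count, (weight, vote) in enumerate(classifiers, start=1):
        score += weight * vote(window)
        at_checkpoint = count % check_every == 0
        if at_checkpoint and (score > upper or score < lower):
            return score, count  # strongly positive or negative: skip the rest
    return score, len(classifiers)
```

Evaluating high-weight classifiers first lets the intermediate sum cross one of the two thresholds at an earlier checkpoint, which is where the speed-up originates.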
Furthermore, we employed a number of configurable parameters in the algo-
rithm and the implemented hardware, in order to show the trade-offs of perfor-
mance/accuracy and power. These include (1) the number of different scales (up
to 17) and various scale factors (0.4× to 2.0× with as low as 0.1× step), (2) pro-
grammable horizontal and vertical stride (1-3 pixels) for the sliding search window,
(3) threshold for object classification, and (4) variable inter-scale overlap ratio for
NMS (0.25-0.55).
2.5 65nm Implementation Results
The proposed ASIC accelerator was implemented in 65nm CMOS. The chip mi-
crograph is shown in Fig. 2.13, where the total area is 3.1×2.8 mm2, including the
input image buffer. Fig. 2.14 shows the output of the prototype chip that demon-
strates multi-scale multi-object detection for face, traffic sign, car license plate and
pedestrian detection, where bounding boxes (measured chip outputs) are drawn on
top of the input image to localize the detected objects. The chip specifications are
summarized in Table 2.1.
Fig. 2.15 shows the prototype chip measurement environment and system that
was used to evaluate real-time object detection. It is composed of the custom PCB
that mounts the 65nm prototype chip, an FPGA board, an HDMI interface board, and
an LCD display. Our prototype chip performs end-to-end object detection, where it
takes input video data and outputs the video data enclosing each detected object with
a final bounding box. All the image processing and computations for object detection
are done in the prototype chip. We only use the FPGA board to configure the chip
and read information of detected objects such as coordinates and scores to evaluate the
accuracy.
Figure 2.13: 65nm prototype chip micrograph.
Figure 2.14: Chip measurement results of multi-scale multi-object detection on face
and traffic sign images.
Note that we down-sampled higher-resolution images (up to full HD 1920×1080)
to QVGA (320×240) to store an entire input frame in the on-chip frame
buffer instead of using external storage such as DRAM. Then, the on-chip QVGA input
frame buffer was used to scale images on-the-fly for 17 scales and iteratively compute
the same sliding window. In other words, our chip demonstrated object detection
for full HD resolution images with down-sampling as a pre-processing step. Since a
down-sampled pixel is only read in the pre-processing step while the full HD video
is transmitted in a row raster scan order, no extra process such as interpolation is
required. An alternative would be to use a single-size image and scale the sliding
window for a number of scales. This method will have a smaller on-chip frame buffer,
but will require a larger memory for trained models that increases with the number
of scale factors. Performing a fine-grain search on a lower-resolution image is more
favorable than a coarse-grain search on a high-resolution image, due to the reduction
in image sensor power and data communication.
Table 2.1: Chip Specifications
Technology         65nm CMOS
Chip size          3.6×3.3 mm²
Core size          3.1×2.8 mm²
SRAM               339.9 KB
Frame buffer       225 KB (SRAM)
Input resolution   1920×1080
Supply voltage     0.58 - 1.1 V
Clock frequency    100 - 250 MHz
Frame rate         20 - 50 fps
Power              22.5 - 181.7 mW
Energy             0.54 - 1.75 nJ/pixel
To characterize the object detection accuracy, performance, and power consump-
tion, we used the AFW and FDDB database (Zhu and Ramanan (2012), Jain and
Learned-Miller (2010)) for face detection, the BTSD database (Timofte et al. (2014))
for traffic sign detection, Caltech database (Caltech (2001)) for car license plate de-
tection, and INRIA database (INRIA (2005)) for pedestrian detection. The measured
chip performance (frames per second) and total/leakage power with dynamic voltage
scaling are shown in Fig. 2.16. Full object detection functionality was verified down
to 0.58V, where the chip performs real-time detection at 20.1 fps with 22.5mW power.
In Table 2.2, the power breakdown in logic and memory at the nominal voltage of
1.0V is detailed for four different chip configurations, where the number of scales, pixel
stride, and the classification stage at which the intermediate sum is checked for
adaptive cascading are varied. With regard to voltage scaling, the power/energy values
in Tables 2.2 and 2.3 also scale down in a manner similar to that reported in Fig. 2.16.
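The relationship between power, frame rate, and energy per pixel can be illustrated with a short Python calculation. Two assumptions are ours and are not stated explicitly in the text: the energy is normalized to the full HD input pixel count (1920×1080), and the 181.7mW upper power figure is paired with the 50 fps upper frame rate. Under these assumptions the calculation reproduces the quoted 0.54 and 1.75 nJ/pixel values.

```python
def energy_per_pixel_nj(power_mw, frames_per_second, pixels_per_frame):
    """Energy per pixel in nanojoules from power (mW) and frame rate (fps)."""
    joules_per_frame = (power_mw * 1e-3) / frames_per_second
    return joules_per_frame / pixels_per_frame * 1e9


FULL_HD_PIXELS = 1920 * 1080  # assumed normalization: full HD input frame

low = energy_per_pixel_nj(22.5, 20.1, FULL_HD_PIXELS)    # 0.58V operating point
high = energy_per_pixel_nj(181.7, 50.0, FULL_HD_PIXELS)  # assumed pairing
print(round(low, 2), round(high, 2))
```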
Figure 2.15: System test environment.
Fig. 2.17 shows the precision versus recall (PR) curves (Davis and Goadrich
(2006)) of the prototype chip measured for AFW, FDDB, BTSD, Caltech car license
plate, and INRIA person datasets. The average precision (AP) can be computed as
the area under the PR curve. We achieved AP of 0.876 and 0.806 for the FDDB and
AFW datasets for face detection, respectively. For traffic sign detection, we achieved
AP of 0.72 for the BTSD dataset. We achieved AP of 0.763 for the Caltech dataset
for car license plate detection. For pedestrian detection, we achieved AP of 0.541 for
the INRIA dataset. Since our proposed system supports input image resolution up to
full HD (1920×1080) with down-sampling into QVGA (320×240) as a pre-processing
step, images that are over full HD size in the AFW and BTSD datasets are cropped
to full HD size. However, note that we used the original annotation data of the AFW and
BTSD datasets in our AP measurements. In other words, objects that were truncated
or lost when the images were cropped, and were therefore not detected, are counted
as false negatives.
Figure 2.16: Measured frame rate and total/leakage power with voltage scaling.
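As a minimal sketch of the metrics used in this evaluation, the following Python snippet computes precision and recall from detection counts and approximates AP as the area under the PR curve. The trapezoidal integration rule, the function names, and the example counts are illustrative assumptions; the integration rule used for the reported measurements is not stated here.

```python
def precision(tp, fp):
    """Fraction of reported detections that are correct."""
    return tp / (tp + fp) if tp + fp else 1.0


def recall(tp, fn):
    """Fraction of annotated objects that were detected."""
    return tp / (tp + fn) if tp + fn else 0.0


def average_precision(points):
    """Area under the PR curve (trapezoidal rule, an assumed choice).

    points: iterable of (recall, precision) pairs.
    """
    pts = sorted(points)
    area = 0.0
    for (r0, p0), (r1, p1) in zip(pts, pts[1:]):
        area += (r1 - r0) * (p0 + p1) / 2
    return area


# Hypothetical (tp, fp, fn) counts at three detection thresholds for
# 10 annotated objects. Objects lost by cropping stay in the annotation,
# so they show up as false negatives and lower the recall.
counts = [(2, 0, 8), (6, 2, 4), (8, 8, 2)]
curve = [(0.0, 1.0)] + [(recall(tp, fn), precision(tp, fp)) for tp, fp, fn in counts]
ap = average_precision(curve)
print(f"AP = {ap:.3f}")
```

Benchmark toolkits often use interpolated variants of this area, so absolute AP values from this sketch are not directly comparable with published numbers.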
Fig. 2.18 shows the AP values for the FDDB dataset with various numbers of stages
when using our proposed adaptive classifier cascading method. We achieved an AP of
0.862 with 200 stages in the adaptive cascading scheme, which is only a 0.85% degradation
in average precision compared to the AP of 0.869 with 2,000 stages (i.e.,
without the adaptive cascading scheme). Note that this AP degradation represents a
∼3× reduction (0.85% vs. 2.22%) compared to the experimental results of (Mathias
et al. (2014)). In addition, the measured AP versus the number of scale factors is
shown in Fig. 2.19. With nine scale factors from 0.4× to 2.0× in steps of 0.2×, we
achieved an AP of 0.862, similar to that obtained with all 17 scale factors. With six
scale factors, we achieved an AP of 0.843 on the FDDB dataset, a small degradation.
However, the AP decreased somewhat with five scale factors, and it deteriorated
significantly with four scale factors that use only down-scaling. Table 2.3
summarizes the measured delay time, power, and energy for different numbers of
weak classifier stages in our proposed adaptive cascading technique.
Compared to our system without adaptive cascading, the proposed adaptive
cascading method with 200 weak classifier stages reduced the total system delay time
by 5.5× and achieved a 16.3% power reduction. Overall, our proposed accelerator with
the adaptive cascading method reduces the system energy consumption by 6.6×.
Table 2.2: Power Breakdown with Various Configurations
                                   Config1         Config2        Config3        Config4
Number of scales                   17 (0.4-2.0×)   8 (0.4-2.0×)   8 (0.4-1.5×)   8 (0.4-1.5×)
Pixel stride¹                      1, 2, 3         1, 2, 3        1, 2, 3        Max
Adaptive stage²                    500             500            400            400
Logic power³ (a)/(b) (mW) @ 1.0V   20/120          20/110         20/93          20/85.5
SRAM power⁴ (a)/(b) (mW) @ 1.0V    20/48           20/43          20/37          20/34
Total power (a)/(b) (mW) @ 1.0V    215             193            170            159.5
Frame rate (fps)                   10.7            22.7           30.3           39.5
¹ (1, 2, 3): pixel stride pre-configured from 1 to 3 based on scale (small→large);
  (Max): pixel stride is 2 for horizontal, 3 for vertical
² Classification stage at which the sum is compared with the upper/lower threshold
³,⁴ (a): pre-processing of image, (b): integral and classification processing
Table 2.3: Delay Time, Power, Energy Versus Different Number of Stages in Adaptive Cascading
Stages               200     300     500     1000    1500    2000
Delay time (ms)      22.9    28.5    39.8    63.6    98.3    127.1
Power (mW) @ 1.0V    156.6   170.5   172.6   184.3   185.2   187.1
Energy (nJ/pixel)    1.73    2.34    3.31    5.65    8.78    11.47
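The energy row of Table 2.3 can be cross-checked from the delay and power rows. The short Python sketch below assumes that energy per pixel is power times delay divided by the pixel count of one full HD frame (1920×1080, the input resolution in Table 2.1); the variable and function names are illustrative.

```python
# Values copied from Table 2.3 (power measured at 1.0 V).
STAGES = [200, 300, 500, 1000, 1500, 2000]
DELAY_MS = [22.9, 28.5, 39.8, 63.6, 98.3, 127.1]
POWER_MW = [156.6, 170.5, 172.6, 184.3, 185.2, 187.1]

# Assumption: energy is normalized to one full HD input frame.
PIXELS_PER_FRAME = 1920 * 1080


def energy_nj_per_pixel(power_mw, delay_ms, pixels=PIXELS_PER_FRAME):
    """Energy per pixel in nJ; mW * ms gives microjoules per frame."""
    energy_uj = power_mw * delay_ms
    return energy_uj * 1e3 / pixels


energies = [energy_nj_per_pixel(p, d) for p, d in zip(POWER_MW, DELAY_MS)]

speedup = DELAY_MS[-1] / DELAY_MS[0]           # 2,000 stages vs. 200 stages
power_saving = 1 - POWER_MW[0] / POWER_MW[-1]  # relative power reduction
energy_gain = energies[-1] / energies[0]       # energy per pixel ratio

for n, e in zip(STAGES, energies):
    print(f"{n:5d} stages: {e:5.2f} nJ/pixel")
print(f"delay {speedup:.2f}x, power -{power_saving:.1%}, energy {energy_gain:.2f}x")
```

Under this assumption the computed energies agree with the table to two decimal places, and the ratios are consistent with the reported 5.5× delay, 16.3% power, and 6.6× energy improvements.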
Furthermore, through 2× up-scaling, our design can detect objects as small as
40×40 pixels, which is much smaller than the detectable objects in previous works
(Takagi et al. (2014), Jeon et al. (2015), Suleiman and Sze (2016), Suleiman et al.
(2017b)). Fig. 2.20 compares the design using only down-scaling (0.4-1.0×) with the
design using both up-scaling and down-scaling (0.4-2.0×). Up-scaling improves the AP
significantly at the expense of a moderate increase in memory and power.
Fig. 2.21 shows the area and measured power breakdown of the prototype chip.
63% of the total chip area is occupied by on-chip SRAM arrays, due to the requirement
to store the trained models, integral data, input image frame buffer, etc. On the other
hand, 66% of the total chip power was consumed by logic components due to high
activity factors, where the power of the classifiers (56% of chip power) dominated.
Table 2.4 shows the comparison with object detection accelerators based on hand-crafted
features (Takagi et al. (2014), Jeon et al. (2015), Suleiman and Sze (2016),
Suleiman et al. (2017b)). The architecture in (Jeon et al. (2015)) achieved low power
consumption similar to this work, but its energy per pixel is much higher
than that of this work due to the lower image resolution and frame rate. The implementation
in (Suleiman and Sze (2016)) achieved low energy per pixel with a high
frame rate, but these are post-layout results. In addition, the accuracy in (Suleiman
and Sze (2016)) is lower than that of this work. Two object detection accelerators are
presented in (Takagi et al. (2014)) and (Suleiman et al. (2017b)). Both accelerators
process full HD videos in real time and support multiple-object detection, similar to
this work. However, our proposed accelerator employs color-based LUV channels and
fine-grained up-scaling, which increase the detection accuracy and robustness, while
achieving 60% and 42.5% energy/pixel reduction compared to (Takagi et al. (2014))
and (Suleiman et al. (2017b)), respectively. Note that our work evaluated AP across
multiple datasets for multiple object classes, more than any prior work (Takagi
et al. (2014), Jeon et al. (2015), Suleiman and Sze (2016), Suleiman et al. (2017b)).
Figure 2.17: Measured precision¹ versus recall² curve with multiple object classes.
¹Precision: (true positive) / (true positive + false positive)
²Recall: (true positive) / (true positive + false negative)
Figure 2.18: Precision-recall curves on the FDDB datasets for the different number
of weak classifiers in our proposed adaptive classifier cascading.
In addition, our proposed accelerator is compared with object detection accelerators
based on CNN-learned features (Lee et al. (2016), Yin et al. (2017), Chen et al. (2018c)).
Note that we calculated the energy per pixel numbers from the energy efficiency
numbers reported in those works. The reference (Lee et al. (2016)) proposed an advanced
driver-assistance system (ADAS) processor that achieved 0.862 TOPS/W with a 4-layer
recurrent neural network (RNN) connected to a fuzzy inference system (FIS), but
its energy per pixel is 28× higher than that of our accelerator. Two CNN processors
for object detection are presented in (Yin et al. (2017), Chen et al. (2018c)). Both
processors implemented the YOLO CNN (Redmon and Farhadi (2017)), which is a
representative end-to-end object detection CNN model. As a reconfigurable hybrid-NN
processor, Thinker (Yin et al. (2017)) achieved 1.26 TOPS/W for YOLO V2, but the
energy consumption per pixel of our work is 216× lower than that of Thinker. The
CNN design in (Chen et al. (2018c)) achieved a high energy efficiency of 2.2 TOPS/W
and a good accuracy of 0.6 mAP for the VOC 2007 and VOC 2012 (Everingham et al.
(2011)) datasets. However, due to the lower image resolution (416×416), its energy
per pixel is 143× and 26× higher than that of our proposed work.
Figure 2.19: Precision-recall curves on the FDDB datasets for the various number
of scale factors.
2.6 Conclusion
In this chapter, we presented a 65nm accelerator for real-time programmable object
detection. The accelerator employs the HeadHunter model based on a set of five
rigid templates with 2,000 Adaboost weak classifiers. A large number of classifiers is
used to make a strong object classification, and an adaptive cascade is realized for
dynamic computation scaling. High average precision of 0.88, 0.81, 0.76, 0.72 and 0.54
was achieved on the FDDB, AFW, Caltech car plate, BTSD, and INRIA person datasets,
respectively, by using integral channel features on 7 HOG and 3 LUV channels, 17
scale factors with 6 down-scaling and 11 up-scaling, configurable thresholding, adaptive
cascading classification, and optimal non-maximum suppression. The accelerator
achieved 0.54/1.75 nJ/pixel while consuming 22.5/181.7 mW at 0.58/1.1V with 20/50
fps in full HD videos, respectively. The hardware optimization techniques reduced the
on-chip SRAM size by 2.9× overall. Our proposed adaptive classifier cascading method
achieved an overall 6.6× energy per pixel reduction. The capability of programmable
and voltage-/performance-scalable many-object detection will enhance smart vision
processors in ubiquitous mobile systems.
Figure 2.20: (a) Design comparison using up-/down-scaling. (b) Measurement results
show that smaller faces can be detected through up-scaling.
Figure 2.21: (a) Area and (b) power breakdown of the overall system.
Table 2.4: Comparison to Prior ASIC Works on Object Detection
Based on hand-crafted features:
                    Takagi, 2014    Jeon, 2015                Suleiman, 2016    Suleiman, 2017
CMOS Tech.          65nm            40nm                      45nm SOI          65nm
Chip size (mm²)     3.3×1.2         2.58×2.27                 2.8×0.96          3.58×3.58
Image resolution    Full HD         HD                        Full HD           Full HD
Channel Feature     9 HOG           1                         9 HOG             9 HOG
# of scales         single          single                    12 (all down)     12 (all down)
Classifier          SVM             22-stage cascade          SVM               SVM DPM
Object classes      2               1                         1                 2
Accuracy (AP)       F1=95% (GTI)    F1=93% (custom dataset)   0.37 (INRIA)      0.26 (VOC2007)
Frame rate (fps)    30              5.5                       60                30-60
Power (mW)          84 (@0.7V)      23 (@0.6V)                45.3 (@0.72V)     58.6-216.5 (@0.77-1.11V)
Energy (nJ/pixel)   1.35            4.5                       0.36              0.94-1.74
Based on CNN-learned features, and this work:
                    Lee, 2016      Yin, 2017    Chen, 2018              This work
CMOS Tech.          65nm           65nm         55nm                    65nm
Chip size (mm²)     4×4            3.8×3.8      3.3×3.1                 3.1×2.8
Image resolution    224×224        448×448      416×416                 Full HD
Channel Feature     RNN-FIS        Yolo V2      Yolo V2/tiny            7 HOG + 3 LUV
# of scales         single         single       single                  17 (6 down, 11 up)
Classifier          RNN            CNN          CNN                     2000-stage cascade
Object classes      -              -            -                       4
Accuracy (AP)       -              -            0.596 (VOC2007/2012)    0.88 (FDDB), 0.81 (AFW), 0.72 (BTSD),
                                                                        0.76 (Caltech), 0.54 (INRIA)
Frame rate (fps)    30             12.05        5.5/27.7                20-50
Power (mW)          330 (@1.2V)    280          68 (@1.1V)              22.5-181.7 (@0.58-1.1V)
Energy (nJ/pixel)   15.3           116.6        77.37/14.2              0.54-1.75
Chapter 3
PRECISION-CASCADING BASED HARDWARE ACCELERATOR FOR DEEP
CONVOLUTIONAL NEURAL NETWORK
This chapter proposes an energy-efficient hardware implementation of a deep convolutional
neural network (DCNN) accelerator. We investigated efficient custom
hardware chip design for state-of-the-art CNN algorithms, which are very accurate
but require up to hundreds of megabytes of data storage and billions of operations
for a single inference pass. To reduce computation without accuracy degradation, we
propose an energy-efficient CNN accelerator based on a novel conditional computing
scheme, called precision-cascading (PC), which integrates convolution with the
subsequent max-pooling operation. In particular, we divide the input features into
groups of precision values and first perform approximate convolution computations
with only the most significant bits (MSBs) of the feature data. Based on this approximate
computation, we find the maximum value for the pooling output, and if
the maximum value cannot be found, the approximate convolutions are computed
in a cascaded manner. Then the full-precision convolution is performed only on the
maximum pooling output that is found. This way, the total number of bit-wise convolutions
can be reduced by ∼2×, without affecting the output feature values and
with < 0.8% degradation in final classification accuracy. In addition, we have been
developing an optimized dataflow that exploits sparsity, maximizes data re-use, and
minimizes off-chip DRAM access, which can improve upon existing hardware works
(Chen et al. (2016), Albericio et al. (2016)). The total DRAM access can be reduced by
2.12× when applying our proposed energy-efficient dataflow. Preliminary results of
the proposed DCNN accelerator show a peak of 8.88 TOPS/W for VGG-16 in 40nm
CMOS technology, excluding external DRAM access.
3.1 Introduction
While deep learning algorithms have been successfully used in many practical
applications, accurate, fast, and low-power hardware implementations of such algorithms
remain a challenge, especially for mobile systems such as Internet
of Things (IoT), autonomous vehicles, and smart drones. Hardware designs using
general-purpose processors such as CPUs and GPUs do not provide satisfactory en-
ergy efficiency. This is due to (1) high computational complexity that varies with
algorithms and (2) the large memory/communication requirement independent of in-
put, which generates significant data movement that can be as energy consuming
as computation. Therefore, many prior works have focused on flexible application-
specific integrated circuits (ASICs) to address these challenges. It is crucial to design
a computing scheme that can support high parallelism and optimize the data flow
and communication. The underlying hardware needs to be reconfigurable to support
various models and should reduce the amount of computation and data movement
depending on the input data. To further improve energy efficiency, an optimized memory
hierarchy and data sparsity statistics can also be exploited.
Previous work has proposed special-purpose ASICs for deep convolutional neural
network (DCNN) accelerators (Chen et al. (2016), Moons et al. (2017), Lee et al.
(2019)). Eyeriss (Chen et al. (2016)) proposed a spatial architecture with a row-stationary
dataflow to minimize the data movement energy cost for any CNN shape and
employed run-length compression that exploits the statistics of zero data. However,
the accelerator in (Chen et al. (2016)) achieved 245.6 GOPS/W on VGG-16,
which is much lower than the state-of-the-art architectures (Moons et al. (2017), Lee
et al. (2019)). Envision (Moons et al. (2017)) proposed a dynamic-voltage-accuracy-frequency
scaling scheme and a dynamic precision technique with body-bias
modulation. LNPU (Lee et al. (2019)) proposed fine-grained mixed precision and zero
skipping with sparse encoding based on run-length compression. The architectures in
(Moons et al. (2017)) and (Lee et al. (2019)) achieved high energy efficiency of 2 TOPS/W
and 5.84 TFLOPS/W on VGG-16, respectively. However, neither work (Moons
et al. (2017), Lee et al. (2019)) evaluated the off-chip memory energy cost of its
implementation.
In this chapter, we have implemented a DCNN accelerator that supports high-throughput
DCNN inference and optimizes the energy efficiency of the entire system,
including the accelerator chip and off-chip memory. It is also reconfigurable to
handle different DCNN shapes. The main features of our proposed accelerator are as
follows.
• A novel conditional computing scheme, called precision-cascading (PC), that
reduces many redundant convolution operations by integrating the subsequent
max-pooling operation. Combining the zero-skipping (ZS) scheme with the PC scheme
further reduces the convolution operations in an orthogonal way by skipping the
computation of the large number of zero input features.
• A special architecture consisting of a 3 × 3 array of main processing elements (PEs)
that can fully skip both the computation and the read of the large number of zero
activations.
• A convolution loop acceleration strategy that minimizes the computing latency and
the number of accesses to on-chip as well as off-chip memory.
• An architecture using an array of 324 PEs that can support various kernel sizes
such as 3 × 3, 5 × 5, and 7 × 7.
Figure 3.1: Precision-cascading multiplication of input feature by kernel feature.
The performance of our proposed accelerator, including both the chip energy
efficiency and required off-chip memory accesses, is benchmarked with VGG-16 (Si-
monyan and Zisserman (2014)).
3.2 Overview of the Proposed Conditional Computing Scheme
3.2.1 Precision-Cascading (PC) Scheme
Figure 3.2: The conceptual operation of precision-cascading scheme.
Typically, a deep convolutional neural network includes pooling layers in between
successive convolutional layers, which progressively reduce the spatial
size of the representation and thereby the number of parameters and the amount of
computation in the network. The most common form is a pooling layer with filters of
size 2×2 applied with a stride of 2, which downsamples every depth slice in the input
by 2 along both width and height, discarding 75% of the activations. As shown in
Fig. 3.1, we propose to divide the input features into groups of precision values and
first perform approximate convolution computations with only the most significant
bits (MSBs) of the feature data. Based on this approximate computation, if the
convolution results of the MSB group reveal the maximum, we can skip the convolution
operations of the LSB groups for the non-maximum candidates. Fig. 3.2 shows the
concept of the precision-cascading scheme. The main advantage of this scheme is that
the number of convolution operations can be reduced by up to the ratio given by Eq. (3.1).
$$\mathrm{ReductionRatio} = 1 - \frac{p \times p \times \frac{1}{N} + \frac{N-1}{N}}{p \times p} \qquad (3.1)$$
For example, we can reduce convolution operations by 50% when p=2, N=3 and
by 67% when p=3, N=4. However, the total delay time increases by more than 2×
because of the iterations needed to find the maximum. In addition, this scheme can be
applied only to a convolution layer immediately preceding a max-pooling layer.
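The two numerical examples can be reproduced with a few lines of Python. This sketch assumes that p × p is the max-pooling window and N is the number of precision groups, which is how the symbols appear to be used in Eq. (3.1); the function names are illustrative. It also counts group-level convolutions in the best case, where the MSB group alone reveals the maximum, to show that this count leads to the same ratio.

```python
from fractions import Fraction


def reduction_ratio(p, n_groups):
    """Eq. (3.1), evaluated exactly with rational arithmetic."""
    window = p * p
    kept = Fraction(window, n_groups) + Fraction(n_groups - 1, n_groups)
    return 1 - kept / window


def best_case_group_convs(p, n_groups):
    """Group-level convolutions when the MSB group alone reveals the maximum.

    All p*p candidates are computed with the MSB group; only the winner is
    then computed with the remaining n_groups - 1 groups.
    """
    with_pc = p * p + (n_groups - 1)
    without_pc = p * p * n_groups
    return with_pc, without_pc


for p, n in [(2, 3), (3, 4)]:
    with_pc, without_pc = best_case_group_convs(p, n)
    ratio = reduction_ratio(p, n)
    # Both ways of counting give the same saving.
    assert ratio == 1 - Fraction(with_pc, without_pc)
    print(f"p={p}, N={n}: {with_pc}/{without_pc} group convolutions, saving {float(ratio):.0%}")
```

Because the ratio describes the best case, the saving is smaller whenever further groups must be cascaded before the maximum is resolved.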
3.2.2 Fully Zero Skipping (ZS) Scheme Integrating Precision-Cascading
Most deep convolutional neural networks use the rectified linear unit (ReLU) as the
activation function. The ReLU returns 0 for any negative input and returns any
positive input unchanged, which makes DCNN data contain many zeros. We propose
to load only the non-zero values of the input features from external memory and to
store only those values in on-chip memory, using a sparsity map, and then compute
the convolution as shown in Fig. 3.3. The advantage of this scheme is that the
convolution operations and the memory access time are reduced significantly thanks
to the zeros produced by the ReLU operation. For example, we can reduce
the convolution operations by 50% on average for VGG-16. In addition, we can reduce
the on-chip memory size because only non-zero values are stored. However, we need
additional memory to store the sparsity map. In addition, the scheme limits
parallel computation in the spatial domain, such as across the kernel window or the
scan within one output feature map.
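A minimal Python sketch of the sparsity-map idea is given below. It assumes a one-bit-per-activation map and 8-bit activation values; the bit width, the flat list layout, and the function names are illustrative assumptions and do not describe the memory format of the chip.

```python
def compress(acts):
    """Return (sparsity_map, nonzero_values) for a flat list of activations."""
    sparsity_map = [1 if a != 0 else 0 for a in acts]
    values = [a for a in acts if a != 0]
    return sparsity_map, values


def decompress(sparsity_map, values):
    """Rebuild the dense activation list from the map and the non-zero values."""
    it = iter(values)
    return [next(it) if bit else 0 for bit in sparsity_map]


def storage_bits(acts, bits_per_value=8):
    """Dense size versus map-plus-values size, in bits."""
    dense = len(acts) * bits_per_value
    sparse = len(acts) + sum(a != 0 for a in acts) * bits_per_value
    return dense, sparse


# Hypothetical post-ReLU activations.
acts = [0, 5, 0, 0, 12, 0, 3, 0]
sparsity_map, values = compress(acts)
assert decompress(sparsity_map, values) == acts
print(sparsity_map, values, storage_bits(acts))
```

With one map bit per activation and b-bit values, the compressed form is smaller than the dense form once more than 1/b of the activations are zero, which is the extra-memory trade-off mentioned above.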
Fig. 3.4 shows that integrating PC and ZS creates a synergy: the convolution
operations and the memory access time are reduced further for all convolution
layers because there are more zero values when the PC scheme is used. In particular,
the ZS scheme significantly reduces the computation time of the convolution on the
MSB group of the PC scheme, because the MSB group has more zero values, as shown in
Table 3.1.
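Why the MSB group contains more zeros can be seen with a small Python sketch. It assumes 8-bit unsigned activations split into three bit groups of widths (3, 3, 2), MSB group first; the widths, names, and sample values are illustrative assumptions and are not the configuration or data behind Table 3.1.

```python
def split_groups(value, widths=(3, 3, 2)):
    """Split an unsigned integer into bit groups, MSB group first."""
    total = sum(widths)
    assert 0 <= value < (1 << total)
    groups, shift = [], total
    for w in widths:
        shift -= w
        groups.append((value >> shift) & ((1 << w) - 1))
    return groups


def zero_fraction(values):
    """Fraction of entries that are exactly zero."""
    return sum(v == 0 for v in values) / len(values)


# Hypothetical post-ReLU activations (8-bit unsigned), not measured data.
acts = [0, 0, 3, 17, 0, 40, 5, 129, 0, 12, 31, 64]

msb = [split_groups(a)[0] for a in acts]   # group 1 (top 3 bits)
rest = [a & 0b11111 for a in acts]         # groups 2 and 3 together (low 5 bits)

print(f"baseline {zero_fraction(acts):.2f}, "
      f"group1 {zero_fraction(msb):.2f}, group2&3 {zero_fraction(rest):.2f}")
```

A zero activation is zero in every group, so each group has at least as many zeros as the baseline; small non-zero activations add further zeros to the MSB group.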
3.3 Proposed Convolution Loop Acceleration Strategy
Convolution is the main operation in DCNN algorithms and involves three-dimensional
multiply-and-accumulate (MAC) operations on input feature maps and
convolution kernel weights. Convolution comprises four levels of loops, as shown
in the pseudocode in Fig. 3.5. To efficiently map and perform the convolution
loops, three loop optimization techniques (Zhang et al. (2015), Bacon et al. (1994)),
namely, loop unrolling, loop tiling and loop interchange, are employed to customize
the computation and communication patterns of the accelerator with three levels of
memory hierarchy.
Figure 3.3: The conceptual operation of fully zero skipping scheme.
Figure 3.4: Illustration of integrating PC and ZS.
Figure 3.5: Four levels of convolution loops, where L denotes the index of convolution
layer and S denotes the sliding stride.
Loop unrolling determines the parallelism scheme of certain convolution loops,
and thus the required size of registers and PEs. Loop tiling determines the required
capacity of on-chip buffers. It divides the loops into multiple blocks, and the data
of the executing block are read from external memory and stored in on-chip buffers.
Loop interchange determines the computation order of the four loops and thus affects
the dataflow between the adjacent levels of memory hierarchy. There are two kinds
of loop interchange, namely intra-tiling and inter-tiling loop orders. Intra-tiling loop
order determines the pattern of data movements from on-chip buffer to register files
or PEs.
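For reference, a generic four-level convolution loop nest can be written as follows. This is a sketch, not a reproduction of Fig. 3.5: the loop numbering in the comments (loop-1 kernel window, loop-2 input feature maps, loop-3 scan within an output feature map, loop-4 output feature maps) is an assumption chosen to be consistent with the discussion in this section, and the function and variable names are illustrative.

```python
def conv_layer(inputs, kernels, stride=1):
    """Direct convolution without padding.

    inputs[c][y][x], kernels[m][c][ky][kx] -> outputs[m][oy][ox]
    """
    n_if, in_h, in_w = len(inputs), len(inputs[0]), len(inputs[0][0])
    n_of, k_h, k_w = len(kernels), len(kernels[0][0]), len(kernels[0][0][0])
    out_h = (in_h - k_h) // stride + 1
    out_w = (in_w - k_w) // stride + 1
    outputs = [[[0] * out_w for _ in range(out_h)] for _ in range(n_of)]
    for m in range(n_of):                    # loop-4: across output feature maps
        for oy in range(out_h):              # loop-3: scan within one output map
            for ox in range(out_w):
                acc = 0
                for c in range(n_if):        # loop-2: across input feature maps
                    for ky in range(k_h):    # loop-1: MACs within the kernel window
                        for kx in range(k_w):
                            pixel = inputs[c][oy * stride + ky][ox * stride + kx]
                            acc += pixel * kernels[m][c][ky][kx]
                outputs[m][oy][ox] = acc
    return outputs


# Tiny example: one 3x3 input map, one 2x2 kernel.
example_in = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
example_k = [[[[1, 0], [0, 1]]]]
print(conv_layer(example_in, example_k))
```

Unrolling a loop corresponds to executing its iterations on parallel PEs, tiling to splitting its range into blocks that fit the on-chip buffers, and interchange to reordering the `for` statements.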
Table 3.1: Analysis Results of Zero Percentage at VGG-16 for ILSVRC2012 Valid Set
           baseline   group1    group2&3
conv1_1    0.45%      56.22%    1.48%
conv1_2    48.80%     99.35%    48.81%
conv2_1    19.36%     81.13%    19.48%
conv2_2    34.11%     75.40%    34.28%
conv3_1    31.00%     61.37%    31.20%
conv3_2    48.26%     71.25%    48.41%
conv3_3    49.29%     68.09%    49.45%
conv4_1    52.45%     67.80%    52.60%
conv4_2    65.79%     81.77%    65.87%
conv4_3    73.43%     91.41%    73.49%
conv5_1    75.85%     84.77%    75.92%
conv5_2    77.29%     91.48%    77.34%
conv5_3    80.18%     97.73%    80.20%
Average    50.48%     79.06%    50.66%
We optimized the unrolling of loop-1 and loop-3 based on our proposed architecture;
the unrolling of loop-1 and loop-3 is fixed by the kernel window size. Computing
latency can be reduced further by unrolling loop-2 and loop-4 more, at the cost of
more PE units. As shown in Fig. 3.6, the number of on-chip buffer accesses is
minimized by unrolling loop-4, since input features can be re-used. In order to
minimize partial sum storage, loop-1 and loop-2 are fully tiled, and the intra-tiling
order is loop-1 → loop-2 → loop-4 → loop-3.
Fig. 3.7 illustrates different inter-tiling loop orders. We analyzed five cases of
inter-tiling loop order for the VGG-16 convolution layers, since the inter-tiling loop
order determines the data movement from external memory to the on-chip buffer. In
case 1, the number of DRAM accesses per pixel of input features is 1, since input
features are loaded only once from off-chip memory, but the number of DRAM accesses
per pixel of kernel features equals the number of tiles for input features. On the other
hand, in case 2, the number of DRAM accesses per pixel of kernel features is 1, but the
number of DRAM accesses per pixel of input features equals the number of tiles for
kernel features. Table 3.2 shows that the total DRAM access can be reduced by 2.12×
with our inter-tiling loop order strategy.
• Case 1: All tiles in loop-4 are computed first and the tiles in loop-3 are computed
at the end.
• Case 2: All tiles in loop-3 are computed first and the tiles in loop-4 are computed
at the end.
• Case 3, 4: Applying the zero-skipping technique to case 1 and case 2, respectively.
• Case 5: Applying case 3 or case 4 to each layer individually.
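The trade-off between case 1 and case 2 can be sketched with a simple traffic model in Python. The model below counts only DRAM reads of input and kernel features, ignores output writes and zero skipping, and uses hypothetical layer sizes and tile counts; the function names are illustrative and the numbers are not taken from the VGG-16 analysis.

```python
def dram_traffic(input_bytes, kernel_bytes, n_input_tiles, n_kernel_tiles, case):
    """Bytes read from DRAM for one layer under the two inter-tiling orders.

    Case 1: inputs are read once, kernels are re-read for every input tile.
    Case 2: kernels are read once, inputs are re-read for every kernel tile.
    """
    if case == 1:
        return input_bytes + kernel_bytes * n_input_tiles
    if case == 2:
        return kernel_bytes + input_bytes * n_kernel_tiles
    raise ValueError("case must be 1 or 2")


def best_case(layer):
    """Pick the cheaper order for one layer, i.e. a per-layer choice."""
    costs = {c: dram_traffic(case=c, **layer) for c in (1, 2)}
    choice = min(costs, key=costs.get)
    return choice, costs[choice]


# Hypothetical layers: an early layer (large inputs, few weights) and a
# late layer (small inputs, many weights).
layers = [
    dict(input_bytes=3_000_000, kernel_bytes=40_000, n_input_tiles=16, n_kernel_tiles=2),
    dict(input_bytes=400_000, kernel_bytes=2_400_000, n_input_tiles=4, n_kernel_tiles=8),
]

for case in (1, 2):
    print(f"case {case} for all layers:", sum(dram_traffic(case=case, **l) for l in layers))
print("per-layer choice:", [best_case(l) for l in layers])
```

In this toy example the early layer prefers case 1 and the late layer prefers case 2, which illustrates why selecting the order per layer can give lower total traffic than either fixed order.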
Figure 3.6: Analysis of unroll loop-2 vs. unroll loop-4.
Figure 3.7: Illustration of inter-tiling loop order.
3.4 Energy-Efficient DCNN Architecture Based on PC and ZS
Table 3.2: Analysis Results of DRAM Access on Different Cases of Inter-tiling Strategy
                    Case1   Case2   Case3   Case4   Case5
DRAM access (MB)    73.49   117.1   67.59   57.64   55.31
We propose a special and efficient architecture for precision-cascading and
zero-skipping, as shown in Fig. 3.8. To maximize the benefit of the zero-skipping
scheme, input features are loaded and computed across the channel dimension.
Only the non-zero values of the input channel features at one pixel enter the
PE array, and they are re-used to generate 9 output features. While input features
are loaded from the on-chip memory for activations, only the kernel features
that correspond to those input features are loaded from the on-chip buffer for weights.
Therefore, we can not only fully skip the zero values of the input features but also
skip the kernel features corresponding to those zero values. In addition, we employ a
holding/shifting partial sum scheme in order to maximize input feature re-use. For
example, when the input features at (x, y) enter the PE array, each of the 9 PEs computes
MAC operations for a different output feature at a different pixel location. Then,
when the next input features at (x, y + 1) enter the PE array, each PE computes
MAC operations with the previous partial sums, which have been shifted. Some partial
sums are moved into the on-chip buffer and held there until they can be accumulated
with MAC operations in the PEs.
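The input-driven, zero-skipping computation described above can be modeled arithmetically in Python. This sketch assumes a single 3 × 3 kernel set for one output feature map, stride 1, and zero padding of 1; it only models which MACs are skipped and where each product is accumulated, not the PE array timing or the partial-sum shifting. The function names and data are hypothetical.

```python
def conv3x3_input_driven(acts, weights):
    """Zero-skipping 3x3 convolution driven by input pixels.

    acts[y][x][c] are post-ReLU activations, weights[c][ky][kx] is one 3x3
    kernel per input channel. Returns (out[y][x], number of MACs performed).
    """
    h, w = len(acts), len(acts[0])
    out = [[0] * w for _ in range(h)]
    macs = 0
    for y in range(h):
        for x in range(w):
            for c, a in enumerate(acts[y][x]):
                if a == 0:
                    continue  # skip the weight read and all 9 MACs
                # One non-zero input contributes to up to 9 output pixels.
                for ky in range(3):
                    for kx in range(3):
                        oy, ox = y - (ky - 1), x - (kx - 1)
                        if 0 <= oy < h and 0 <= ox < w:
                            out[oy][ox] += a * weights[c][ky][kx]
                            macs += 1
    return out, macs


def conv3x3_dense(acts, weights):
    """Reference output-driven convolution without zero skipping."""
    h, w = len(acts), len(acts[0])
    out = [[0] * w for _ in range(h)]
    for oy in range(h):
        for ox in range(w):
            for ky in range(3):
                for kx in range(3):
                    y, x = oy + ky - 1, ox + kx - 1
                    if 0 <= y < h and 0 <= x < w:
                        for c, a in enumerate(acts[y][x]):
                            out[oy][ox] += a * weights[c][ky][kx]
    return out


# Hypothetical 3x3 feature map with 2 channels and many zeros.
acts = [[[0, 2], [1, 0], [0, 0]],
        [[3, 0], [0, 0], [0, 4]],
        [[0, 0], [5, 0], [0, 1]]]
weights = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
           [[-1, 0, 1], [0, 2, 0], [1, 0, -1]]]

sparse_out, macs = conv3x3_input_driven(acts, weights)
assert sparse_out == conv3x3_dense(acts, weights)
print("MACs performed with zero skipping:", macs)
```

The input-driven form produces the same outputs as the dense reference while performing MACs only for non-zero activations, which is the property the architecture exploits.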
Figure 3.8: The concept of special and efficient architecture for PC and ZS.
Fig. 3.9 shows the top-level block diagram and data flow. The non-zero values of
all input features and the sparsity map data are stored in external memory, as are
all kernel features. Input features and kernel features are then loaded from the
SDRAMs, one tile at a time, through the SDRAM controller and stored in on-chip
memory. Once they are stored, the PE array starts to compute MAC operations on the
input features and kernel features delivered from SRAM by the input and kernel
feature data controllers. As shown in Fig. 3.9, each PE consists of a MAC unit, a
register, and a control unit. The register holds partial sums, and the control unit
shifts partial sums to an adjacent PE, an external register, or SRAM. Partial sums in
the SRAM and the registers are returned to the PE array by the partial sums controller
when they are needed. After some initial latency, PE3_3 generates
final output features every cycle. The final output features go through the ReLU/Pooling
module and are then stored in external memory by the SDRAM controller. All
processes are fully pipelined.
Our proposed architecture consists of an array of 324 PEs (= 18 × 18), which can
support various kernel sizes of 3 × 3, 5 × 5, and 7 × 7, as shown in Fig. 3.10. For
3 × 3, each PE includes 36 sub-PEs, which gives 100% PE utilization. In the normal
layer case, we can compute 12 output features in parallel. In the PC+ZS layer case,
18 output features can be processed in parallel. For 5 × 5, each PE includes 12
sub-PEs, which gives 92.6% PE utilization. In the normal layer case, we can compute
4 output features in parallel. In the PC+ZS layer case, 6 output features can be
processed in parallel. For 7 × 7, each PE includes 6 sub-PEs, which gives 90.7% PE
utilization. In the normal layer case, 2 output features can be processed in parallel.
In the PC+ZS layer case, we can compute 3 output features in parallel. For 5 × 5 and
7 × 7, the PEs are ordered along the optimal data path for moving partial sums
between PEs.
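The utilization figures follow from simple arithmetic, sketched below in Python. The sketch assumes that one main PE is allocated per kernel position (k × k of them) and that each holds the stated number of sub-PEs; this reading and the function name are assumptions.

```python
TOTAL_PES = 18 * 18  # 324 PEs in the array


def pe_utilization(kernel_size, sub_pes_per_position, total=TOTAL_PES):
    """Fraction of the PE array used for a k x k kernel mapping."""
    used = kernel_size * kernel_size * sub_pes_per_position
    if used > total:
        raise ValueError("mapping does not fit in the PE array")
    return used / total


for k, sub in [(3, 36), (5, 12), (7, 6)]:
    print(f"{k}x{k}: {k * k * sub} of {TOTAL_PES} PEs, {pe_utilization(k, sub):.1%}")
```

Under this reading, 9 × 36 = 324, 25 × 12 = 300, and 49 × 6 = 294 PEs are active, matching the stated 100%, 92.6%, and 90.7%.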
3.5 Implementation Results in 40nm CMOS
The proposed DCNN accelerator was implemented in 40nm CMOS. The chip
micrograph is shown in Fig. 3.11, where the total area is 3×3 mm², including the
off-chip memory controller. The chip specifications are summarized in Table 3.3.
We implemented an SDRAM controller to achieve energy-efficient data movement
between our proposed accelerator and off-chip memory by exploiting data sparsity.
Since we store only the non-zero values of activations through the fully zero-skipping
scheme, we can reduce the number of DRAM accesses significantly. Fig. 3.12 shows an
example of loading the sparsity map and input activations. After the sparsity map
data are loaded from off-chip memory, the non-zero activation values are loaded from
off-chip memory wherever the sparsity map contains a '1'. We achieved a 5.8×
reduction of DRAM access in the VGG-16 convolution layers compared to Eyeriss
(Chen et al. (2016)), as shown in Fig. 3.13.
Figure 3.9: Top-level block diagram and the end-to-end data flow of proposed DCNN accelerator.
Figure 3.10: Supports various kernel size for (a) 3 × 3, (b) 5 × 5, (c) 7 × 7.
Table 3.3: Chip Specifications
CMOS Tech.              TSMC 40nm GP
Core area               3mm × 3mm
Gate Count (NAND2)      2.87M gates
# PEs                   324 (= 18 × 18)
On-chip SRAM            339.5 KB (weights: 121.5KB, activation: 176KB,
                        sparsity map: 39KB, partial sums: 3KB)
REG for partial sums    2.11 KB
Nominal Voltage         0.9 V
Core Frequency          400 MHz
Power                   203 mW
Figure 3.11: Micrograph of the proposed accelerator chip.
The measured chip performance (frames/second) and total/leakage power consumption
with dynamic voltage scaling are shown in Fig. 3.14. The proposed accelerator
chip was fully functional down to 0.6V, where the chip demonstrated 1.9
fps with 47mW power. Fig. 3.15 shows the area and measured power breakdown of
the prototype chip. 27% of the total chip area is occupied by on-chip SRAM arrays,
which store the kernel weights, input/output feature maps, and partial sums.
On the other hand, 10% of the total chip power was consumed by the SRAM arrays,
since the proposed precision-cascading and fully zero-skipping schemes significantly
reduce the power consumption of memory access.
Figure 3.12: Example of loading sparsity map/input features from external memory.
Figure 3.13: DRAM access comparison to Eyeriss in VGG-16 Convolution layers.
Table 3.4 summarizes the performance breakdown of the convolution layers in
VGG-16 for the ImageNet Large Scale Visual Recognition Competition 2012 (ILSVRC2012)
validation set. We achieved a peak of 8.88 TOPS/W in convolution layer 5-3 and 1.22
TOPS/W on average for the VGG-16 convolution layers on the ILSVRC2012 validation set
while consuming 203mW at 0.9V. The required DRAM access for an inference is 55.31 MB,
or 0.0018 access/MAC.
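The throughput and efficiency figures can be related to each other with a short Python sketch. It assumes that GOPS is the operation count divided by the latency and that TOPS/W divides this by the single 203 mW power figure for every layer; the selected rows are copied from Table 3.4 and the function names are illustrative.

```python
POWER_W = 0.203  # 203 mW at 0.9 V, assumed identical for all layers

# (operations in Gops, latency in ms), copied from Table 3.4.
LAYERS = {
    "CONV1-1": (0.177, 1.87),
    "CONV4-3": (3.7, 4.29),
    "CONV5-3": (0.92, 0.51),
    "Total": (30.7, 123.55),
}


def gops(giga_ops, latency_ms):
    """Throughput in GOPS."""
    return giga_ops / (latency_ms * 1e-3)


def tops_per_watt(giga_ops, latency_ms, power_w=POWER_W):
    """Energy efficiency in TOPS/W."""
    return gops(giga_ops, latency_ms) / 1e3 / power_w


for name, (ops, latency) in LAYERS.items():
    print(f"{name}: {gops(ops, latency):7.1f} GOPS, {tops_per_watt(ops, latency):.2f} TOPS/W")

conv_fps = 1e3 / LAYERS["Total"][1]  # frames/s over the convolution layers only
print(f"convolution layers only: {conv_fps:.1f} frames/s")
```

The results agree with the table up to the rounding of the listed operation counts and latencies.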
We compared the latency and energy efficiency of VGG-16 on the ILSVRC2012
validation set with and without the PC and fully ZS schemes. Fig. 3.16 shows that
we achieved a substantial latency reduction across all layers by using PC with fully
ZS. Fig. 3.17 shows that we achieved significantly higher TOPS/W, especially for
convolution layers followed by max-pooling. We note that DCNNs for ImageNet such
as VGG-16 have few max-pooling layers, so the overall energy-efficiency gain is
smaller than the gain for the convolution layers followed by max-pooling. However,
we can expect a significant overall increase in energy efficiency when applying
DCNNs to applications such as autonomous driving and medical imaging that use
high-resolution images, since they require many more max-pooling layers.
3.6 Conclusion
In this chapter, we presented a 40nm energy-efficient accelerator for deep convolutional
neural networks. We proposed the precision-cascading scheme to reduce redundant
convolution operations due to max pooling. In addition, by integrating
precision-cascading with fully zero-skipping, which exploits zero data, we achieved a
significant reduction of energy and external memory accesses. The accelerator achieved
a peak of 8.88 TOPS/W and an average of 1.22 TOPS/W for the VGG-16 convolution
layers on the ILSVRC2012 validation set while consuming 203mW at 0.9V. The proposed
convolution loop acceleration strategy with the fully zero-skipping scheme reduced
the number of off-chip memory accesses by 2.12× overall in the VGG-16 convolution layers.
Figure 3.14: Measured frame rate per second and total/leakage power with voltage
scaling.
Figure 3.15: (a) Power and (b) area breakdown of the overall system.
Figure 3.16: Latency comparison between w/ PC + ZS and wo/ PC + ZS.
Figure 3.17: Energy-efficiency comparison between w/ PC + ZS and wo/ PC + ZS.
Table 3.4: Performance Breakdown of the VGG-16 in ILSVRC2012
Layer    | Operating latency | # Ops  | Zeros in activation (Gr1 / Gr2&3) | GOPS   | TOPS/W | DRAM Access
CONV1-1  | 1.87 ms           | 0.177G | 56.2% / 1.48%                     | 94.29  | 0.46   | 0.22 MB
CONV1-2  | 7.63 ms           | 3.7G   | 99.4% / 48.8%                     | 485.72 | 2.39   | 2.61 MB
CONV2-1  | 17.8 ms           | 1.85G  | 81.1% / 19.5%                     | 103.99 | 0.51   | 1.05 MB
CONV2-2  | 12.52 ms          | 3.7G   | 75.4% / 34.3%                     | 295.64 | 1.46   | 3.21 MB
CONV3-1  | 14.54 ms          | 1.85G  | 61.4% / 31.2%                     | 127.26 | 0.63   | 2.15 MB
CONV3-2  | 21.81 ms          | 3.7G   | 71.3% / 48.4%                     | 169.68 | 0.84   | 7.43 MB
CONV3-3  | 12.49 ms          | 3.7G   | 68.1% / 49.5%                     | 296.19 | 1.46   | 5.73 MB
CONV4-1  | 10.26 ms          | 1.85G  | 67.8% / 52.6%                     | 180.33 | 0.89   | 3.56 MB
CONV4-2  | 14.76 ms          | 3.7G   | 81.8% / 65.9%                     | 250.63 | 1.23   | 11.66 MB
CONV4-3  | 4.29 ms           | 3.7G   | 91.4% / 73.5%                     | 862.88 | 4.25   | 7.41 MB
CONV5-1  | 2.61 ms           | 0.92G  | 84.8% / 75.9%                     | 355.03 | 1.75   | 3.43 MB
CONV5-2  | 2.45 ms           | 0.92G  | 91.5% / 77.3%                     | 377.54 | 1.86   | 3.42 MB
CONV5-3  | 0.51 ms           | 0.92G  | 97.7% / 80.2%                     | 1802.2 | 8.88   | 3.42 MB
Total    | 123.55 ms         | 30.7G  | 79.1% / 50.7%                     | 248.58 | 1.22   | 55.31 MB
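As a consistency check on Table 3.4, the throughput column can be recomputed from the operation count and latency columns. The sketch below does this for every row; the helper name `gops` and the printed error format are illustrative choices and are not part of the original evaluation flow. Small differences are expected because the operation counts in the table are rounded.

```python
# Rows of Table 3.4: (layer, latency in ms, operations in G-ops, reported GOPS).
ROWS = [
    ("CONV1-1", 1.87, 0.177, 94.29),
    ("CONV1-2", 7.63, 3.7, 485.72),
    ("CONV2-1", 17.8, 1.85, 103.99),
    ("CONV2-2", 12.52, 3.7, 295.64),
    ("CONV3-1", 14.54, 1.85, 127.26),
    ("CONV3-2", 21.81, 3.7, 169.68),
    ("CONV3-3", 12.49, 3.7, 296.19),
    ("CONV4-1", 10.26, 1.85, 180.33),
    ("CONV4-2", 14.76, 3.7, 250.63),
    ("CONV4-3", 4.29, 3.7, 862.88),
    ("CONV5-1", 2.61, 0.92, 355.03),
    ("CONV5-2", 2.45, 0.92, 377.54),
    ("CONV5-3", 0.51, 0.92, 1802.2),
    ("Total", 123.55, 30.7, 248.58),
]


def gops(ops_g, latency_ms):
    """Throughput in GOPS = (operations in G-ops) / (latency in seconds)."""
    return ops_g / (latency_ms * 1e-3)


for name, latency_ms, ops_g, reported in ROWS:
    computed = gops(ops_g, latency_ms)
    rel_err = abs(computed - reported) / reported
    print(f"{name:8s} computed={computed:8.2f} reported={reported:8.2f} err={rel_err:.2%}")
```

Under these inputs the recomputed values stay within about one percent of the reported column, and the per-layer latencies add up to the reported total within rounding.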
Chapter 4
ENERGY-EFFICIENT IN-MEMORY COMPUTING ACCELERATOR FOR
DEEP NEURAL NETWORKS
This chapter proposes an energy-efficient in-memory computing accelerator for
deep neural networks. In conventional digital designs for deep learning computa-
tion, the biggest bottleneck for energy-efficient deep neural networks (DNNs) has
reportedly been the data access and movement. To eliminate the storage access bot-
tleneck, new SRAM macros that support in-memory computing have been recently
demonstrated. Several in-SRAM computing works have used the mix of analog and
digital circuits to perform XNOR-and-ACcumulate (XAC) operation without row-by-
row memory access and can map a subset of DNNs with binary weights and binary
activations. In the single array level, large improvement in energy efficiency (e.g.,
two orders of magnitude improvement) has been reported in computing XAC over
digital-only hardware performing the same operation. In this work, by integrating
many instances of such in-memory computing SRAM macros with an ensemble of pe-
ripheral digital circuits, a new DNN accelerator is proposed. This new accelerator is
designed to support configurable multibit activations and large-scale DNNs seamlessly
while substantially improving the chip-level energy efficiency with a favorable accuracy
tradeoff compared to conventional digital ASICs. The proposed accelerator is fully
designed and laid out in 28nm CMOS, demonstrating ultralow energy consumption
for DNNs.
4.1 Introduction
In recent years, deep learning and deep neural networks have unprecedentedly
improved the accuracies in large-scale recognition tasks. However, to achieve incre-
mental accuracy improvement, the state-of-the-art deep learning algorithms tend to
present very deep and large network models (e.g., 1000-layer networks (He et al.
(2016))), and this poses significant challenges for DNN hardware implementations in
terms of computational complexity, memory access, and the associated energy cost.
A number of prior works from algorithms to hardware reduced the energy cost.
On the algorithm side, pruning and compression have been extensively studied (Han
et al. (2015)), substantially reducing the number of nonzero parameters. In addition,
a number of low-precision techniques (Hubara et al. (2017), Zhou et al. (2016), Guan
et al. (2017b)) have been investigated with minimal degradation in the classification
accuracy.
On the hardware side, many digital application-specific integrated circuit (ASIC)
designs in CMOS (e.g., IBM TrueNorth (Merolla et al. (2014)), Eyeriss (Chen et al.
(2016)), and ENVISION (Moons et al. (2017))) have been previously presented to
help bring expensive algorithms to a low-power processor. However, limitations still
exist on memory footprint, on-/off-chip communication, and accuracy-energy tradeoff.
It is still a challenging task to enable essential deep learning processors in mobile,
wearable, Internet of Things (IoT), and extreme implantable devices due to their
divergent constraints in low power and small footprint.
In particular, the CMOS ASIC designs (Merolla et al. (2014), Chen et al. (2016),
Moons et al. (2017), Shin et al. (2017)) show that accessing memory is the biggest
bottleneck for energy-efficient real-time cognitive computing in terms of storing mil-
lions of parameters, loading them from embedded SRAM memory, and moving them
to where computing actually occurs. Although SRAM technology has been following
the CMOS scaling trend well, to compute Multiply-and-ACcumulate (MAC) operations
in DNNs, conventional SRAMs still require millions to billions of row-by-row
accesses, which limits the parallelism and dissipates a large amount of read/write
energy.
Figure 4.1: Left: conventionally, SRAM data are read out row-by-row to perform
computation at the periphery. Right: in-memory computing schemes embed logic
computation inside the SRAM by turning on all rows simultaneously.
To address this limitation, in the last couple of years, several works proposed the
in-SRAM computing concept (see Fig. 4.1), which performs computation in the SRAM
hardware without reading out each row of SRAM to a computing unit. However,
most prior works only demonstrated small DNNs for MNIST data set with relatively
low accuracy (less than 96%) (Biswas and Chandrakasan (2018), Chen et al. (2018b),
Khwa et al. (2018)).
Although a large amount of energy reduction is demonstrated, it should be noted
that most in-SRAM computing works that turn on all rows or columns simultaneously
only demonstrate a relatively small custom SRAM array at the single-array level
(Biswas and Chandrakasan (2018), Chen et al. (2018b), Jiang et al. (2018)). To
implement an overall DNN accelerator using in-memory computing SRAMs, there
are several important challenges and missing pieces, including integration of many
in-memory computing SRAM arrays, activation storage/communication, additional
digital logic for non-MAC operations, and row-by-row write energy consideration.
In this work, we substantially expanded the prior single-array-level XNOR-SRAM
design (Jiang et al. (2018)) toward a configurable DNN accelerator architecture that
integrates 72 XNOR-SRAM arrays with interarray communication and supports a
wide range of DNN/CNN algorithms with configurable activation precision. The
proposed accelerator supports 3×3 and 1×1 convolutional kernels and up to 256
feature maps in a convolutional layer.
The main contributions of this chapter are as follows.
• Chip-level DNN/CNN accelerator architecture that employs many instances of
in-memory computing SRAM (e.g., XNOR-SRAM) macros with a methodology
to efficiently load/map weights onto such XNOR-SRAM arrays for convolution
layers and fully connected layers of DNNs
• Peripheral digital logic that can support multibit activations, which becomes
an effective knob to favorably tradeoff energy versus accuracy by reusing the
XNOR-SRAM macro
• Double-buffering technique with two groups of in-memory computing SRAMs,
which effectively hides the latencies of reprogramming in-memory computing
SRAM arrays with new DNN weights
• Evaluation of the chip-level energy benefits and remaining bottlenecks of
in-memory computing-based DNN accelerators
The remainder of this chapter is organized as follows. In Section 4.2, recent in-memory
computing hardware designs for DNNs are reviewed. Section 4.3 presents
the XNOR-SRAM macro. Section 4.4 discusses practical design challenges when designing
a chip-level DNN accelerator using many in-memory computing macros, such
as XNOR-SRAM. In Section 4.5, the microarchitecture of the proposed accelerator is
described, including the optimal precision study and the methodology to efficiently map
optimized DNNs/CNNs onto XNOR-SRAM arrays. Section 4.6 reports experimental
results on speed and energy dissipation across several workloads. Finally, this
chapter is concluded in Section 4.7.
4.2 Recent In-Memory Computing Hardware Designs for DNN
Recently, considerable attention has been drawn to developing DNNs that only
use binary (+1 and -1) weights, demonstrating orders of magnitude reduction in
computational complexity at tolerable accuracy degradation (Rastegari et al. (2016), Zhou
et al. (2016), Hubara et al. (2016), Guan et al. (2017a)). The advent of binary-weight
DNNs and CNNs opens a new possibility for SRAM-based in-memory computing,
since each weight in those algorithms can be stored in a single SRAM
bitcell. By turning on multiple or all rows simultaneously, the input/activation values
are applied as wordline (WL) voltages, which, in turn, interact with the bitcells to
perform MAC computation, typically in an analog manner. This can eliminate explicit
memory accesses, which otherwise pose energy/performance bottlenecks in DNN/CNN
hardware implementations. A number of works have recently demonstrated this type
of in-SRAM computing (Biswas and Chandrakasan (2018), Khwa et al. (2018), Valavi
et al. (2018), Jiang et al. (2018)), and we summarize representative works in Fig.
4.2.
Figure 4.2: Comparison of recent in-SRAM computing hardware demonstrations.
In (Zhang et al. (2016)), in-SRAM computing hardware in 130-nm CMOS was
demonstrated. This design employs binary weights, each of which is stored in a
6T bitcell. The 5-bit inputs are converted to analog voltages via digital-to-analog
converters (DACs) embedded in the address decoder, which drives the WLs. Each
WL voltage modulates the resistance of access transistors of bitcells of that row.
Depending on the weight stored in each bitcell, the bitcell either discharges or charges
the BLs, making BL voltage proportionally grow with the MAC computation results.
The BL voltage is finally digitized into a binary value by a single sense amplifier in
the column circuitry.
This work employs 6T SRAM circuits, promising a compact silicon footprint.
However, it cannot support mainstream DNN and CNN algorithms. Even though
binarized neural network (BNN) algorithms (Rastegari et al. (2016), Hubara et al.
(2016)) binarize the activation at the output of each layer, still the partial sum and
accumulation need to be performed with high precision. Because the partial sum at
each SRAM output is binarized prematurely in (Zhang et al. (2016)), the final neural
network accuracy can be considerably degraded when a neural network layer does not
fit in one SRAM array. A boosted classifier that combines many weak classifiers
(shallow neural networks) was demonstrated, but it achieved only 90% accuracy on the
MNIST data set.
Conv-RAM (Biswas and Chandrakasan (2018)) integrates DACs for analog inputs,
binary weights stored in SRAM, and analog-to-digital converters (ADCs) to convert
the in-memory computation results back to digital values. In-memory computation
in (Biswas and Chandrakasan (2018)) targets convolution operation, which is accom-
plished by rowwise charge sharing of the SRAM bitcells in the same row. To perform
this, local analog multiply-and-average circuits are added every 16 rows (out of the
256-row custom SRAM array). However, within a block of 16 rows, the in-memory
SRAM still goes through row-by-row operation, and the integrating ADC exhibits
slow speed. MNIST accuracy of 96% was reported, but only 100 test images were
used for the accuracy calculation.
Khwa et al. (2018) demonstrated MAC operations using a 4-kb SRAM for fully
connected neural networks in edge processors. This article proposed techniques to
mitigate the challenges of excessive current, sense-amplifier offset, and sensing refer-
ence voltage optimization, arising due to simultaneous activation of multiple WLs in
the in-memory computing scheme. However, similar to (Zhang et al. (2016)), each
column output is binarized with a single sense amplifier, which limits the accuracy
and scalability to arbitrarily large DNNs. In addition, only fully connected layers of
DNNs are mapped onto in-SRAM computing, and the measured MNIST accuracy
was limited to 95.1%.
PROMISE (Srivastava et al. (2018)) is a programmable mixed-signal accelerator
that supports diverse machine learning algorithms with a custom instruction set ar-
chitecture (ISA) and compiler support. However, they only demonstrated machine
learning tasks on relatively simple benchmarks, including MNIST and MIT-CBCL
data sets.
Recently, a binarized CNN accelerator was presented in (Valavi et al. (2018)),
which also performs a modified batch normalization with analog computation, and
reported 83.27% test accuracy for CIFAR-10 data set. However, since each column
end is binarized with a single sense amplifier, it cannot naturally support ensuing
high-precision operations such as max-pooling and also lacks scalability for larger
CNNs.
In this work, we are focusing on SRAM as the memory substrate of in-memory
computing. On the other hand, researchers have also presented in-memory comput-
ing designs based on emerging nonvolatile memory technologies, such as phase-change
memory (PCM) (Burr et al. (2015)), resistive RAM (RRAM) (Chen et al. (2018b)),
and magnetic RAM (MRAM) (Parveen et al. (2018)). Compared to SRAM, these
technologies promise a bitcell that can store a multibit weight, often in the form of
an analog variable (e.g., resistance), in a smaller footprint. This is a significant benefit
for supporting DNN and CNN algorithms that use multibit weights. However, these
emerging memory devices exhibit significant challenges, including high variability and
nonlinearity (Gokmen and Vlasov (2016)), and most importantly, the manufactura-
bility has not reached that of CMOS technologies, which severely limits system-level
integration with many large arrays.
4.3 XNOR-SRAM: Scalable SRAM Macro for In-Memory Computing
In (Jiang et al. (2018)), a new mixed-signal in-memory computing SRAM macro ti-
tled XNOR-SRAM was presented, which extends the scalability and efficient mapping
capability of a wide range of neural networks. It performs XNOR-and-ACcumulate
(XAC) operations in BNNs (Rastegari et al. (2016), Hubara et al. (2016)), which
replace the MAC operations of nonbinary DNNs, with high speed and energy efficiency.
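The reason an XNOR-and-accumulate can stand in for a multiply-and-accumulate when both operands are binary can be illustrated with a few lines of code. The sketch below is only a software illustration; the encoding (bit 1 for +1, bit 0 for -1) and the function names are assumptions made for the example and are not taken from the XNOR-SRAM design.

```python
import random


def mac_pm1(weights, inputs):
    """Reference multiply-and-accumulate over +1/-1 values."""
    return sum(w * x for w, x in zip(weights, inputs))


def xac(weight_bits, input_bits):
    """XNOR-and-accumulate on 0/1 encodings (bit 1 encodes +1, bit 0 encodes -1).

    The XNOR of two bits is 1 exactly when the +1/-1 product is +1, so the
    signed sum equals 2 * (number of XNOR ones) - N.
    """
    n = len(weight_bits)
    ones = sum(1 for w, x in zip(weight_bits, input_bits) if w == x)
    return 2 * ones - n


random.seed(0)
N = 256  # one 256-input XAC, matching the column height quoted in the text
w_bits = [random.randint(0, 1) for _ in range(N)]
x_bits = [random.randint(0, 1) for _ in range(N)]
w_pm1 = [2 * b - 1 for b in w_bits]
x_pm1 = [2 * b - 1 for b in x_bits]

# The bit-level XAC and the arithmetic MAC agree for every binary input pair.
assert xac(w_bits, x_bits) == mac_pm1(w_pm1, x_pm1)
```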
Figure 4.3: XNOR-SRAM design proposed in (Jiang et al. (2018)). (a) XNOR-
SRAM macro can map the XNOR-and-accumulate operation of binary-weight DNNs.
The top shows logical computation and the bottom shows the schematics of XNOR-
SRAM. (b) XNOR-SRAM bitcell design. Devices T7-T12 are added to a 6T SRAM.
(c) XAC operation with ternary activations and binary weights. (d) Measured VRBL
over corresponding logical results of 256-input XAC operations.
Fig. 4.3(a) presents the reported XNOR-SRAM array and peripheries, which
can map convolution and fully connected layers of CNNs and multilayer perceptrons
(MLPs). It consists of a 256-by-64 custom SRAM array, a row decoder, and a read
periphery including a 3.46-bit (11-level) flash ADC. Two modes of operations exist
for XNOR-SRAM. In the memory mode, it performs row-by-row digital read and
write as regular memory circuits. In the XNOR mode, it performs in-memory MAC
computation with all rows asserted simultaneously.
Fig. 4.3(b) shows the 12T SRAM bitcell proposed in (Jiang et al. (2018)). T1-T6
form the conventional 6T SRAM bitcell; T7-T10 form complementary pull-up/-down
circuits for the XNOR mode; T11 and T12 can power-gate the pull-up/-down circuits,
saving power when the corresponding column is not needed for computation. In each
bitcell, a binary weight (+1/-1) is stored in the 6T SRAM and the input signal (+1/-1/0)
is represented by four RWL signals (RWL_P, RWL_N, RWLB_P, and RWLB_N), as
shown in Fig. 4.3(c). When T11 and T12 are on, pull-up or pull-down paths are
formed depending on the product of weight and input [see Fig. 4.3(c)]. For example,
when the product is +1, one strong pull-up path and one weak pull-up path are formed;
when the product is -1, one strong pull-down path and one weak pull-down path are
formed.
Parallel pull-up and pull-down paths from all bitcells (controlled by bitwise XNOR
outputs) in a column form a voltage divider, where RBL is the output node. VRBL
is a monotonic function of XNOR bitcount [see Fig. 4.3(d)]; therefore, we can obtain
the XAC results by digitizing VRBL with the ADC. Due to the relatively large ADC
area overhead, we share one ADC across 64 columns. For a given binary/ternary
256-dimension input vector, a 6-to-64 column decoder along with 64-to-1 analog mul-
tiplexer [see Fig. 4.3(a)] is employed to cycle through all the 64 columns in 64 cycles.
When a column is selected, REN/RENB are set to VDD/0 and the pull-up/-down
paths are connected at RBL; REN/RENB are set to 0/VDD for other 63 columns,
where both T11 and T12 are turned off, breaking the short-circuit path between
pull-up/-down paths. In one cycle, XNOR-SRAM supports computation with binary
weights (+1/-1) and binary inputs (+1/-1 or +1/0) as well as ternary inputs (+1/0/-1).
The embedded ADC plays a key role in speed and DNN accuracy. Employing 11 levels
(3.46 bit) reportedly provides relatively high accuracy, and nonlinear quantization
based on the statistical distribution of XAC values can further improve it. Fig. 4.3(d)
shows the measured VRBL for different XAC values. In the nonlinear quantization
scheme, the worst-case 3-σ deviation is equivalent to 1.78 LSB. Note that the
deviation is smaller at other VRBL values, e.g., 0.83 LSB when VRBL is 0.25
VDD.
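The relationship between the XNOR bitcount, the RBL voltage, and the 11-level ADC can be illustrated with an idealized model. The sketch below is a strong simplification and not the measured behavior: it assumes binary inputs, identical linear pull-up and pull-down paths, a normalized supply, and uniform ADC thresholds, whereas the design described above uses strong/weak paths and nonlinear reference voltages. All names are hypothetical.

```python
import math

VDD = 1.0          # normalized supply voltage (assumption)
N_ROWS = 256       # rows per column, as in the 256-by-64 array
ADC_LEVELS = 11    # number of flash ADC levels reported in the text


def vrbl_ideal(n_plus, n_minus):
    """Idealized RBL voltage: n_plus identical pull-ups against n_minus identical pull-downs."""
    return VDD * n_plus / (n_plus + n_minus)


def quantize_uniform(v, levels=ADC_LEVELS):
    """Uniform quantizer over [0, VDD]; a stand-in for the nonlinear references of the real ADC."""
    code = int(v / VDD * levels)
    return min(code, levels - 1)


# Effective resolution of an 11-level converter, in bits.
print(round(math.log2(ADC_LEVELS), 2))  # 3.46

# Monotonicity: more +1 products never lower the idealized RBL voltage.
voltages = [vrbl_ideal(k, N_ROWS - k) for k in range(N_ROWS + 1)]
assert all(a <= b for a, b in zip(voltages, voltages[1:]))
```

In this simplified picture the ADC code is a coarse, monotonic function of the bitcount, which is why a lookup table at the periphery can map each code back to a representative XAC value.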
4.4 Practical Challenges of In-Memory Computing-Based Accelerators
Although XNOR-SRAM as well as other in-memory computing SRAMs show
promising energy efficiency at the single-array-level, there are several important chal-
lenges and missing pieces toward building a chip-level DNN accelerator using these
arrays.
1. Integration of Many In-Memory Computing SRAM Arrays
First, it should be noted that XNOR-SRAM as well as most of the in-memory
computing hardware only demonstrated a relatively small custom SRAM ar-
ray (around a few hundred kb or less). Therefore, to implement an overall
DNN accelerator, many of these in-memory computing SRAM arrays need to
be employed and integrated together with on-chip communication networks.
2. ADC Overhead and Offset Cancellation
Second, the ADC design incurs area and power overhead, which adversely af-
fects the array efficiency and energy efficiency. In addition, ADC performance
is sensitive to offset or variability, and for DNN applications, evaluation should
be made on how much DNN accuracy degradation occurs due to the variabil-
ity. Ideally, the ADC offsets should be calibrated out using offset compensa-
tion circuits (Chen et al. (2009)), but such calibration circuits will increase the
area/power further.
3. Postprocessing Modules
Third, toward evaluating accuracy for CNNs or DNNs, these designs employ a
considerable amount of postprocessing modules, such as partial sum accumu-
lation, batch normalization, pooling, nonlinear activation, and so on. These
postprocessing modules add system-level design complexity and energy con-
sumption beyond those of the single in-memory computing SRAM array.
4. Activation Storage and Communication
Fourth, typically, the in-memory computing SRAM arrays store the DNN weights,
while the activations are applied as inputs to the custom SRAM array either at
the WLs or BLs. Different activation values need to be applied to the SRAM
array every cycle, which means that the activations need to be stored in a sepa-
rate memory (e.g., another SRAM array) and communicated to the in-memory
computing SRAM array at the right time. However, the energy of activation
storage and communication is typically not included in the reported in-memory
computing SRAM macro energy values.
5. Write Energy
Finally, since state-of-the-art DNNs can be very large, we will not be able
to store all the weights of a DNN in in-memory computing SRAMs without
consuming a huge amount of area and static power. To that end, it will be
necessary to reload different weights into the in-memory computing SRAM arrays at
different times. While the XAC operation of in-memory computing SRAMs can
be fully parallelized by turning on all the rows, the write operation that loads new
weights into the SRAM still requires row-by-row operation, which incurs long
latency and consumes write energy. Typically, the reported in-memory computing
SRAM macro results do not include any write energy, but for a system-level
accelerator, the corresponding write energy of the in-memory computing SRAM
macros should be characterized and included.
4.5 Microarchitecture of the Proposed Accelerator
This section describes how a number of these key challenges have been addressed
together in the proposed accelerator design and reports the accelerator
evaluation results.
4.5.1 Microarchitecture Overview
We designed the microarchitecture of the Vesti accelerator, which integrates the
aforementioned key missing pieces, and can execute inference for a wide range of
DNNs/CNNs. Fig. 4.4(a) shows the overall microarchitecture that computes a deep
CNN inference on a layer-by-layer basis. Employing the double-buffering scheme
(Sancho and Kerbyson (2008)), it consists of two symmetric cores, one of which per-
forms the in-memory computation, while the other can load new weights from an
on-chip global buffer or an off-chip DRAM. Each block consists of: 1) the ensemble of
the XNOR-SRAM macros that perform XAC from convolution and fully connected
layers and 2) a digital ALU that performs other operations, such as batch normalization,
activation, and max-pooling. In this work, 36×2 XNOR-SRAM macros
are employed (see Fig. 4.4) to support representative CNNs for CIFAR-10 data set
(Hubara et al. (2016)). To support larger CNNs, the XNOR-SRAM macros can be
time-multiplexed and reused over time. At 1 V, the XNOR-SRAM macro can operate
at 1.2 GHz and the digital ALU was synthesized and placed/routed at 0.55 GHz in
the same 65-nm CMOS technology.
The inputs and weights are fetched from off-chip DRAM. The inputs are saved in
the activation memory buffer and weights are written into the XNOR-SRAM macros.
Figure 4.4: (a) Overall microarchitecture of the proposed in-memory computing
accelerator. (b) Computations for the thermometer-to-binary conversion, LUT, batch
normalization, and so on. (c) Block diagram of the activation memory buffer.
The 36 XNOR-SRAM macros in each core are divided into four groups. The four
groups share the input activations, performing XAC operations for up to 256 (=64×4)
output feature maps in parallel. The nine XNOR-SRAM macros in each group can
accept inputs from up to 256 input feature maps when performing 3×3 convolution
and up to 2304 (=256×9) inputs when performing fully connected matrix vector
multiplication. In each cycle, up to 36 256-input XAC operations can be executed
in parallel. The outputs of 36 XNOR-SRAM macros are processed by a 256-way
digital ALU, where ADC output decoding, partial sum accumulation, max-pooling,
batch normalization, and binary/ReLU activation are performed. The results will be
saved back to the activation memory buffer of the other core, which will perform
computation for the ensuing layer in a similar fashion.
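The capacity figures quoted above follow directly from the array dimensions and the grouping of macros. The following sketch reproduces that arithmetic; the constant names are illustrative, and the weight capacity per core is a derived quantity that assumes one binary weight per bitcell.

```python
ROWS_PER_MACRO = 256      # inputs per XAC column
COLS_PER_MACRO = 64       # output columns per macro
GROUPS_PER_CORE = 4       # groups sharing the same input activations
MACROS_PER_GROUP = 9      # one macro per 3x3 kernel position
CORES = 2                 # two symmetric, double-buffered cores

macros_per_core = GROUPS_PER_CORE * MACROS_PER_GROUP           # 36
total_macros = CORES * macros_per_core                          # 72
max_output_maps = GROUPS_PER_CORE * COLS_PER_MACRO              # 256
max_fc_inputs = MACROS_PER_GROUP * ROWS_PER_MACRO               # 2304
weight_bits_per_core = macros_per_core * ROWS_PER_MACRO * COLS_PER_MACRO

print(macros_per_core, total_macros, max_output_maps, max_fc_inputs)
print(weight_bits_per_core)  # binary weights that fit in one core at a time
```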
Double-buffering technique (Sancho and Kerbyson (2008)) is employed to hide the
weight loading latency. While one core is performing computation for one layer, the
other core loads the weights for the next layer. For example, for two adjacent
convolution layers whose channel size and map size are 256 and 16×16, respectively,
the 36 XNOR-SRAM macros in one core complete the convolution in 256 cycles;
at the same time, the 36 XNOR-SRAM macros in the other core load the weights
for the next layer in 256 cycles (row by row). Fig. 4.5 shows the timing diagram
that illustrates these operations for two adjacent layers of a CNN. This layer-by-layer
operation continues until all the layers of a given DNN/CNN have been processed.
Figure 4.5: Timing diagram of the accelerator operation for two adjacent layers of
a CNN, including in-memory computing, double-buffering, and peripheral computations.
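The benefit of overlapping weight loading with computation can be estimated with a simple cycle-count model. The sketch below is an illustration under stated assumptions: each layer is described only by a (load cycles, compute cycles) pair, peripheral latencies are ignored, and the function names are hypothetical rather than part of the accelerator design.

```python
def serial_cycles(layers):
    """Total cycles when every layer first loads its weights and then computes."""
    return sum(load + compute for load, compute in layers)


def double_buffered_cycles(layers):
    """Total cycles when the next layer's weight load overlaps the current compute."""
    total = layers[0][0]  # the first weight load cannot be hidden
    for i, (_, compute) in enumerate(layers):
        next_load = layers[i + 1][0] if i + 1 < len(layers) else 0
        total += max(compute, next_load)
    return total


# Two adjacent layers from the example in the text: 256 load and 256 compute cycles each.
layers = [(256, 256), (256, 256)]
print(serial_cycles(layers))           # 1024
print(double_buffered_cycles(layers))  # 768
```

When the load and compute times of adjacent layers are equal, as in this example, the weight loading of every layer except the first is fully hidden.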
4.5.2 Multibit Activation Support
While the XNOR-SRAM macro has native support on binary activations and
binary weights, it should be noted that DNNs with binary activations and binary
weights do not yet reach the same accuracy level of their higher precision counterparts
(Rastegari et al. (2016), Hubara et al. (2016)). Therefore, in our proposed Vesti
accelerator, we support weights with binary precision (+1 or -1) but activations with
configurable precision from 1 to 4 bit. This scheme will not only minimize the weight
memory footprint but can also reach the level of DNN accuracies with floating-point
precision (Zhou et al. (2016)).
Our choice on binary weight and multibit activation is based on the algorithm
level experiments that we conducted. In particular, we swept: 1) several CNN sizes
(various numbers of feature maps per layer) for the CIFAR-10 data set and 2) activation
precision values including binary, ternary, 2, 4, 8, and 32 bit. The results are
shown in Fig. 4.6. 1× CNN represents the network of input-128C3-128C3-MP2-
256C3-256C3-MP2-512C3-512C3-MP2-1024FC-1024FC-10FC, which was presented
in (Hubara et al. (2016)). Here, 128C3-128C3 refers to the convolution layer with
128 input feature maps, 3×3 kernels, and 128 output feature maps, MP2 refers to
2×2 max-pooling, and 1024FC refers to the fully connected layer with 1024 hidden
neurons. 0.5× CNN represents the network input-64C3-64C3-MP2-128C3-128C3-MP2-
256C3-256C3-MP2-512FC-512FC-10FC, where the number of feature maps in all
convolution layers is reduced by half compared to those in the 1× CNN, and the number
of hidden neurons in the fully connected layers is reduced by half as well. Similarly,
0.25× CNN and 0.125× CNN represent the networks where all dimensions are reduced
by 4× and 8×, respectively, compared to the 1× CNN.
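The naming convention for the width-scaled networks can be made concrete with a small generator. This is a sketch for illustration only: the function name `cnn_descriptor` is hypothetical, and it assumes that the 10-class output layer is never scaled, which is consistent with the two descriptors written out in the text.

```python
def cnn_descriptor(scale):
    """Build the layer string for a width-scaled variant of the 1x CIFAR-10 CNN."""
    conv_widths = [128, 128, 256, 256, 512, 512]
    fc_width = 1024
    parts = ["input"]
    for i, width in enumerate(conv_widths):
        parts.append(f"{int(width * scale)}C3")
        if i % 2 == 1:          # 2x2 max-pooling after every second convolution layer
            parts.append("MP2")
    parts += [f"{int(fc_width * scale)}FC"] * 2
    parts.append("10FC")        # the output layer keeps its 10 classes
    return "-".join(parts)


for scale in (1, 0.5, 0.25, 0.125):
    print(scale, cnn_descriptor(scale))
```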
It can be seen in Fig. 4.6 that the accuracy of models using floating-point activa-
tion precision could be reached by employing 3-/4-bit activation precision with binary
weights. However, employing binary activations does show considerable degradation in
CNN accuracy. To that end, we use in-memory computing hardware with fixed binary
weights, but employ configurable precision for the inputs and neuron activations at
the periphery of the XNOR-SRAM array.
At the microarchitecture level, multibit inputs are supported by performing
an XAC operation for each bit of the input/activation using the XNOR-SRAM macros
and then shifting and accumulating the bitwise XAC results in the digital periphery over
multiple cycles (e.g., N cycles for N-bit activation precision). This is shown in
Fig. 4.4(b) together with other digital computations at the periphery. Configurable
precision (from 1 to 4 bit) for activations can be flexibly supported at the cost of
additional clock cycles.
Figure 4.6: Classification accuracy for CIFAR-10 data set is shown across different
activation precision values for four different DNN sizes. For all data points, the weight
precision is binary (only two values of +1 or -1).
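The bit-serial scheme described above can be sketched in software as follows. The example assumes unsigned N-bit activations whose bit-planes take values 0 or 1, binary weights of +1 or -1, and an ideal (unquantized) XAC result; the ADC quantization of the real macro is ignored and the function names are hypothetical.

```python
def xac_bit_plane(weights_pm1, bits01):
    """One in-memory pass: +1/-1 weights against a single 0/1 activation bit-plane."""
    return sum(w * b for w, b in zip(weights_pm1, bits01))


def bit_serial_dot(weights_pm1, activations, n_bits):
    """Shift-and-accumulate the per-bit XAC results, LSB first, over n_bits passes."""
    acc = 0
    for bit in range(n_bits):
        plane = [(a >> bit) & 1 for a in activations]
        acc += xac_bit_plane(weights_pm1, plane) << bit
    return acc


weights = [1, -1, 1, -1]
activations = [5, 3, 15, 0]          # 4-bit unsigned activations
direct = sum(w * a for w, a in zip(weights, activations))
assert bit_serial_dot(weights, activations, n_bits=4) == direct
print(direct)  # 17
```

Each additional activation bit costs one more pass through the array, which is the source of the extra clock cycles mentioned above.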
4.5.3 Activation Memory
Activation memory in each core is employed to store the input feature maps for
that core and to save the output activations from the other core. Take a 256C3-256C3
convolution layer as an example. This convolution layer consists of 256 input feature
maps, each of 32×32 pixels. The input feature maps stored in an activation memory
need to be read out to perform convolution. 256×3×3 pixels should be fetched out and
presented to the 36 XNOR-SRAM macros at every cycle. A relatively large FIFO-
based buffer between the activation memory and XNOR-SRAM array can reduce
activation memory access by exploiting convolutional data reuse. However, this will
result in considerable power consumption in the buffer and significant area for the
control logic.
To get rid of the buffer, we propose a new way to store the feature maps in the
activation memory. In particular, we divide the overall activation memory into nine
activation SRAM blocks of 128 rows and 256 columns, as shown in Fig. 4.7. The
input feature maps are divided into 3×3 tiles. The input feature map pixels are
grouped and stored in the nine SRAM blocks according to their position in the 3×3
tile they belong to. By storing pixels in this fashion, we can read all the 256×3×3
pixels in a single cycle given the fact that any 3×3 patch of input feature maps is
now stored in nine different SRAM blocks. A controller block is designed to generate
corresponding addresses for the nine SRAM blocks.
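A minimal sketch of such a storage scheme is given below. The bank numbering (row-major position inside the 3×3 tile) and the row address formula are assumptions chosen for illustration; the text only states that pixels are grouped by their position within the tile. The check at the end confirms the property the scheme relies on: every 3×3 window touches each of the nine banks exactly once.

```python
def bank_and_row(r, c, width):
    """Map pixel (r, c) to (bank index, row address) for a nine-bank layout."""
    tiles_per_row = -(-width // 3)                 # ceil(width / 3)
    bank = (r % 3) * 3 + (c % 3)                   # position inside its 3x3 tile
    row = (r // 3) * tiles_per_row + (c // 3)      # index of the tile containing the pixel
    return bank, row


def window_banks(r0, c0, width):
    """Banks touched by the 3x3 window whose top-left pixel is (r0, c0)."""
    return [bank_and_row(r0 + i, c0 + j, width)[0] for i in range(3) for j in range(3)]


WIDTH = 16
for r0 in range(WIDTH - 2):
    for c0 in range(WIDTH - 2):
        # Nine distinct banks, so the whole window can be read in one cycle.
        assert sorted(window_banks(r0, c0, WIDTH)) == list(range(9))
```

Under this assumed addressing, a 32×32 map needs 11×11 = 121 row addresses per bank, which would fit within the 128 rows of each activation SRAM block.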
As shown in Fig. 4.4(c), the activation memory buffer has four parts: 1) coor-
dinate generator; 2) address decoder; 3) rewiring logic; and 4) XNOR-SRAM input
interpreter module, along with the 18 SRAM blocks. The functionalities of these
blocks are described in more detail in the following.
1. Coordinate Generator: The coordinate generator block generates the x- and
y-coordinates of the output map pixels sequentially in row-major order. When
the convolutional layer is followed by a max-pooling layer, the coordinates
corresponding to each pooling window are generated in row-major order
inside that pooling window to ease the buffer size requirement of the max-pooling
operation. These coordinates serve as inputs to the address decoder and
rewiring logic blocks, which generate different combinations of row addresses and
read enable signals for the SRAM blocks.
2. Address Decoder: The address decoder block generates activation memory addresses
for the nine SRAM blocks according to the output feature map coordinates.
Zero padding can be supported smoothly by the address decoder as well.
When the generated addresses are invalid for the input feature maps, the
corresponding read enable signals are deasserted and the SRAM
output is replaced with zero values. Since the feature map size and channel size vary from
layer to layer, we further divide each 256-bit-word-length SRAM block into two
128-bit-word-length SRAM blocks. For some layers (e.g., map size = 32×32
and channel size = 128), we concatenate these two in the depth direction to form
a deeper SRAM block with 128-bit word length; for some layers (e.g., map size
= 16×16 and channel size = 256), we concatenate these two in the word direction
to form a wider SRAM block with 256-bit word length.
Figure 4.7: Illustration of convolution layer feature map storage and access scheme
in nine independent SRAM arrays. (a) 3×3 window starting from (0,0). (b) 3×3
window starting from (0,1). (c) 3×3 window starting from (2,2).
3. Rewiring Logic: As shown in Fig. 4.7, although we can always fetch any 3×3
patch in input feature maps from nine different SRAM blocks at the same time,
the order of the nine SRAM outputs does not always align with the row-major
order in each 3×3 patch. We need to rewire the SRAM outputs to make sure
the 3×3 patches are presented to the XNOR-SRAM macros in the correct order.
There are nine different rewiring patterns in total, depending on the row and column
offsets of the 3×3 patch modulo 3. Fig. 4.7 shows three different
patterns with an example of 6×6 feature maps, where the patch row and column
offset is (0, 0), (0, 1), and (2, 2), respectively.
4. XNOR-SRAM Input Interpreter: Depending on whether we operate the XNOR-SRAM
in 1-bit binary activation (+1/-1) or multibit binary activation (+1/0)
mode, the XNOR-SRAM input interpreter generates the proper WL inputs for the
XNOR-SRAM array from what it receives from the rewiring logic.
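The roles of the address decoder (zero padding) and the rewiring logic can be sketched together in a few lines. The sketch assumes that the nine SRAM blocks are numbered in row-major order by pixel position inside a 3×3 tile, i.e., bank = (row mod 3)·3 + (column mod 3); this numbering and the function names are illustrative assumptions, not the implemented logic.

```python
def rewiring_pattern(r0, c0):
    """Source bank for each patch position (row-major) of the window at (r0, c0)."""
    return [((r0 + i) % 3) * 3 + ((c0 + j) % 3) for i in range(3) for j in range(3)]


def fetch_patch(fmap, r0, c0):
    """Read a 3x3 patch in row-major order, substituting zeros outside the map."""
    h, w = len(fmap), len(fmap[0])
    patch = []
    for i in range(3):
        for j in range(3):
            r, c = r0 + i, c0 + j
            valid = 0 <= r < h and 0 <= c < w   # models the read-enable signal
            patch.append(fmap[r][c] if valid else 0)
    return patch


# 6x6 example map, as in the figure; pixel value = row * 6 + column.
fmap = [[r * 6 + c for c in range(6)] for r in range(6)]

print(rewiring_pattern(0, 0))    # [0, 1, 2, 3, 4, 5, 6, 7, 8]
print(rewiring_pattern(0, 1))    # [1, 2, 0, 4, 5, 3, 7, 8, 6]
print(fetch_patch(fmap, -1, -1)) # [0, 0, 0, 0, 0, 1, 0, 6, 7]

# Only the offsets modulo 3 matter, so there are nine distinct patterns.
patterns = {tuple(rewiring_pattern(r, c)) for r in range(6) for c in range(6)}
assert len(patterns) == 9
```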
4.5.4 Mapping of Convolution, Fully Connected, and Other Layers
For convolution layers, we propose a mapping scheme in which the weights at the same
kernel position (x, y), taken from all kernels for the different input/output feature maps,
are stored in the same XNOR-SRAM array. Weights at other kernel positions
are stored in different XNOR-SRAM arrays. This is shown in Fig. 4.8 (left). Then,
the XAC results are gathered from the multiple XNOR-SRAM arrays and accumulated
together to obtain the final output activation result. This scheme enables extensive
reuse of the activations, with weights being stationary at the XNOR-SRAM arrays.
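The equivalence between a 3×3 convolution and the sum of nine per-position partial sums can be verified with a small example. The sketch below uses tiny, arbitrary dimensions and random +1/-1 values; each dictionary entry plays the role of one array holding the weights of one kernel position, with rows indexed by input feature map and columns by output feature map. The names and dimensions are assumptions for illustration only.

```python
import random

random.seed(1)
C_IN, C_OUT, H, W = 4, 3, 5, 5

# weights[co][ci][ky][kx] and inputs[ci][y][x] take values in {+1, -1}.
weights = [[[[random.choice((-1, 1)) for _ in range(3)] for _ in range(3)]
            for _ in range(C_IN)] for _ in range(C_OUT)]
inputs = [[[random.choice((-1, 1)) for _ in range(W)] for _ in range(H)]
          for _ in range(C_IN)]


def conv_direct(y, x, co):
    """Reference 3x3 convolution output at (y, x) for output map co (no padding)."""
    return sum(weights[co][ci][ky][kx] * inputs[ci][y + ky][x + kx]
               for ci in range(C_IN) for ky in range(3) for kx in range(3))


# One "array" per kernel position: rows index input maps, columns index output maps.
arrays = {(ky, kx): [[weights[co][ci][ky][kx] for co in range(C_OUT)]
                     for ci in range(C_IN)]
          for ky in range(3) for kx in range(3)}


def conv_mapped(y, x, co):
    """Accumulate the nine per-position partial sums for the same output."""
    total = 0
    for (ky, kx), array in arrays.items():
        total += sum(array[ci][co] * inputs[ci][y + ky][x + kx] for ci in range(C_IN))
    return total


assert all(conv_direct(y, x, co) == conv_mapped(y, x, co)
           for y in range(H - 2) for x in range(W - 2) for co in range(C_OUT))
```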
Figure 4.8: Mapping convolution layers (left) and fully connected layers (right)
of deep CNNs onto the proposed accelerator employing XNOR-SRAM macros with
in-memory computing.
Using the weight-stationary scheme in the XNOR-SRAM macros, it is straightfor-
ward to map fully connected layers of DNNs, where neurons/activations are in vectors
and weights are in matrices. This nicely maps to the row drivers for activations and
weights stored in the SRAM. For the fully connected layers whose size is larger than
256×64, we break the large weight matrix into a number of small submatrices and
accumulate the matrix-vector multiplication results accordingly. This is shown in Fig.
4.8 (right).
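A minimal sketch of this submatrix decomposition is given below. It assumes that one array holds 256 inputs (rows) by 64 outputs (columns); the 512×128 layer, the function names, and the random ±1 data are illustrative only.

```python
import random

ROWS, COLS = 256, 64  # inputs (rows) and outputs (columns) held by one array

def matvec(weights, x):
    """Reference y[j] = sum_i weights[i][j] * x[i]."""
    return [sum(weights[i][j] * x[i] for i in range(len(x)))
            for j in range(len(weights[0]))]

def tiled_matvec(weights, x, rows=ROWS, cols=COLS):
    """Split the weight matrix into rows x cols tiles and accumulate tile outputs."""
    n_in, n_out = len(weights), len(weights[0])
    y, tiles = [0] * n_out, 0
    for r0 in range(0, n_in, rows):          # tiles along the input dimension
        for c0 in range(0, n_out, cols):     # tiles along the output dimension
            tiles += 1
            for j in range(c0, min(c0 + cols, n_out)):
                y[j] += sum(weights[i][j] * x[i]
                            for i in range(r0, min(r0 + rows, n_in)))
    return y, tiles

random.seed(0)
W = [[random.choice([-1, 1]) for _ in range(128)] for _ in range(512)]  # 512 x 128 layer
x = [random.choice([-1, 1]) for _ in range(512)]
y, tiles = tiled_matvec(W, x)
print(tiles, y == matvec(W, x))
```

Tiles that share the same output columns contribute partial sums that must be added, which corresponds to the accumulation across arrays described above.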
We implement other computation modules such as maxpooling, batch normal-
ization, and nonlinear activation with all-digital circuits, which will reside at the
periphery of XNOR-SRAM macros. The computing sequences of these modules are
as follows. Once the XAC operations are done inside the XNOR-SRAM array, the
results are digitized by the flash ADC, which outputs a thermometer code. The thermometer
code from each column gets converted to a binary number and a simple lookup table
(LUT) is used to bring it back to the exact bitcount value, according to the nonlinear
quantization scheme that was employed with regard to the ADC reference voltages.
Then, the same columns from different XNOR-SRAM arrays are accumulated (to
compute the summation of the partial sums of all convolution pixels), which then
sends out the final output sum value. Using the trained batch normalization parame-
ters, this final sum value goes through batch normalization and nonlinear quantization
(thresholding for binary/ternary activations and ReLU for multibit activations).
Finally, when the current layer's computation is completed, the activation outputs
are ready to serve as the input activations for the next layer of the DNN/CNN.
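The computing sequence above can be sketched for a single output column as follows. The LUT entries, the batch normalization parameters, and the 10-comparator thermometer codes are hypothetical values chosen for illustration; only the order of the steps follows the text.

```python
# Hypothetical 11-entry LUT: ADC level index -> bitcount value (finer steps near zero).
LUT = [-200, -100, -50, -25, -10, 0, 10, 25, 50, 100, 200]

def thermometer_to_index(code):
    """A thermometer code such as [1, 1, 1, 0, ...] encodes its number of ones."""
    return sum(code)

def column_output(thermo_codes, gamma, beta, mean, var, eps=1e-5):
    """Combine one column's codes from several arrays into a binary activation."""
    partial_sums = [LUT[thermometer_to_index(c)] for c in thermo_codes]  # LUT step
    total = sum(partial_sums)                                # accumulate across arrays
    bn = gamma * (total - mean) / (var + eps) ** 0.5 + beta  # batch normalization
    return total, (1 if bn >= 0 else -1)                     # binary thresholding

# Codes produced by the same column of three different arrays (illustrative).
codes = [[1] * 7 + [0] * 3, [1] * 5 + [0] * 5, [1] * 2 + [0] * 8]
print(column_output(codes, gamma=1.0, beta=0.0, mean=5.0, var=4.0))
```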
4.6 Experimental Results
4.6.1 Experiment Setup
The Vesti accelerator consists of two main parts: 1) multiple instances of XNOR-
SRAM arrays (including ADC peripheries) and 2) activation SRAMs and additional
digital logic/control. As reported in (Jiang et al. (2018)), the XNOR-SRAM array
itself consists of a custom SRAM array and peripheral mixed-signal circuits, which are
designed and laid out manually. The remaining modules, including activation SRAMs
and additional digital logic/control, are implemented through the standard cell-based
design flow. Activation SRAMs are off-the-shelf 6T SRAMs and are obtained from
industrial memory compiler. Digital logic/control modules are implemented in RTL,
synthesized using Synopsys Design Compiler, and placed and routed using Cadence
Innovus tool in the same 65-nm CMOS technology.
We characterized the power/energy consumption, throughput, and accuracy (for
MNIST and CIFAR-10 data sets) of the Vesti accelerator. For all the MAC or XAC
operations in the convolution and fully connected layers, the XNOR-SRAM test-chip
measurement results from (Jiang et al. (2018)) are used. For all other digital logic
and off-the-shelf SRAMs, the power consumption results are obtained from Synopsys
PrimeTime simulation of the postlayout netlist with RC parasitics and actual data
switching activity information.
We considered MLPs and CNNs for image classification tasks for MNIST and
CIFAR-10 data sets, respectively. For MNIST, an MLP with three hidden layers, each
with 256 neurons, is used. The CIFAR-10 CNN architecture used in this section is
adopted from the CNN reported in (Hubara et al. (2016)), consisting of six convolution
layers and three fully connected layers, which is identical to the 1× CNN that was
used in Section V-B except that the last two convolution layers have 256 feature
maps. This represents the network of input-128C3-128C3-MP2-256C3-256C3-MP2-
256C3-256C3-MP2-1024FC-1024FC-10FC.
The ADC employed in (Jiang et al. (2018)) was an 11-level flash ADC, which
nonlinearly quantized the analog BL voltage to produce approximate partial XAC
values. Since the distribution of partial XAC values was found to be concentrated
around zero, finer-grain quantization has been employed around zero XAC values in
(Jiang et al. (2018)). The effect of noise/offset for the ADC, as well as the process
variation of XNOR-SRAM bitcells, contributes to the 3-σ spread of the analog read
BL voltage for the same XAC values in Fig. 4.3(d).
To evaluate the accuracy of binary-weight multibit-activation MLPs and CNNs on
Vesti, we first obtained a probabilistic model for the XNOR-SRAM XAC and
quantization operations from XNOR-SRAM chip measurements. Specifically, we used a
total of 656k (513 XAC values × 64 columns × 20 samples/XAC/column) random
test vectors and measured the outputs of the XNOR-SRAM to build the XNOR-SRAM's
probabilistic model as a function of XAC value. We simulated BNN model accuracy
on Vesti by running software simulations where we stochastically quantized all the
256-input XAC partial sums according to the measured probabilistic model. Offset
calibration or other techniques that lower the ADC noise/offset can result in tighter
distribution of the probabilistic model, which will result in better DNN accuracy.
Note that the probabilistic model was characterized based on single-array measure-
ment. Array-to-array variation could potentially result in further accuracy loss, while
more accurate array-by-array ADC calibration could alleviate the loss. In addition,
a hardware-variation-aware network retraining algorithm (Liu et al. (2015)) could
potentially help minimize the accuracy loss due to the array-to-array variation.
4.6.2 Area, Energy, and Throughput
In Fig. 4.9, the placed-and-routed layout of all digital peripheral blocks and the 72
XNOR-SRAM arrays is shown. The total area of the Vesti accelerator is 15 mm² in
the 65-nm CMOS process. The width and height of the different modules that comprise
the accelerator are shown as well. The XNOR-SRAM arrays occupy 54% of the
total area, the activation SRAMs occupy 18%, and the remainder of the area (28%) is
occupied by digital logic and control modules.
For MNIST MLP, we simulated and evaluated the total MLP energy for activation
precisions of 1–3 bits in Fig. 4.10. The XNOR-SRAM macro energy is acquired
from chip measurements (Jiang et al. (2018)), and energy for other digital components
is obtained from postlayout simulation with data switching activity. Since activations
with N-bit precision consume N cycles to compute using the XNOR-SRAM array, the
overall energy roughly increases linearly with the activation precision. To perform a
single inference of the MLP using the 1-bit activation precision, the Vesti accelerator
consumes 21 cycles. At the 0.55-GHz clock frequency, which the Vesti accelerator can
operate at, the throughput of 26M inferences per second is achieved. Fig. 4.10 shows
the energy breakdown across various activation precisions.
Figure 4.9: Including the XNOR-SRAM prototype chip layout, the layouts of the
activation memory buffer/controller, accumulation, and batch normalization
modules are shown.
Figure 4.10: Energy breakdown of the entire MLP designed for MNIST data set.
For CIFAR-10 CNN, we also simulated and evaluated the total CNN energy for
activation precisions of 1–3 bits for two CNN sizes (1× and 0.5×, as described
in Section V-B) in Fig. 4.11. Since activations with N-bit precision consume N
cycles to compute using the XNOR-SRAM array, the overall energy roughly increases
linearly with the activation precision. To perform a single inference of the
CIFAR-10 CNN using the 1-bit activation precision, the Vesti accelerator consumes
1676 cycles. At the 0.55-GHz clock frequency at which the Vesti accelerator can
operate, this corresponds to a throughput of 328K inferences per second. A
comparison with prior works is summarized in Table 4.1.
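The reported throughput figures follow directly from the clock frequency and the cycle counts quoted in this section (21 cycles for the MNIST MLP and 1676 cycles for the CIFAR-10 CNN, both with 1-bit activations). The short check below uses only those numbers; the function name is illustrative.

```python
CLOCK_HZ = 0.55e9  # clock frequency quoted in the text

def inferences_per_second(cycles_per_inference, clock_hz=CLOCK_HZ):
    """Throughput when one inference takes a fixed number of clock cycles."""
    return clock_hz / cycles_per_inference

mnist_ips = inferences_per_second(21)     # MLP, 1-bit activations
cifar_ips = inferences_per_second(1676)   # CNN, 1-bit activations
print(f"MNIST MLP:    {mnist_ips / 1e6:.1f}M inferences/s")
print(f"CIFAR-10 CNN: {cifar_ips / 1e3:.0f}K inferences/s")
```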
4.7 Conclusion
In-memory computing for DNNs and CNNs has been recently gaining significant
attention. This is because memory access is the main bottleneck to scaling delay and
energy dissipation of the MAC operation in digital DNN/CNN accelerators. Most in-
SRAM computing works that turn on all rows or columns simultaneously (Biswas and
Chandrakasan (2018), Khwa et al. (2018), Jiang et al. (2018)) have only implemented
a single custom SRAM array, which only computes a small portion of total MAC
operations in DNNs. In such prior works, the rest of the operations and overall
architecture to implement the DNN accelerator have not been shown or implemented
in hardware.
Figure 4.11: Energy breakdown of the entire CNN designed for CIFAR-10 data set.
Two different sizes of CNNs (1× and 0.5×) and three different activation precision
schemes (1-3 bit) are shown.
In this work, we substantially expanded the single-array-level prior XNOR-SRAM
work (Jiang et al. (2018)) toward a configurable DNN accelerator architecture that
integrates 72 XNOR-SRAM arrays. The proposed Vesti architecture features: 1)
methodologies to efficiently load/map weights onto such XNOR-SRAM arrays for
convolutional layers and fully connected layers of DNNs; 2) multibit activation mem-
ory storage and control; 3) double-buffering technique to hide the latencies of repro-
gramming in-memory computing SRAM arrays; and 4) interarray communication.
Due to these comprehensive designs, Vesti simultaneously achieves both high ac-
curacy and low energy for representative DNNs that are benchmarked for MNIST
and CIFAR-10 data sets. The Vesti accelerator presented in this chapter features
essential techniques for in-SRAM computing-based deep learning processors, which can
fit under the stringent power/energy envelopes of mobile, wearable, and IoT devices.
Table 4.1: Comparison with Prior Works
                           Moons and        Yin et al.,   This Work   Whatmough      Sancho and       Liu et al.,
                           Verhelst, 2016   2017                      et al., 2018   Kerbyson, 2008   2015
Tech.                      45nm             28nm          65nm        28nm           45nm             40nm
Model                      SNN              Conv          Conv/MLP    MLP            SNN              Conv
Voltage Supply             0.6-0.8V         0.6-0.8V      1V          0.6-1.1V       0.6-0.8V         0.55-1.1V
Clock Freq.                –                10MHz         550MHz      667MHz         –                204MHz
CIFAR10 Accuracy           83.41%           86.05%        88.6%       –              –                –
CIFAR10 FPS                1249             237           328K        –              –                –
CIFAR10 Energy/Prediction  163µJ            3.8µJ         23.3µJ      –              –                –
MNIST Accuracy             –                –             98.5%       98.5%          99.42%           99%
MNIST FPS                  –                –             8.6M        –              1K               13.4K
MNIST Energy/Prediction    –                –             0.012µJ     0.588µJ        108µJ            0.45µJ
Chapter 5
ARCHITECTURE BENCHMARK OF NEURO-INSPIRED COMPUTING
SYSTEM
This chapter presents an architecture-level benchmark of neuro-inspired computing
systems. State-of-the-art deep convolutional neural networks (CNNs) are widely used
in current intelligent systems and achieve remarkable success in image/speech
recognition and classification. A number of recent efforts have attempted to design
custom inference engines based on various approaches, including the systolic
architecture, near memory processing, and the processing-in-memory (PIM) approach
with emerging technologies such as resistive random access memory (RRAM). However,
a comprehensive comparison of these approaches in a unified framework is missing,
and the claimed benefits of new designs or emerging technologies are mostly based
on qualitative projections.
In this chapter, we evaluate the energy efficiency and frame rate of a VGG-like
CNN inference accelerator on the CIFAR-10 dataset across technological platforms
from CMOS to RRAM, under a hardware resource constraint, i.e., comparable on-chip
area. We also investigate the effects of off-chip DRAM access and interconnect
during data movement, which are the bottlenecks of CMOS platforms. Our
quantitative analysis shows that the peripheries (ADCs), rather than the memory
array, dominate the energy consumption and area of the RRAM-based parallel-readout
PIM architecture. Despite the presence of ADCs, the RRAM-based parallel-readout PIM
architecture shows >2.5× improvement in energy efficiency (TOPS/W) over systolic
arrays or near memory processing implemented with CMOS technologies, with a
comparable frame rate, thanks to the reduced DRAM access, optimized parallel
readout, and high throughput of the pipelined system. A further >10× improvement
can be achieved by implementing a bit-count-reduced XNOR network and pipelining.
5.1 Introduction
Recently, neuro-inspired computing has been adopted in a broad
range of applications and cloud services. In particular, deep convolutional neural
networks (CNNs) have shown remarkable breakthroughs in reducing the error rate for
tasks ranging from speech recognition to image classification. These state-of-the-art
CNNs require a large number of high-dimensional convolutional layers to build
more accurate models, which can require hundreds of megabytes of kernel weight
storage and tens of thousands of operations for each input pixel. It is well known
that the convolutions dominate the operations and runtime in CNNs, and the energy
consumption of data movement could be even higher than that of computation, es-
pecially for the off-chip DRAM access. Due to the high requirement of bandwidth
and power consumption for intensive data processing and communication, the con-
ventional von Neumann platforms (e.g. CPUs/GPUs and/or FPGAs) are inadequate
to guarantee high performance and energy efficiency (Von Neumann (1981)). Thus,
it is crucial to design specialized hardware accelerators to enable a high-degree of
parallelism for deep learning algorithms.
A number of recent efforts have attempted to design custom inference engines
across various technological platforms, such as application-specific-integrated-circuit
(ASIC) based on silicon CMOS and emerging technologies, employing various data-
processing approaches. Several custom neuromorphic hardware accelerators, such as
Eyriss (Chen et al. (2016)), Envision (Moons et al. (2017)) and (Bong et al. (2017))
have been developed, while Google also reported an ASIC chip “TPU” (Jouppi et al.
(2017)), which employs 8-bit integer matrix multipliers and systolic array based data
flow, and achieved an order-of-magnitude reduction in energy and area compared to
GPUs. DaDianNao (Chen et al. (2014)) adopts a near memory processing approach
with embedded DRAM, where the neural functional units are fed with input and
weight data from nearby buffers, such that the energy could be saved by maximizing
internal bandwidth and reducing external data communications.
In most CMOS-based accelerators, the inputs and weights are mainly stored in
SRAM buffers, which leads to low area efficiency (a single SRAM cell could occupy
>150F², where F is the technology feature size) and high latency (governed by
row-by-row access). Therefore, a more area- and energy-efficient approach called
processing-in-memory (PIM) has been proposed, where the memory is used not only to
store weight data but also to perform embedded computation. To achieve even higher
integration density, emerging nonvolatile memory (eNVM) devices (with cell sizes of
4∼12F²) have been proposed for PIM architectures (Yu (2018)).
Among the emerging technologies, resistive random access memory (RRAM)
can naturally support matrix-vector multiplication by exploiting its multiple
conductance states as analog synapses in a crossbar structure. Recent designs
such as ISAAC (Shafiee et al. (2016)), PRIME (Chi et al. (2016)), and PipeLayer
(Song et al. (2017)) demonstrate that RRAM-based PIM is a promising solution for
high energy efficiency with limited on-chip area. However, a comprehensive
comparison among different approaches such as the digital systolic array, digital
near-memory computing, and analog in-memory computing under the same design
assumptions and constraints is still missing in the literature. Therefore, the
trade-offs between inference accuracy, latency, and energy across different
technological platforms remain elusive. For example, are the claimed
orders-of-magnitude improvements in performance with analog in-memory computing
really achievable in practice?
In this work, we aim to answer these key questions by performing a holistic com-
parison between three representative architectures: TPU-like systolic array (Kung
(1980)), near memory processing with SRAM, and processing-in-memory with bi-
nary or multi-bit RRAM by modifying a circuit-level simulator named NeuroSim
(Chen et al. (2018a)), including overhead of the off-chip DRAM access. We evaluate
the energy efficiency and frame rate of these architectures by using a VGG-like CNN
for inference on CIFAR-10 dataset.
5.2 Systolic and Near Memory Processing Architecture Design
5.2.1 Systolic Architecture Design
Fig. 5.1 shows the digital systolic architecture. The main components are the
systolic matrix-multiply array and the accumulation, activation, and pooling units.
The global buffers (and global control) are used to store input and output data,
and the systolic setup unit arranges the data in systolic flow. The systolic
matrix-multiply array is built from an array of processing elements (PEs), each of
which can perform 8-bit multiply-and-accumulate (MAC) operations on signed or
unsigned integers.
Figure 5.1: The diagram of conventional systolic architecture.
As Fig. 5.1 shows, after the weights are stored in corresponding PEs, the input
data will be read out from global buffer and arranged in the systolic setup unit,
and then be sent to the PE array from left to right horizontally (shown in blue
arrows) in different cycles. The corresponding partial sums from each PE will then be
transferred vertically (shown in green arrows) from top to bottom in different cycles,
and accumulated in the accumulation units to obtain the outputs. Subsequently, the
outputs will be sent to the activation and/or pooling units according to the global
control signals. The final outputs will be sent back to the global buffer and be used
as the neuron activations for the next layer.
To achieve a fair comparison across different technological platforms, we enforce
an area constraint that all the architectures will have an upper-bound of area as that
of the RRAM based PIM architecture. Therefore, we limit the systolic PE array size
to be 256×128, with a 256KB SRAM global buffer to store the intermediate data.
Additional data fetches beyond the capacity of the global buffer will necessitate
off-chip DRAM access and cause additional energy overhead.
5.2.2 Near Memory Processing (NMP) Architecture Design with SRAM
In Fig. 5.2, the SRAM-based near memory processing (NMP) architecture is shown.
The architecture is mainly built from a group of neural functional tiles, a global
buffer that stores input and inter-layer neuron activations, and a global router that
sends instructions to each tile. Inside each neural functional tile, there are groups
of SRAM banks, which are used only to store weights, while the MAC, activation,
and pooling units perform the neural computation. During operation, the
router issues instructions to each tile, and the input data or neuron activations
are sent from the global buffer to the proper MAC units, while the weights are fetched
from adjacent SRAM banks to save energy in weight data movement.
Thus, to achieve minimum weight data movement, we have to sacrifice most of the
chip area to store the synaptic weights. Therefore, the number of MAC, activation
and pooling units are limited, which means these neural functional units are shared
by multiple neuron activations in a time-multiplexed fashion. Similarly, due to the
area constraint, i.e., an area comparable to that of the RRAM-based PIM architecture,
we implement only 1,152 SRAM banks, each of size 128×128. There are 512
activation and pooling units in total, and the global buffer is set to be 128KB based
on register files. Due to the limited hardware resource, the NMP design also needs
data communication with off-chip DRAM, which will consume additional energy.
5.3 Processing-In-Memory Architecture Design Based on RRAM
5.3.1 Pseudo-Crossbar Array Structure
Fig. 5.3 shows the principle of using pseudo-crossbar array based on 1-transistor-
1-resistor (1T1R) structure to naturally perform analog matrix-vector multiplication
(Yu (2018)). As the 2-terminal selector technology is premature for large-scale inte-
gration, using a transistor in each bit-cell could effectively eliminate the sneak paths.
Figure 5.2: The diagram of near memory processing (NMP) architecture, where the
SRAM banks are used to save weight data.
Figure 5.3: The diagram of pseudo-crossbar array, which performs analog matrix-vector
multiplication by accumulating currents through source-lines (SLs) naturally.
To perform the parallel computation, all the word lines (WLs) are enabled
simultaneously. The bit lines (BLs) are used to apply input voltages, which
represent the input vectors, and the RRAM cells represent “analog” synaptic
weights by exploiting the multi-level conductance states (Yu (2017)). Therefore,
the dot-product terms are the currents passing through the individual RRAM cells,
and the final weighted sums are accumulated vertically along each source line (SL).
To read out the weighted sums and further process them in the ensuing logic modules
(such as activation and pooling), an analog-to-digital converter (ADC) is required
at the end of the SLs to generate digital outputs. Because the ADC is relatively
large while the 1T1R cell is much smaller, it is not area-efficient to place a
full set of read peripheries (e.g., ADCs and accumulation circuits) underneath the
1T1R array. Thus, a Mux and a Mux Decoder are used to share the read peripheries
among multiple SLs: the Mux Decoder connects one group of SLs to the read
peripheries to read out their weighted-sum results, and then activates another
group, repeating the operation until all the SLs are read out. However, this SL
sharing increases the latency because time multiplexing is needed.
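As a rough functional sketch of this readout, the code below computes ideal SL currents for binary cells and lists which SLs are read in each time-multiplexed cycle. The read voltage and on/off resistances are the values quoted in Section 5.4.1, and the 8:1 sharing ratio is taken from Table 5.1; the assumption that adjacent SLs form one sharing group, the tiny 4×2 array, and the function names are illustrative only. Wire resistance and device nonidealities are ignored.

```python
V_READ = 0.5                        # read voltage (V), quoted in Section 5.4.1
G_ON, G_OFF = 1 / 100e3, 1 / 10e6   # conductances (S) for Ron = 100 kΩ, Roff = 10 MΩ

def crossbar_mvm(input_bits, weight_bits):
    """SL current j = sum_i V_i * G_ij, with V_i = V_READ for a 1 input bit, else 0."""
    n_sl = len(weight_bits[0])
    return [sum(V_READ * b * (G_ON if row[j] else G_OFF)
                for b, row in zip(input_bits, weight_bits))
            for j in range(n_sl)]

def mux_schedule(n_sl, share=8):
    """Assumed grouping: adjacent `share` SLs use one ADC; cycle k reads SL k of each group."""
    return [[g * share + k for g in range(n_sl // share)] for k in range(share)]

inputs = [1, 0, 1, 1]
weights = [[1, 0], [1, 1], [0, 0], [1, 1]]   # 4 rows x 2 source lines, binary cells
currents = crossbar_mvm(inputs, weights)
print(currents)

schedule = mux_schedule(128)                 # one 128-column sub-array
print(len(schedule), len(schedule[0]))       # read cycles, SLs read per cycle
```

With 128 SLs and an 8:1 sharing ratio, the schedule contains 8 sequential read cycles of 16 SLs each, which is the latency cost of SL sharing noted above.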
Figure 5.4: A mapping method of input data and kernels in convolutional layers to
the crossbar array.
5.3.2 Mapping Kernels in Crossbar Arrays
In the PIM architectures, the 3D element-wise convolution has to be transformed
to a matrix-vector multiplication, and the kernels will be mapped into crossbar arrays
as conductance.
A straightforward method to map the kernels is to unroll each 3D kernel into a
long vertical column, since the partial sums in a 3D kernel will be finally summed up
to get the output. Fig. 5.4 shows this conventional kernel-mapping strategy (Gokmen
et al. (2017)). In layer<i>, the size of the input feature maps (IFMs) is W×W×D, and
there are N kernels, each of size K×K×D. If we use zero-padding and a stride of
1 (to ensure that the output has the same first and second dimensions as the
input), the size of the output feature maps (OFMs) will be W×W×N. Each K×K×D kernel
is unrolled and mapped as conductances along one long SL in the crossbar array, so
that there are N such SLs in total.
To generate the OFMs, as Fig. 5.4 shows, the IFMs can be applied to the crossbar
array in the same way as “sliding over” kernels in the CNN algorithm. At each cycle,
a part of IFMs (shown as light to dark orange cubes representing groups of IFMs
in different cycles) will be unrolled and applied to the BLs, and multiplied with all
kernels, then the weighted sums will be accumulated through the SLs. Therefore, the
OFMs are generated along the 3rd dimension (across channels) at each cycle, and to
obtain the final OFM values, we need to process W×W cycles.
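The mapping can be sketched functionally as follows: each kernel becomes one column of length K×K×D, and each of the W×W cycles applies one unrolled, zero-padded input patch to all N columns. The data layouts (ifm[d][y][x], kernels[n][d][ky][kx]), the restriction to odd K, and the function names are assumptions for illustration; the analog behaviour of the crossbar is not modelled.

```python
def unroll_kernel(kernel):
    """kernel[d][ky][kx] -> flat list of length K*K*D (one crossbar column)."""
    return [w for plane in kernel for row in plane for w in row]

def unroll_patch(ifm, y, x, K):
    """Zero-padded K x K x D input patch centred on (y, x), in the same order."""
    W, pad = len(ifm[0]), K // 2
    return [plane[y + dy][x + dx] if 0 <= y + dy < W and 0 <= x + dx < W else 0
            for plane in ifm
            for dy in range(-pad, pad + 1) for dx in range(-pad, pad + 1)]

def crossbar_conv(ifm, kernels):
    """ifm[d][y][x] is W x W x D; kernels[n][d][ky][kx]; stride 1, odd K."""
    W, K = len(ifm[0]), len(kernels[0][0])
    columns = [unroll_kernel(k) for k in kernels]        # N columns (SLs)
    ofm = [[[0] * W for _ in range(W)] for _ in kernels]
    cycles = 0
    for y in range(W):
        for x in range(W):
            v = unroll_patch(ifm, y, x, K)               # inputs applied to the BLs
            for n, col in enumerate(columns):            # all N outputs in one cycle
                ofm[n][y][x] = sum(a * w for a, w in zip(v, col))
            cycles += 1
    return ofm, cycles

ifm = [[[y * 4 + x + 16 * d for x in range(4)] for y in range(4)] for d in range(2)]
identity = [[[0, 0, 0], [0, 1, 0], [0, 0, 0]], [[0] * 3 for _ in range(3)]]
ofm, cycles = crossbar_conv(ifm, [identity])
print(cycles, ofm[0] == ifm[0])
```

The example uses a kernel that simply copies the first input channel, so the output can be checked by inspection, and the cycle counter confirms the W×W cycles per layer stated above.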
In deep CNNs, since the size/depth of IFMs and OFMs could be large (thousands
by thousands), using a single crossbar array with extremely large size to implement
one convolutional layer could cause long latency and extra energy. Therefore, array
partitioning (Chen and Yu (2016)) is introduced to parallelize the computation across
multiple sub-arrays, and the partial sums from the sub-arrays are summed up
in extra accumulation units (e.g. adder tree) accordingly.
5.3.3 Chip Architecture Design
Fig. 5.5 shows the RRAM-based PIM architecture. There are groups of PIM
tiles. Similar to the SRAM-based designs, there is a global buffer to store input and
neuron activations, the accumulation, activation and pooling units to support neural
computation, and a global control unit to generate proper instructions to PIM tiles.
Inside each tile, there are multiple subarrays to store weights and compute matrix-
vector multiplication. The input register is used to load in input data from global
buffer, while the accumulation and output register are for summing up partial sums
from sub-arrays and store the final outputs, which will be sent back to the global
buffer.
Figure 5.5: The diagram of processing-in-memory (PIM) architecture based on
RRAM technology.
To implement a Y-bit analog synaptic weight, we could use N× M-bit cells (where
Y=N×M) as a group, such that the columns in the group represent the partial sums
from LSB to MSB. In this work, we choose three different RRAM cell
precisions to implement 8-bit weights, which are 8× 1-bit cells, 4× 2-bit cells and
2× 4-bit cells. The 8-bit fixed-point neuron activations are represented as eight sequential
input voltages through eight cycles. For each row, if the input vector bit is 1, then
the row will be selected for weighted-sum operation (read out), otherwise the row will
be skipped. To access all the cells on the selected row, the WLs are activated through
the WL decoder or switch matrix, and we use the shift-add and register modules to
shift and accumulate the partial sums of the 8-bit sequential inputs. Therefore, we
could eliminate the challenges of using digital-to-analog converter (DAC) to represent
the analog voltage in a small range and avoid the inaccuracy introduced by RRAM
nonlinear I-V relation (Chen et al. (2015)).
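The combination of multi-cell weights and bit-serial inputs can be checked with a short integer model. The sketch below assumes unsigned 8-bit activations and weights (sign handling is not described here) and an ideal column sum without ADC truncation; the function names are illustrative.

```python
import random

def split_weight(w, cell_bits, n_cells):
    """Split an unsigned weight into n_cells chunks of cell_bits each, LSB chunk first."""
    mask = (1 << cell_bits) - 1
    return [(w >> (cell_bits * i)) & mask for i in range(n_cells)]

def bit_serial_dot(activations, weights, cell_bits=2, weight_bits=8, act_bits=8):
    """Dot product using one cycle per activation bit and one column per weight chunk."""
    n_cells = weight_bits // cell_bits
    cells = [split_weight(w, cell_bits, n_cells) for w in weights]
    total = 0
    for t in range(act_bits):                        # cycle t applies activation bit t
        row_on = [(a >> t) & 1 for a in activations]
        for c in range(n_cells):                     # column c holds weight chunk c
            col_sum = sum(cells[i][c] for i, on in enumerate(row_on) if on)
            total += col_sum << (t + cell_bits * c)  # shift-add of the partial sum
    return total

random.seed(1)
acts = [random.randrange(256) for _ in range(16)]
wts = [random.randrange(256) for _ in range(16)]
expected = sum(a * w for a, w in zip(acts, wts))
print([bit_serial_dot(acts, wts, cell_bits=b) == expected for b in (1, 2, 4)])
```

Rows whose current input bit is 0 contribute nothing to the column sum, which mirrors the row skipping described above.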
Figure 5.6: Classification accuracy of CIFAR-10 for an 8-bit CNN as a function of
the ADC precision for partial sums.
For RRAM-based design, there are two read-out schemes. A sequential processing
method of the matrix-vector multiplication is to read out the dot-products in a row-
by-row manner, which leads to extra energy and latency for accumulations along the
rows. A more efficient method is parallel processing, where multiple rows are activated
simultaneously by a switch matrix, and the current summation will be read out by
an ADC. Therefore, the row-by-row accumulation periphery of sequential scheme is
eliminated. However, since it is impractical to use very high-precision ADC at the
edge of RRAM sub-arrays, we have to truncate the precision of ADC (for partial
sums) to minimize the area and energy overhead.
As Fig. 5.6 shows, we perform 8-bit inference of a 9-layer VGG-like CNN algo-
rithm on CIFAR-10 dataset, to investigate the effects of truncating ADC precision on
the classification accuracy. We set the sub-array size to be 128×128, and investigate
three schemes with 1-bit cell, 2-bit cell and 4-bit cell. To minimize the ADC
truncation effects on the partial sums, we utilize nonlinear quantization with
various quantization edges (corresponding to different ADC precisions), where the
edges are determined according to the distribution of the partial sums, as proposed
in (Zhu and Ramanan (2012)). Compared to the baseline accuracy (no ADC truncation),
the results suggest that at least a 4-bit ADC is required to prevent significant
accuracy degradation. Compared to a prior work on binary neural networks where a
3-bit ADC was reportedly sufficient (Sun et al. (2018)), the results in Fig. 5.6
suggest that higher weight precision generally requires higher ADC precision.
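One simple way to derive quantization edges from a partial-sum distribution is to use equal-population (quantile) bins, sketched below. This particular rule, the Gaussian stand-in data, and the function names are assumptions for illustration; the text states only that the edges follow the distribution of the partial sums.

```python
import bisect
import random

def build_quantizer(samples, bits):
    """Equal-population bins: edges at sample quantiles, one representative per bin."""
    s = sorted(samples)
    levels = 1 << bits
    edges = [s[len(s) * k // levels] for k in range(1, levels)]
    bins = [[] for _ in range(levels)]
    for v in samples:
        bins[bisect.bisect_right(edges, v)].append(v)
    reps = [sum(b) / len(b) if b else 0.0 for b in bins]  # bin mean as output value
    return edges, reps

def quantize(v, edges, reps):
    return reps[bisect.bisect_right(edges, v)]

def mse(samples, bits):
    """Mean squared quantization error for a given ADC precision."""
    edges, reps = build_quantizer(samples, bits)
    return sum((v - quantize(v, edges, reps)) ** 2 for v in samples) / len(samples)

random.seed(0)
# Synthetic stand-in for a partial-sum distribution concentrated around zero.
partial_sums = [random.gauss(0.0, 20.0) for _ in range(4096)]
for bits in (2, 3, 4):
    print(bits, round(mse(partial_sums, bits), 2))
```

Because the edges follow the sample quantiles, the bins are narrow where partial sums are dense and wide in the tails, and the quantization error shrinks as the number of ADC bits grows.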
5.4 Benchmarking Across Architectures
As a comprehensive benchmark, we include the digital systolic architecture, the
SRAM-based NMP architecture, and RRAM-based PIM architectures. The RRAM-based PIM
architectures include sequential and parallel designs using 1-bit cell, 2-bit cell
and 4-bit cell RRAMs. We take a 9-layer VGG-like CNN as the case-study algorithm to
perform inference with the selected designs under a hardware resource constraint,
i.e., an area comparable to that of the RRAM-based design, and evaluate the energy
efficiency and frame rate at the 40nm CMOS node, where the integration of RRAM at
the back-end-of-line is feasible from the industry's perspective (Chou et al. (2018)).
As Table 5.2 shows, the size of the input RGB image is 32×32, and there are
6 convolutional layers, 3 max-pooling layers (where every two convolutional layers
are followed by one max-pooling layer) and 3 fully-connected layers. The activation
function is set as ReLU after each layer. We consider two different precision schemes
for weights/activations: one is 1-bit (i.e. XNOR-Net (Rastegari et al. (2016))), and
the other is 8-bit fixed-point precision (as used in TPU (Jouppi et al. (2017))).
Table 5.1: Architecture Properties
Component             Spec. @ 40nm                                    Energy / Inference     Area (mm2)
Conventional Systolic Architecture @ 1GHz
Systolic Array        Size: 128×128; Number: 2                        308µJ                  27.02
Buffer                Size: 256KB; Device: SRAM                       3.81µJ                 1.6
Activation & Pooling  Precision: 8-bit; Number: 512                   6.2µJ                  0.148
Local Register        Size: 64B; Device: Register                     0.35µJ                 0.03
Near-Memory-Processing SRAM Architecture @ 1GHz
SRAM Array            Size: 128×128; Number: 1152                     292.7µJ                14.13
Multiplier            Precision: 16-bit; Number: 1152                 77.95µJ                4.56
Accumulator           Max Output Bit: 29-bit; Number: 1152            60.22µJ                1.66
Activation & Pooling  Precision: 8-bit; Number: 1152                  6.2µJ                  0.148
Buffer                Size: 128KB; Device: Register                   0.13µJ                 5.92
Processing-In-Memory RRAM Architecture @ 1GHz
RRAM Array            Size: 128×128;                                  2.16µJ/1.1µJ/0.54µJ    8.73/4.60/2.27
                      Cell Precision: 1-bit/2-bit/4-bit;
                      Number: 6853/3427/1714
ADC                   Precision: 4-bit/5-bit/6-bit;                   226µJ/221µJ/215µJ      9.13/12.52/15.63
                      Number: 8 SLs share 1
Accumulation Units    Max Output Bit: 19-bit/20-bit/21-bit;           31µJ/19µJ/17µJ         11.9/6.36/3.32
                      Number: 512
Activation & Pooling  Precision: 8-bit; Number: 512                   0.01µJ                 0.071
Buffer                Size: 128KB; Device: Register                   0.3µJ                  5.92
Table 5.2: VGG-like CNN Layer Configuration
Layer # Type IFM Dim.a OFM Dim.a Kernel Size
1 Conv. (3,32,32) (128,32,32) (128,3,3,3)
2 Conv. (128,32,32) (128,32,32) (128,128,3,3)
Pool (128,32,32) (128,16,16)
3 Conv. (128,16,16) (256,16,16) (256,128,3,3)
4 Conv. (256,16,16) (256,16,16) (256,256,3,3)
Pool (256,16,16) (256,8,8)
5 Conv. (256,8,8) (512,8,8) (512,256,3,3)
6 Conv. (512,8,8) (512,8,8) (512,512,2,2)
Pool (512,8,8) (512,4,4)
7 FC. (8192) (1024) (1024,8192)
8 FC. (1024) (1024) (1024,1024)
9 FC. (1024) (10) (10,1024)
a IFM/OFM Dim.: dimension of input feature map (IFM) or output feature map (OFM).
5.4.1 Experiment Setup
Table 5.1 shows the module-level properties for the benchmarked designs, where
the RRAM-based PIM designs are shown as parallel schemes. The parameters of dig-
ital systolic array and NMP with SRAM are obtained from post-synthesis simulation
results using TSMC 40nm CMOS PDKs. The parameters of the RRAM-PIM design are
obtained from the NeuroSim simulator (Chen et al. (2018a)) with technology parameters
for typical HfO2 RRAM, with a 0.5V read voltage and 100kΩ and 10MΩ as Ron and
Roff, respectively.
The energy is reported as energy per inference, i.e., the energy that each
module of each architecture consumes to process one image through all layers of
the CNN. The accumulation units (or accumulators) are mainly built from
multi-stage adder trees, so that partial sums from multiple MAC or sub-array
cores at the same bit level can be accumulated through the adder trees. Because
the feature-map and kernel sizes vary from layer to layer, the number of cores
used for the computation differs considerably between layers. The accumulator in
the SRAM architecture and the accumulation units in the RRAM architectures are
therefore reported with the maximum output bit width, which guarantees the
maximum bit precision for partial-sum accumulation throughout the whole network.
The maximum accumulator output is 29-bit in the SRAM architecture, while in the
RRAM architectures it is much lower, i.e., 19-bit/20-bit/21-bit (for the
different RRAM cell precisions). This is because the ADCs truncate the partial
sums in the RRAM schemes, whereas the computation in the SRAM architecture
remains full precision. Note that, in the RRAM architectures, the flash ADCs are
built from multiple sense amplifiers (SAs), which generate digital outputs with
fixed-point precision, and thermometer encoders, which convert these outputs
into the final partial sums (Chen et al. (2015)). With higher ADC precision
(corresponding to higher RRAM cell precision), the number of sub-arrays
decreases, which leads to slightly lower total energy consumption; however, the
total area increases because of the area overhead of the higher-precision
encoders. Similarly, the area and energy consumption of the accumulation units
also decrease with higher RRAM cell precision, because there are fewer
sub-arrays. The energy for off-chip DRAM is assumed to be 4.6pJ/bit/operation
(Gao et al. (2017)) in all the benchmarks.
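The DRAM energy assumption can be turned into a back-of-the-envelope estimate.
The sketch below reads the 4.6pJ/bit/operation figure as the energy of moving
one bit across the DRAM interface once, and assumes 8-bit input pixels and 8-bit
weights; the weight total is the sum of the kernel-shape products in Table 5.2
with biases ignored. These readings and all names are assumptions for
illustration, not figures reported in this work.

```python
DRAM_PJ_PER_BIT = 4.6          # from the text (Gao et al. (2017))
BITS_PER_VALUE = 8             # 8-bit CNN; 8-bit input pixels are an assumption
IMAGE_VALUES = 3 * 32 * 32     # IFM of layer 1 in Table 5.2
WEIGHT_VALUES = 12_711_296     # sum of kernel-shape products in Table 5.2


def dram_energy_uj(n_values, bits=BITS_PER_VALUE):
    """Energy in microjoules to move n_values words over the DRAM interface once."""
    return n_values * bits * DRAM_PJ_PER_BIT * 1e-6


image_uj = dram_energy_uj(IMAGE_VALUES)      # load one input image
weights_uj = dram_energy_uj(WEIGHT_VALUES)   # reload every weight once

print(f"input image:   {image_uj:.4f} uJ")
print(f"weight reload: {weights_uj:.1f} uJ")
```

Under these assumptions a single full weight reload costs several thousand times
more DRAM energy than loading the input image, which illustrates why keeping all
weights on-chip matters.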
The assumption here is that the on-chip area is just sufficient to store all the
weights of the network in the RRAM arrays. This avoids the expensive RRAM write
operations for reloading the weights. Under a comparable area constraint, both
the digital systolic array and the SRAM-based NMP have to share the neural
functional modules across layers or even within a layer, and need to load
weights from off-chip DRAM frequently, which makes it very difficult to design a
pipelined system. In contrast, taking advantage of their highly compact
implementation, the proposed RRAM-based PIM architectures can be further
optimized by pipelining to improve the hardware performance.
Figure 5.7: Example of pipeline in RRAM architecture.
According to the assumed mapping method and the corresponding data flow, the
size of the IFMs determines the processing speed of each layer. For example, if
the IFM size of layer<i> is M×M, then with same-padding and a stride of 1 it
takes M×M cycles in total to load the IFMs and generate the OFMs. In the
VGG-like CNN, the first two layers have the largest IFM size (32×32). After the
max-pooling layer, the IFM size in layer<3> and layer<4> is only 16×16, so the
processing time of each of these layers drops by 4×. In other words, each
pooling layer reduces the per-layer processing time by 4×.
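The cycle model described above can be written down directly. The sketch below
applies the M×M rule to the IFM sizes of the six convolution layers in Table
5.2; the fully connected layers are left out because the text gives no cycle
model for them, and the names are illustrative.

```python
# IFM height/width of the six convolution layers, from Table 5.2.
CONV_IFM_SIZE = {1: 32, 2: 32, 3: 16, 4: 16, 5: 8, 6: 8}


def layer_cycles(ifm_size):
    """Cycles to stream one M x M IFM with same-padding and a stride of 1."""
    return ifm_size * ifm_size


cycles = {layer: layer_cycles(m) for layer, m in CONV_IFM_SIZE.items()}

for layer, n in cycles.items():
    print(f"layer<{layer}>: {n} cycles")
print("reduction after first pooling:", cycles[2] // cycles[3], "x")
```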
Therefore, to design a balanced pipeline for the RRAM architectures, we divide
these 9 layers into 3 pipeline stages: layer<1>, layer<2>, and layer<3∼9>. In
this case, the pipelined RRAM architectures can process at most 3 images
simultaneously, which can improve the throughput by up to 3×. As Fig. 5.7
shows, while layer<3∼9> is processing image<1> (yellow block in the first row),
layer<2> is processing image<2> (blue block in the second row), and layer<1> is
processing image<3> (green block in the third row). This pipelined system
requires extra buffers to store the results of layer<1> and layer<2>, to avoid
overwriting or misreading data among the three pipeline stages. However, because
the depth of the OFMs in the first two layers is relatively small compared with
the deeper layers, the area and energy overhead of the extra buffers is
affordable.
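The three-stage schedule and its ideal throughput gain can be sketched as
follows. The stage latencies use the M×M cycle model for the convolution layers
of Table 5.2 and ignore the fully connected layers, which is an assumption; the
function names are illustrative and the result is an idealized estimate, not a
measured figure.

```python
# Cycles per stage under the M x M cycle model (conv layers of Table 5.2 only).
# Fully connected layers are ignored here (assumption).
STAGE_CYCLES = {
    "layer<1>": 32 * 32,
    "layer<2>": 32 * 32,
    "layer<3~9>": 2 * 16 * 16 + 2 * 8 * 8,
}


def schedule(n_images, n_stages=3):
    """Per time slot, the image index held by each stage (None if idle)."""
    slots = []
    for t in range(n_images + n_stages - 1):
        row = []
        for s in range(n_stages):
            img = t - s + 1               # image<1> enters stage 0 at slot 0
            row.append(img if 1 <= img <= n_images else None)
        slots.append(row)
    return slots


def ideal_speedup(stage_cycles):
    """Steady-state throughput gain: sequential latency over the slowest stage."""
    return sum(stage_cycles.values()) / max(stage_cycles.values())


for t, row in enumerate(schedule(3)):
    print(f"slot {t}: layer<1>={row[0]}, layer<2>={row[1]}, layer<3~9>={row[2]}")
print("ideal speedup:", ideal_speedup(STAGE_CYCLES))
```

In slot 2 the three stages hold image<3>, image<2> and image<1>, matching the
situation described for Fig. 5.7. Under this simplified model the gain stays
below the 3× upper bound because the third stage is shorter than the first two.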
5.4.2 Benchmark Results and Discussion
Fig. 5.8 compares the systolic and NMP architectures with the pipelined parallel
RRAM architectures. It shows that in the systolic and NMP architectures, DRAM
access dominates the energy (∼53%). This is caused by the limited on-chip memory
capacity: the weights have to be reloaded frequently, and the neuron activations
of several larger or deeper layers may also have to be transferred between
on-chip memory and off-chip DRAM. That is, when the output of layer<i> is too
large to be stored on-chip, it has to be sent back to off-chip DRAM and reloaded
into on-chip memory before the next layer<i+1> can start. Note that, even though
the NMP architecture uses SRAM arrays to store the weights, the area constraint
limits the on-chip SRAM capacity, so the NMP architecture cannot store all the
weights of the whole CNN and weight reloading is still necessary. However, since
the weights are stored near the functional units instead of being transferred
from a distant global buffer (as in the systolic architecture), the NMP
architecture saves interconnect energy for on-chip data communication. In the
RRAM architectures, the interconnect energy decreases as the RRAM cell precision
increases. This is because, with higher-precision cells, the number of
sub-arrays decreases, and so do the chip area and the total wire length of the
interconnects among the sub-arrays and other computational units. However, the
dynamic energy of these three schemes does not vary much, since the ADCs
dominate the total energy consumption: higher-precision ADCs are required in
sub-arrays with higher-precision RRAM cells to maintain high classification
accuracy.
Figure 5.8: Energy breakdown of systolic architecture, NMP architecture and
pipelined parallel RRAM architectures.
Fig. 5.9 shows the average energy consumption per image of selected RRAM-based
architectures with pipelining and without pipelining (i.e., layer-by-layer
operation), where “S-RRAM” denotes the sequential read-out scheme and “P-RRAM”
denotes the parallel read-out scheme.
The energy results are broken down into four categories: interconnect, DRAM,
leakage, and dynamic energy (of the computation). The dynamic energy clearly
dominates, since the highly area-efficient RRAM architectures can hold all the
weights on-chip and activations can be reused on-chip through buffers; the only
off-chip DRAM traffic is loading the input image data. However, the leakage
energy is relatively large (the second dominant factor) in the sequential RRAM
architectures, since the sequential schemes require a much longer time to read
out and accumulate the dot products in a row-by-row manner. We assume a
sub-array size of 128×128 in this work, so the sub-array read-out latency of the
sequential schemes is at least 128× that of the parallel schemes. In both the
layer-by-layer and the pipelined designs, the parallel schemes greatly reduce
the leakage energy by performing the matrix-vector multiplication with much
lower latency. Moreover, the interconnect and DRAM energy of the pipelined
designs is slightly larger than that of the layer-by-layer designs. This is
because, within the same time period, the pipelined architectures have to access
off-chip DRAM roughly two more times than the layer-by-layer architectures, to
load another two images into the 3-stage pipeline.
Figure 5.9: Sequential and parallel RRAM architectures with and without
pipelining, for cell-precision of 1-bit, 2-bit and 4-bit.
Furthermore, in the layer-by-layer design, the next layer<i+1> cannot start on
image<j> until the current layer<i> has generated all of its outputs, and the
earlier layers cannot load a new image<j+1> until all the layers have finished
the current image<j>. As a result, some tiles are operating (for the current
layer<i>) while the other tiles (for all the other layers) are idle, which
causes extra leakage. In contrast, the pipelined design maximizes the tile
utilization and minimizes the leakage, thus improving the hardware performance.
Figure 5.10: Area breakdown of (a) systolic architecture; (b) NMP architecture,
and (c) pipelined parallel RRAM architecture (4-bit/cell).
Fig. 5.10 shows the area breakdown of the systolic, NMP and pipelined parallel
RRAM (with 4-bit cell precision) architectures. In the systolic architecture,
the systolic array and the neural functional units dominate the total area,
while in the NMP architecture, most of the area is used by the SRAM that stores
the synaptic weights. The multipliers and buffers in the NMP architecture also
take up a large share of the area, since they largely determine the throughput
of the system: with more multipliers and buffers available for computation and
data communication, the convolutions can be processed faster.
In the RRAM architectures, the memory efficiency is quite low (only 6.35%)
compared to conventional memory designs. This is because the peripheral circuits
consume much more area than the memory array itself. In particular, the ADCs
dominate the total area when the ADC precision is relatively high (6-bit in the
4-bit-cell design). The buffers consume a similarly large area, to guarantee
enough storage space for pipelined operation.
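One way to see why ADC area grows quickly with precision is the textbook flash
ADC, which uses 2^n - 1 comparators (sense amplifiers) followed by a thermometer
encoder. The sketch below models that structure behaviorally; whether the
benchmarked ADCs use exactly this comparator count, and the evenly spaced
reference levels, are assumptions made for illustration.

```python
def comparator_count(n_bits):
    """Comparators in a textbook n-bit flash ADC."""
    return 2 ** n_bits - 1


def flash_adc(value, full_scale, n_bits):
    """Return (thermometer_code, binary_output) for an idealized flash ADC."""
    n_cmp = comparator_count(n_bits)
    # Evenly spaced reference levels (assumption; real designs may differ).
    refs = [full_scale * (k + 1) / (n_cmp + 1) for k in range(n_cmp)]
    # Each comparator outputs 1 when the input reaches its reference level.
    thermometer = [1 if value >= r else 0 for r in refs]
    # The encoder counts the ones in the thermometer code.
    return thermometer, sum(thermometer)


for bits in (3, 4, 5, 6):
    print(f"{bits}-bit flash ADC: {comparator_count(bits)} comparators")
print(flash_adc(0.5, 1.0, 3))
```

Going from the 4-bit ADC of the 1-bit/cell design to the 6-bit ADC of the
4-bit/cell design raises the comparator count from 15 to 63 in this model.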
In Table 5.3, we summarize the area, energy efficiency and frame rate of all the
selected architectures (the areas are not exactly the same due to practical
constraints). Besides the 8-bit CNN, we also evaluate the hardware performance
of a binary neural network, i.e., XNOR-Net inference with the same algorithm
configuration. As is known, using low-precision weights/activations can greatly
boost the energy efficiency at the cost of slightly degraded classification
accuracy (∼0.4% degradation for the CIFAR-10 dataset).
According to the comprehensive benchmark results for 8-bit CNN inference, the
parallel schemes significantly improve energy efficiency and frame rate over the
sequential RRAM architectures, by avoiding repeated activation of the periphery
(activating WLs separately, accumulating partial sums over multiple cycles,
etc.). The sequential RRAM architectures in fact exhibit lower energy efficiency
and frame rate than the systolic and NMP architectures with SRAM.
With the parallel read-out scheme, the RRAM architectures achieve at least 2.5×
higher energy efficiency than the systolic and NMP architectures. The systolic
architecture is still the winner in frame rate. To process 8-bit computation,
the RRAM architectures have to spend 8 cycles to load the 8-bit neuron
activations, and SL sharing (in the benchmark, we assume that 8 SLs share 1 ADC
and the following accumulation periphery) causes another 8× slowdown, so a large
amount of latency is wasted. However, by introducing the 3-stage pipeline in the
parallel RRAM architectures, the throughput improves by about 2.5× (Table 5.3;
3× is the upper bound for three stages), while the energy efficiency also
increases slightly because on-chip leakage is minimized.
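The 8-cycle activation loading can be illustrated with a bit-serial dot product,
in which each cycle applies one bit plane of the activations and the partial
sums are combined by shift-and-add. The sketch assumes unsigned 8-bit
activations and integer weights, and the combined 64× factor is simply the
product of the two 8× factors named above; the names are illustrative and the
code does not model the ADC.

```python
ACT_BITS = 8       # activation bits fed one per cycle (from the text)
SL_PER_ADC = 8     # source lines sharing one ADC (from the text)


def bit_serial_dot(activations, weights, n_bits=ACT_BITS):
    """Dot product with unsigned activations applied one bit plane per cycle."""
    total = 0
    for b in range(n_bits):                       # one cycle per activation bit
        # Rows whose current activation bit is 1 contribute their weight.
        partial = sum(w for a, w in zip(activations, weights) if (a >> b) & 1)
        total += partial << b                     # shift-and-add across cycles
    return total


# Combined slowdown relative to a single-cycle, one-ADC-per-SL read-out.
slowdown_factor = ACT_BITS * SL_PER_ADC

acts = [3, 255, 0, 17]
wts = [2, -1, 5, 4]
print(bit_serial_dot(acts, wts), sum(a * w for a, w in zip(acts, wts)))
print("slowdown factor:", slowdown_factor)
```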
It is clear that, in the parallel RRAM architectures, the ADCs dominate area and
energy, and the ADC precision directly affects the classification accuracy.
Considering the tradeoffs among all the metrics (area, energy and accuracy), an
optimal design point can be found. For example, in the 8-bit CNN inference
engines, to guarantee the classification accuracy with a sub-array size of
128×128, the ADC precision should be at least 4-bit (or 5-bit and 6-bit) for the
1-bit/cell (or 2-bit/cell and 4-bit/cell) design. Despite these area- and
energy-dominant high-precision ADCs, the pipelined parallel RRAM architectures
still achieve higher energy efficiency than the systolic and NMP architectures,
with a frame rate above that of NMP and comparable to that of the systolic array
(Table 5.3).
Table 5.3: Benchmark Results
XNOR                            Area (mm^2)   TOPS/W    FPS
NMP-SRAM                        2.972         10.1      2900
S-RRAM                          4.22          18.08     363
P-RRAM                          3.45          140.1     15987
S-RRAM-Pipeline                 5.29          19.6      947
P-RRAM-Pipeline                 5.42          141.017   33833

8-bit CNN                       Area (mm^2)   TOPS/W    FPS
Systolic Array                  28.8          1.27      7329
NMP-SRAM                        26.425        1.28      2900
S-RRAM (1-bit/cell)             33            0.51      46
S-RRAM (2-bit/cell)             21            0.42      45
S-RRAM (4-bit/cell)             16            0.31      43
P-RRAM (1-bit/cell)             36            3.68      2211
P-RRAM (2-bit/cell)             30            4.24      2364
P-RRAM (4-bit/cell)             27            4.6       2428
S-RRAM-Pipeline (1-bit/cell)    41.4          0.6       120
S-RRAM-Pipeline (2-bit/cell)    29.3          0.45      117
S-RRAM-Pipeline (4-bit/cell)    24.2          0.32      114
P-RRAM-Pipeline (1-bit/cell)    44.6          3.75      5524
P-RRAM-Pipeline (2-bit/cell)    38            4.28      5917
P-RRAM-Pipeline (4-bit/cell)    35.7          4.62      6080
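The headline ratios discussed in this section can be recomputed from Table 5.3.
The sketch below copies a few (TOPS/W, FPS) pairs from the table; the dictionary
keys are shorthand labels and the helper function is illustrative.

```python
# (TOPS/W, FPS) pairs copied from Table 5.3; keys are shorthand labels.
CNN_8BIT = {
    "Systolic": (1.27, 7329),
    "NMP-SRAM": (1.28, 2900),
    "P-RRAM-1b": (3.68, 2211),
    "P-RRAM-Pipeline-1b": (3.75, 5524),
    "P-RRAM-Pipeline-4b": (4.62, 6080),
}
XNOR = {
    "NMP-SRAM": (10.1, 2900),
    "P-RRAM": (140.1, 15987),
}
TOPS_W, FPS = 0, 1   # column indices into the tuples above


def ratio(table, a, b, column):
    """Value of design a divided by value of design b in the given column."""
    return table[a][column] / table[b][column]


print("pipelining FPS gain (1-bit/cell):",
      round(ratio(CNN_8BIT, "P-RRAM-Pipeline-1b", "P-RRAM-1b", FPS), 2))
print("P-RRAM vs NMP energy efficiency (8-bit):",
      round(ratio(CNN_8BIT, "P-RRAM-1b", "NMP-SRAM", TOPS_W), 2))
print("P-RRAM vs NMP energy efficiency (XNOR):",
      round(ratio(XNOR, "P-RRAM", "NMP-SRAM", TOPS_W), 2))
print("P-RRAM vs NMP FPS (XNOR):",
      round(ratio(XNOR, "P-RRAM", "NMP-SRAM", FPS), 2))
```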
Furthermore, a recent algorithmic work (Rastegari et al. (2016)) demonstrates
that high-precision MACs can be replaced by bit-wise XNOR and bit-counting
operations while still achieving satisfactory accuracy on image classification,
where both the weights and the neuron activations are binarized to -1 or +1.
This further helps to optimize the design of RRAM architectures by lowering the
ADC precision, since a lower-precision scheme produces coarser-grained partial
sums and thus requires lower ADC precision.
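The replacement of MACs by XNOR and bit counting can be sketched as follows.
With +1 encoded as bit 1 and -1 as bit 0, the dot product of two length-n
vectors equals twice the number of matching bit positions minus n. The encoding
convention and function names are assumptions chosen for illustration.

```python
def encode(values):
    """Pack a list of +1/-1 values into an integer (bit i is 1 for +1)."""
    bits = 0
    for i, v in enumerate(values):
        if v == 1:
            bits |= 1 << i
    return bits


def xnor_popcount_dot(a_bits, w_bits, n):
    """Dot product of two +1/-1 vectors of length n from their bit encodings."""
    mask = (1 << n) - 1
    # XNOR marks positions where both vectors hold the same value.
    matches = bin(~(a_bits ^ w_bits) & mask).count("1")
    # Each match contributes +1 and each mismatch -1.
    return 2 * matches - n


a = [1, -1, 1, 1]
w = [1, 1, -1, 1]
print(xnor_popcount_dot(encode(a), encode(w), len(a)),
      sum(x * y for x, y in zip(a, w)))
```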
For the benchmarks on the binary neural network, we implement the XNOR-RRAM
(Sun et al. (2018)) architectures with 128×128 sub-arrays based on 1-bit RRAM
cells and 1-bit activations, where the ADC precision is 3-bit (to maintain >86%
accuracy). The NMP architecture is built with XNOR digital logic arrays (instead
of the 8-bit multipliers and accumulators used for the 8-bit CNN). The results
show that the parallel RRAM architectures achieve about 14× the energy
efficiency and more than 5× the frame rate of the NMP architecture. This is
because XNOR-Net reduces the weights and activations to 1 bit, so the RRAM
architectures do not spend extra energy and latency on multi-bit activations;
furthermore, the lower ADC precision also leads to a significant energy
reduction.
5.5 Conclusion
In this chapter, we benchmark three inference engines based on CMOS and post-
CMOS technologies, which are systolic architecture, NMP architecture with SRAM,
and PIM architectures with RRAM. We use a VGG-like CNN as the case-study
algorithm to evaluate 8-bit and XNOR inference with the area constraint.
The quantitative analysis shows that the parallel RRAM architectures with
pipelining achieve at least 2.5× higher energy efficiency than the systolic or
NMP architectures, with a comparable frame rate for the 8-bit inference engine.
They show even better performance for the XNOR inference engine compared to the
NMP architecture. However, the parallel RRAM architectures lead to a slight
degradation (∼0.4%) of classification accuracy due to the limited ADC precision
for partial sums. The design of area-efficient and low-power ADCs can thus be
considered the breakthrough point for PIM architectures, and when the
classification accuracy can be maintained with lower-precision networks, PIM
architectures based on compact non-volatile memory technology will become an
even more promising solution for highly efficient neural network accelerators.
The results reveal that the RRAM-based PIM architectures succeed because of:
1) the dense memory array, which saves area for the periphery; 2) minimal
off-chip DRAM access, thanks to sufficient on-chip storage; 3) efficient
matrix-vector multiplication by parallel read-out, which exploits the crossbar
array; 4) optimized design options obtained by lowering the ADC precision
according to its trade-off with accuracy degradation; 5) the possibility of
designing a high-dimensional pipeline system.
To sum up, this work provides a holistic benchmark based on technology-practical
assumptions and parameters, revealing the pros and cons of the systolic architecture,
SRAM based NMP architecture, and emerging RRAM based PIM architectures. Fu-
ture work may extend to the scalability of RRAM to more advanced nodes, and study
the impact of the process variations.
Chapter 6
SUMMARY
This work presented a comprehensive study of energy-efficient hardware
accelerator implementations for machine/deep learning algorithms and of emerging
technologies such as in-memory computing. It demonstrated the essential role of
special-purpose ASICs in achieving energy-efficient, real-time, and highly
accurate performance on energy-constrained hardware platforms such as mobile,
wearable, and Internet of Things (IoT) devices. The proposed energy-efficient
ASIC accelerators were demonstrated with prototype chips fabricated in 40nm and
65nm CMOS technologies, together with the optimization strategies adopted for
them. This work also presented inference engine benchmarking from CMOS to
emerging technologies and an overall DNN accelerator using an in-memory
computing architecture. The main contributions of this work are:
1. Energy-efficient hardware accelerator implementation for machine
learning: This work presented a 65nm accelerator for real-time programmable
object detection that employs the HeadHunter model, based on a set of five rigid
templates with 2,000 Adaboost weak classifiers, to build a strong object
classifier. High average precision of 0.88, 0.81, 0.76, 0.72 and 0.54 was
achieved on the FDDB, AFW, Caltech car plate, BTSD, and INRIA person datasets,
respectively. The accelerator achieved 0.54/1.75 nJ/pixel while consuming
22.5/181.7 mW at 0.58/1.1 V with 20/50 fps on full HD videos, respectively. The
programmable and voltage-/performance-scalable prototype chip will enhance smart
vision processors in ubiquitous mobile systems.
2. Hardware accelerator for deep learning using the novel conditional
computing scheme: This work also presented a 40nm energy-efficient accelerator
for deep convolutional neural networks. It proposed a precision-cascading scheme
to reduce the convolution operations made redundant by max-pooling. In addition,
by integrating precision-cascading with fully zero-skipping, which exploits
zero-valued data, we achieved a significant reduction in energy and external
memory accesses. The accelerator achieved a peak of 7.35 TOPS/W and an average
of 1.01 TOPS/W for the VGG-16 convolution layers on the ILSVRC2012 validation
set while consuming 245.36 mW at 0.9V. The proposed convolution loop
acceleration strategy with the fully zero-skipping scheme reduced the number of
off-chip memory accesses by 2.12× overall in the VGG-16 convolution layers.
3. Implementation of the overall DNN accelerator using in-memory
computing SRAM: This work expanded the single-array-level in-SRAM work toward a
configurable DNN accelerator architecture that integrates 72 XNOR-SRAM arrays.
The proposed architecture features: 1) methodologies to efficiently load/map
weights onto such XNOR-SRAM arrays for the convolutional and fully connected
layers of DNNs; 2) multi-bit activation memory storage and control; 3) a
double-buffering technique to hide the latency of reprogramming the in-memory
computing SRAM arrays; and 4) inter-array communication. With these
comprehensive designs, this work simultaneously achieved high accuracy and low
energy for representative DNNs benchmarked on the MNIST and CIFAR-10 datasets.
The proposed accelerator features essential techniques for in-SRAM
computing-based deep learning processors, which can fit within the stringent
power/energy envelopes of mobile, wearable, and IoT devices.
4. Comprehensive and holistic comparison of neuro-inspired computing
systems: This work benchmarked three inference engines based on CMOS and
emerging technologies, namely the systolic architecture, near-memory processing
(NMP) architectures with SRAM, and processing-in-memory (PIM) architectures with
RRAM, using a VGG-like CNN. The quantitative analysis showed that the PIM
architectures are more energy-efficient than the systolic and NMP architectures
because of: 1) the dense memory array, which saves area for the periphery;
2) minimal off-chip DRAM access, thanks to sufficient on-chip storage;
3) efficient matrix-vector multiplication by parallel read-out, which exploits
the crossbar array; 4) optimized design options obtained by lowering the ADC
precision according to its trade-off with accuracy degradation; 5) the
possibility of designing a high-dimensional pipeline system.
REFERENCES
Advani, S., Y. Tanabe, K. Irick, J. Sampson and V. Narayanan, “A scalable architec-
ture for multi-class visual object detection”, in “2015 25th International Conference
on Field Programmable Logic and Applications (FPL)”, pp. 1–8 (IEEE, 2015).
Albericio, J., P. Judd, T. Hetherington, T. Aamodt, N. Jerger and A. Moshovos,
“Cnvlutin: Ineffectual-neuron-free deep neural network computing”, in “2016
ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)”,
pp. 1–13 (2016).
Bacon, D. F., S. L. Graham and O. J. Sharp, “Compiler transformations for high-
performance computing”, ACM Computing Surveys (CSUR) 26, 4, 345–420 (1994).
Benenson, R., M. Mathias, R. Timofte and L. Van Gool, “Pedestrian detection at 100
frames per second”, in “2012 IEEE Conference on Computer Vision and Pattern
Recognition”, pp. 2903–2910 (IEEE, 2012).
Biswas, A. and A. P. Chandrakasan, “Conv-ram: An energy-efficient sram with em-
bedded convolution computation for low-power cnn-based machine learning appli-
cations”, in “2018 IEEE International Solid-State Circuits Conference-(ISSCC)”,
pp. 488–490 (IEEE, 2018).
Bong, K., S. Choi, C. Kim, S. Kang, Y. Kim and H.-J. Yoo, “14.6 a 0.62 mw ultra-low-
power convolutional-neural-network face-recognition processor and a cis integrated
with always-on haar-like face detector”, in “2017 IEEE International Solid-State
Circuits Conference (ISSCC)”, pp. 248–249 (IEEE, 2017).
Burr, G. W., R. M. Shelby, S. Sidler, C. Di Nolfo, J. Jang, I. Boybat, R. S. Shenoy,
P. Narayanan, K. Virwani, E. U. Giacometti et al., “Experimental demonstration
and tolerancing of a large-scale neural network (165 000 synapses) using phase-
change memory as the synaptic weight element”, IEEE Transactions on Electron
Devices 62, 11, 3498–3507 (2015).
Caltech, “Car dataset for car license plate detection”, in “URL
http://www.vision.caltech.edu/archive.html”, (2001).
Chen, C.-Y., M. Q. Le and K. Y. Kim, “A low power 6-bit flash adc with reference
voltage and common-mode calibration”, IEEE Journal of solid-state circuits 44, 4,
1041–1046 (2009).
Chen, P.-Y., D. Kadetotad, Z. Xu, A. Mohanty, B. Lin, J. Ye, S. Vrudhula, J.-s.
Seo, Y. Cao and S. Yu, “Technology-design co-optimization of resistive cross-point
array for accelerating learning algorithms on chip”, in “Proceedings of the 2015
Design, Automation & Test in Europe Conference & Exhibition”, pp. 854–859
(EDA Consortium, 2015).
Chen, P.-Y., X. Peng and S. Yu, “Neurosim: A circuit-level macro model for bench-
marking neuro-inspired architectures in online learning”, IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems 37, 12, 3067–3080
(2018a).
Chen, P.-Y. and S. Yu, “Partition sram and rram based synaptic arrays for neuro-
inspired computing”, in “2016 IEEE International Symposium on Circuits and
Systems (ISCAS)”, pp. 2310–2313 (IEEE, 2016).
Chen, W.-H., K.-X. Li, W.-Y. Lin, K.-H. Hsu, P.-Y. Li, C.-H. Yang, C.-X. Xue,
E.-Y. Yang, Y.-K. Chen, Y.-S. Chang et al., “A 65nm 1mb nonvolatile computing-
in-memory reram macro with sub-16ns multiply-and-accumulate for binary dnn
ai edge processors”, in “2018 IEEE International Solid-State Circuits Conference-
(ISSCC)”, pp. 494–496 (IEEE, 2018b).
Chen, X., J. Xu and Z. Yu, “A 68 mw 2.2 tops/w low bit-width and multiplierless
dcnn object detection processor for visually impaired people”, IEEE Transactions
on Circuits and Systems for Video Technology (2018c).
Chen, Y., T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun
et al., “Dadiannao: A machine-learning supercomputer”, in “Proceedings of the
47th Annual IEEE/ACM International Symposium on Microarchitecture”, pp. 609–
622 (IEEE Computer Society, 2014).
Chen, Y.-H., T. Krishna, J. S. Emer and V. Sze, “Eyeriss: An energy-efficient re-
configurable accelerator for deep convolutional neural networks”, IEEE Journal of
Solid-State Circuits 52, 1, 127–138 (2016).
Chi, P., S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang and Y. Xie, “Prime: A
novel processing-in-memory architecture for neural network computation in reram-
based main memory”, in “ACM SIGARCH Computer Architecture News”, vol. 44,
pp. 27–39 (IEEE Press, 2016).
Chou, C.-C., Z.-J. Lin, P.-L. Tseng, C.-F. Li, C.-Y. Chang, W.-C. Chen, Y.-D. Chih
and T.-Y. J. Chang, “An n40 256k× 44 embedded rram macro with sl-precharge sa
and low-voltage current limiter to improve read and write performance”, in “2018
IEEE International Solid-State Circuits Conference-(ISSCC)”, pp. 478–480 (IEEE,
2018).
Davis, J. and M. Goadrich, “The relationship between precision-recall and roc
curves”, in “Proceedings of the 23rd international conference on Machine learn-
ing”, pp. 233–240 (ACM, 2006).
Dollár, P., Z. Tu, P. Perona and S. Belongie, “Integral channel features”, (2009).
Everingham, M., L. Van Gool, C. Williams, J. Winn and A. Zisserman, “The pascal
visual object classes challenge 2012 (voc2012) results (2012)”, in “URL
http://www.pascal-network.org/challenges/VOC/voc2011/workshop/index.html”, (2011).
Felzenszwalb, P. F., R. B. Girshick, D. McAllester and D. Ramanan, “Object de-
tection with discriminatively trained part-based models”, IEEE transactions on
pattern analysis and machine intelligence 32, 9, 1627–1645 (2009).
Gao, M., J. Pu, X. Yang, M. Horowitz and C. Kozyrakis, “Tetris: Scalable and effi-
cient neural network acceleration with 3d memory”, in “ACM SIGARCH Computer
Architecture News”, vol. 45, pp. 751–764 (ACM, 2017).
Gokmen, T., M. Onen and W. Haensch, “Training deep convolutional neural networks
with resistive cross-point devices”, Frontiers in neuroscience 11, 538 (2017).
Gokmen, T. and Y. Vlasov, “Acceleration of deep neural network training with resis-
tive cross-point devices: Design considerations”, Frontiers in neuroscience 10, 333
(2016).
Guan, T., X. Zeng and M. Seok, “Extending memory capacity of neural associative
memory based on recursive synaptic bit reuse”, in “Design, Automation & Test in
Europe Conference & Exhibition (DATE), 2017”, pp. 1603–1606 (IEEE, 2017a).
Guan, T., X. Zeng and M. Seok, “Recursive binary neural network learning model
with 2.28 b/weight storage requirement”, arXiv preprint arXiv:1709.05306 (2017b).
Han, S., H. Mao and W. J. Dally, “Deep compression: Compressing deep neural
networks with pruning, trained quantization and huffman coding”, arXiv preprint
arXiv:1510.00149 (2015).
He, K., X. Zhang, S. Ren and J. Sun, “Deep residual learning for image recognition”,
in “Proceedings of the IEEE conference on computer vision and pattern recogni-
tion”, pp. 770–778 (2016).
Hubara, I., M. Courbariaux, D. Soudry, R. El-Yaniv and Y. Bengio, “Binarized neural
networks”, in “Advances in neural information processing systems”, pp. 4107–4115
(2016).
Hubara, I., M. Courbariaux, D. Soudry, R. El-Yaniv and Y. Bengio, “Quantized neu-
ral networks: Training neural networks with low precision weights and activations”,
The Journal of Machine Learning Research 18, 1, 6869–6898 (2017).
INRIA, “Person dataset for pedestrian detection”, in “URL
http://pascal.inrialpes.fr/data/human/”, (2005).
Jain, V. and E. Learned-Miller, “Fddb: A benchmark for face detection in uncon-
strained settings”, (2010).
Jeon, D., Q. Dong, Y. Kim, X. Wang, S. Chen, H. Yu, D. Blaauw and D. Sylvester, “A
23mw face recognition accelerator in 40nm cmos with mostly-read 5t memory”, in
“2015 Symposium on VLSI Circuits (VLSI Circuits)”, pp. C48–C49 (IEEE, 2015).
Jiang, Z., S. Yin, M. Seok and J.-s. Seo, “Xnor-sram: In-memory computing sram
macro for binary/ternary deep neural networks”, in “2018 IEEE Symposium on
VLSI Technology”, pp. 173–174 (IEEE, 2018).
Jouppi, N. P., C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates,
S. Bhatia, N. Boden, A. Borchers et al., “In-datacenter performance analysis of a
tensor processing unit”, in “2017 ACM/IEEE 44th Annual International Sympo-
sium on Computer Architecture (ISCA)”, pp. 1–12 (IEEE, 2017).
Khwa, W.-S., J.-J. Chen, J.-F. Li, X. Si, E.-Y. Yang, X. Sun, R. Liu, P.-Y. Chen,
Q. Li, S. Yu et al., “A 65nm 4kb algorithm-dependent computing-in-memory sram
unit-macro with 2.3 ns and 55.8 tops/w fully parallel product-sum operation for
binary dnn edge processors”, in “2018 IEEE International Solid-State Circuits
Conference-(ISSCC)”, pp. 496–498 (IEEE, 2018).
Kim, M., A. Mohanty, D. Kadetotad, L. Wei, X. He, Y. Cao and J.-S. Seo, “A real-
time 17-scale object detection accelerator with adaptive 2000-stage classification in
65 nm cmos”, IEEE Transactions on Circuits and Systems I: Regular Papers 66,
10, 3843–3853 (2019).
Koestinger, M., P. Wohlhart, P. M. Roth and H. Bischof, “Annotated facial landmarks
in the wild: A large-scale, real-world database for facial landmark localization”, in
“2011 IEEE international conference on computer vision workshops (ICCV work-
shops)”, pp. 2144–2151 (IEEE, 2011).
Kung, H., “Algorithms for vlsi processor arrays”, Introduction to VLSI systems pp.
271–292 (1980).
Lee, J., J. Lee, D. Han, J. Lee, G. Park and H.-J. Yoo, “7.7 lnpu: A 25.3 tflops/w
sparse deep-neural-network learning processor with fine-grained mixed precision of
fp8-fp16”, in “2019 IEEE International Solid-State Circuits Conference-(ISSCC)”,
pp. 142–144 (IEEE, 2019).
Lee, K. J., K. Bong, C. Kim, J. Jang, H. Kim, J. Lee, K.-R. Lee, G. Kim and H.-J.
Yoo, “14.2 a 502gops and 0.984 mw dual-mode adas soc with rnn-fis engine for
intention prediction in automotive black-box system”, in “2016 IEEE International
Solid-State Circuits Conference (ISSCC)”, pp. 256–257 (IEEE, 2016).
Li, H., Z. Lin, X. Shen, J. Brandt and G. Hua, “A convolutional neural network
cascade for face detection”, in “Proceedings of the IEEE conference on computer
vision and pattern recognition”, pp. 5325–5334 (2015).
Liu, B., H. Li, Y. Chen, X. Li, Q. Wu and T. Huang, “Vortex: variation-aware training
for memristor x-bar”, in “Proceedings of the 52nd Annual Design Automation
Conference”, p. 15 (ACM, 2015).
Mathias, M., R. Benenson, M. Pedersoli and L. Van Gool, “Face detection without
bells and whistles”, in “European conference on computer vision”, pp. 720–735
(Springer, 2014).
Merolla, P. A., J. V. Arthur, R. Alvarez-Icaza, A. S. Cassidy, J. Sawada, F. Akopyan,
B. L. Jackson, N. Imam, C. Guo, Y. Nakamura et al., “A million spiking-neuron
integrated circuit with a scalable communication network and interface”, Science
345, 6197, 668–673 (2014).
Moons, B., R. Uytterhoeven, W. Dehaene and M. Verhelst, “14.5 envision: A 0.26-
to-10tops/w subword-parallel dynamic-voltage-accuracy-frequency-scalable convo-
lutional neural network processor in 28nm fdsoi”, in “2017 IEEE International
Solid-State Circuits Conference (ISSCC)”, pp. 246–247 (IEEE, 2017).
Parveen, F., Z. He, S. Angizi and D. Fan, “Hielm: Highly flexible in-memory com-
puting using stt mram”, in “2018 23rd Asia and South Pacific Design Automation
Conference (ASP-DAC)”, pp. 361–366 (IEEE, 2018).
Ranjan, R., V. M. Patel and R. Chellappa, “Hyperface: A deep multi-task learning
framework for face detection, landmark localization, pose estimation, and gender
recognition”, IEEE Transactions on Pattern Analysis and Machine Intelligence 41,
1, 121–135 (2017).
Rastegari, M., V. Ordonez, J. Redmon and A. Farhadi, “Xnor-net: Imagenet classi-
fication using binary convolutional neural networks”, in “European Conference on
Computer Vision”, pp. 525–542 (Springer, 2016).
Redmon, J. and A. Farhadi, “Yolo9000: better, faster, stronger”, in “Proceedings of
the IEEE conference on computer vision and pattern recognition”, pp. 7263–7271
(2017).
Sancho, J. C. and D. J. Kerbyson, “Analysis of double buffering on two different
multicore architectures: Quad-core opteron and the cell-be”, in “2008 IEEE In-
ternational Symposium on Parallel and Distributed Processing”, pp. 1–12 (IEEE,
2008).
Shafiee, A., A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu,
R. S. Williams and V. Srikumar, “Isaac: A convolutional neural network accel-
erator with in-situ analog arithmetic in crossbars”, ACM SIGARCH Computer
Architecture News 44, 3, 14–26 (2016).
Shin, D., J. Lee, J. Lee and H.-J. Yoo, “14.2 dnpu: An 8.1 tops/w reconfigurable
cnn-rnn processor for general-purpose deep neural networks”, in “2017 IEEE Inter-
national Solid-State Circuits Conference (ISSCC)”, pp. 240–241 (IEEE, 2017).
Simonyan, K. and A. Zisserman, “Very deep convolutional networks for large-scale
image recognition”, arXiv preprint arXiv:1409.1556 (2014).
Song, L., X. Qian, H. Li and Y. Chen, “Pipelayer: A pipelined reram-based accelerator
for deep learning”, in “2017 IEEE International Symposium on High Performance
Computer Architecture (HPCA)”, pp. 541–552 (IEEE, 2017).
Srivastava, P., M. Kang, S. K. Gonugondla, S. Lim, J. Choi, V. Adve, N. S. Kim
and N. Shanbhag, “Promise: An end-to-end design of a programmable mixed-
signal accelerator for machine-learning algorithms”, in “Proceedings of the 45th
Annual International Symposium on Computer Architecture”, pp. 43–56 (IEEE
Press, 2018).
Suleiman, A., Y.-H. Chen, J. Emer and V. Sze, “Towards closing the energy gap
between hog and cnn features for embedded vision”, in “2017 IEEE International
Symposium on Circuits and Systems (ISCAS)”, pp. 1–4 (IEEE, 2017a).
Suleiman, A. and V. Sze, “An energy-efficient hardware implementation of hog-based
object detection at 1080hd 60 fps with multi-scale support”, Journal of Signal
Processing Systems 84, 3, 325–337 (2016).
Suleiman, A., Z. Zhang and V. Sze, “A 58.6 mw real-time programmable object
detector with multi-scale multi-object support using deformable parts model on
1920×1080 video at 30fps”, in “2016 IEEE Symposium on VLSI Circuits (VLSI-Circuits)”,
pp. 1–2 (IEEE, 2016).
Suleiman, A., Z. Zhang and V. Sze, “A 58.6 mw 30 frames/s real-time pro-
grammable multiobject detection accelerator with deformable parts models on
full hd 1920×1080 videos”, IEEE Journal of Solid-State Circuits 52, 3, 844–855
(2017b).
Sun, X., S. Yin, X. Peng, R. Liu, J.-s. Seo and S. Yu, “Xnor-rram: A scalable and
parallel resistive synaptic architecture for binary neural networks”, in “2018 Design,
Automation & Test in Europe Conference & Exhibition (DATE)”, pp. 1423–1428
(IEEE, 2018).
Takagi, K., K. Tanaka, S. Izumi, H. Kawaguchi and M. Yoshimoto, “A real-time
scalable object detection system using low-power hog accelerator vlsi”, Journal of
Signal Processing Systems 76, 3, 261–274 (2014).
Timofte, R., K. Zimmermann and L. Van Gool, “Multi-view traffic sign detection,
recognition, and 3d localisation”, Machine vision and applications 25, 3, 633–647
(2014).
Valavi, H., P. J. Ramadge, E. Nestler and N. Verma, “A mixed-signal binarized
convolutional-neural-network accelerator integrating dense weight storage and mul-
tiplication for reduced data movement”, in “2018 IEEE Symposium on VLSI Cir-
cuits”, pp. 141–142 (IEEE, 2018).
Viola, P., M. Jones et al., “Rapid object detection using a boosted cascade of simple
features”, CVPR (1) 1, 511–518, 3 (2001).
Von Neumann, J., “The principles of large-scale computing machines”, Annals of the
History of Computing 3, 3, 263–273 (1981).
Yang, S., P. Luo, C. C. Loy and X. Tang, “Faceness-net: Face detection through
deep facial part responses”, IEEE transactions on pattern analysis and machine
intelligence 40, 8, 1845–1859 (2017).
Yin, S., P. Ouyang, S. Tang, F. Tu, X. Li, S. Zheng, T. Lu, J. Gu, L. Liu and S. Wei,
“A high energy efficient reconfigurable hybrid neural network processor for deep
learning applications”, IEEE Journal of Solid-State Circuits 53, 4, 968–982 (2017).
Yu, S., Neuro-inspired computing using resistive synaptic devices (Springer, 2017).
Yu, S., “Neuro-inspired computing with emerging nonvolatile memorys”, Proceedings
of the IEEE 106, 2, 260–285 (2018).
Zhang, C., P. Li, G. Sun, Y. Guan, B. Xiao and J. Cong, “Optimizing fpga-based ac-
celerator design for deep convolutional neural networks”, in “Proceedings of the
2015 ACM/SIGDA International Symposium on Field-Programmable Gate Ar-
rays”, pp. 161–170 (ACM, 2015).
Zhang, J., Z. Wang and N. Verma, “A machine-learning classifier implemented in
a standard 6t sram array”, in “2016 IEEE Symposium on VLSI Circuits (VLSI-
Circuits)”, pp. 1–2 (IEEE, 2016).
Zhou, S., Y. Wu, Z. Ni, X. Zhou, H. Wen and Y. Zou, “Dorefa-net: Training
low bitwidth convolutional neural networks with low bitwidth gradients”, arXiv
preprint arXiv:1606.06160 (2016).
Zhu, X. and D. Ramanan, “Face detection, pose estimation, and landmark local-
ization in the wild”, in “2012 IEEE conference on computer vision and pattern
recognition”, pp. 2879–2886 (IEEE, 2012).
