FPGA-based high-performance neural network acceleration by Geng, Tong
Boston University
OpenBU http://open.bu.edu













B.E., Zhejiang University, 2012
M.S., Eindhoven University of Technology, 2015
Submitted in partial fulfillment of the





All rights reserved except for chapter 3, which is
©2020 by IEEE and ACM, chapter 4, which is
©2019 by IEEE, chapter 5, which is ©2019 by
ACM and ©2021 by IEEE, chapter 6, which is
©2020 by IEEE, chapter 7, which is ©2018, 2020
by IEEE, and chapter 8, which is ©2020 by ACM
Approved by
First Reader
Martin C. Herbordt, Ph.D.
Professor of Electrical and Computer Engineering
Second Reader
Roscoe C. Giles, Ph.D.
Professor of Electrical and Computer Engineering
Third Reader
Tali Moreshet, Ph.D.




HPC Team C Computer Scientist
Pacific Northwest National Laboratory
Fifth Reader
Caiwen Ding, Ph.D.
Assistant Professor of Computer Science and Engineering
University of Connecticut
Do not go gentle into that good night.
Rage, rage against the dying of the light. Dylan Thomas
Acknowledgments
The development of this dissertation has a 6-year history, and I am greatly indebted to many
professors, colleagues, researchers, and institutions. In particular, I am profoundly grateful
to my advisor, Prof. Martin C. Herbordt. He offered penetrating insights and limitless help
at every turn of my research, gave me precious freedom to conduct research in the field I
am interested in, and encouraged me to challenge myself to apply a higher standard to my
academic research. Unlike many other supervisors, Prof. Herbordt is not
only an advisor on my academic path but also a mentor in my personal life. His wisdom,
wit, and sense of humor have made many long afternoon meetings enjoyable adventures
and made my graduate career much more bearable. It is my honor to have him as my
Ph.D. advisor.
I also want to sincerely thank my previous Ph.D. advisor, Prof. Henk Corporaal at
Eindhoven University of Technology. He led me into the world of research and gave me the
confidence to become a good researcher, though I knew nothing about computer architecture
when I first met him. He brought me to my first conference, in Greece, and calmed me down
before I gave my first talk in front of professionals by telling me the story of his first talk.
Without his recommendation, I would not have had the chance to transfer to Boston University
and join the research group of Prof. Martin C. Herbordt.
I am grateful to my mentor at Pacific Northwest National Laboratory, Dr. Ang Li.
Without his substantial support and advice, I would not have my dissertation. He introduced
me to the community and let me know what the community cares about.
I would like to thank all the other members of my dissertation defense committee.
Prof. Roscoe Giles helped me summarize my thesis statement. Prof. Tali Moreshet helped
me address many technical issues in my prospectus and dissertation. Prof. Caiwen Ding
provided me with many insightful suggestions on the optimization of neural network models.
I would also like to thank all my collaborators and colleagues including Dr. Tianqi
Wang, Chunshu Wu, Dr. Runbin Shi, Dr. Chen Yang, Pouya Haghi, Rushi Patel, Dr.
Qingqing Xiong, Dr. Ahmed Sanaullah, Dr. Yifan He, Dr. Xin Chen, Dr. Luc Waeijen,
Dr. Maurice Peemen, Dr. Erkan Diken, Prof. Lech Jozwiak, Prof. Yanzhi Wang, Prof.
Shuaiwen Leon Song, Dr. Shuai Che, and Dr. Steve Reinhardt. Without your kind help and
hard work, I would not have my dissertation.
I am also grateful to my current CAAD lab mates, Rushi Patel, Robert Munafo, Chunshu
Wu, Pouya Haghi, Sahan Bandara, Anqi Guo, Anthony Ducimo, and Pierre-Francois
Wolfe. They gave me plenty of research advice and brought lots of fun to my Ph.D. life.
Last but not least, I would like to sincerely thank my family for their unconditional and
limitless support during my Ph.D. journey. My dear wife, Sarah Yuan He, is always a powerful
support to me and ready to lend me her strong shoulder to lean on when I feel weak. My
first child, Charles Geng, was born on June 13th, 2020. Sarah was even helping me revise my
MICRO rebuttal on the night of June 12th. Without her support and help, I would not
have had the MICRO paper accepted and would not have my dissertation. I also want to thank
my cute daughter, Charlie, who brought me plenty of luck, with 3 papers accepted by 3
top-tier conferences and journals in 2020. Although the birth of Charlie ruined my sound
sleep, she makes my life more promising and brighter. My dear grandmother passed away
in 2020. I did not have the chance to go back to see her. I hope my Ph.D. degree can
make her smile in paradise.
The research that forms the basis of this dissertation has been partially funded by the
NSF through Awards CNS-1405695, CCF-1618303/7960, CCF-1618303, CCF-1919130,
and CNS-1925504; by the NIH through Award R44GM128533; by grants from Microsoft
and Red Hat; by Xilinx and by Intel through donated FPGAs, tools, and IP; by Pacific
Northwest National Laboratory’s DMC-CFA project and DeepScience-HPC LDRD
project; and by the U.S. DOE Office of Science, Office of Advanced Scientific Computing
Research, under award 66150: “CENATE - Center for Advanced Architecture Evaluation.”
FPGA-BASED HIGH-PERFORMANCE NEURAL NETWORK
ACCELERATION
TONG GENG
Boston University, College of Engineering, 2021
Major Professor: Martin C. Herbordt, PhD
Professor of Electrical and Computer Engineering
ABSTRACT
In the last ten years, Artificial Intelligence through Deep Neural Networks (DNNs)
has penetrated virtually every aspect of science, technology, and business. Advances are
rapid, with thousands of papers being published annually. Many types of DNNs have been
and continue to be developed – in this thesis we address Convolutional Neural Networks
(CNNs), Recurrent Neural Networks (RNNs), and Graph Neural Networks (GNNs) – each
with a different set of target applications and implementation challenges. The overall
problem for all of these Neural Networks (NNs) is that their target applications generally pose
stringent constraints on latency and throughput, but also have strict accuracy requirements.
Much research has therefore gone into all aspects of improving NN quality and
performance: algorithms, code optimization, acceleration with GPUs, and acceleration with
hardware, both dedicated ASICs and off-the-shelf FPGAs. In this thesis we concentrate on the
last of these approaches.
There have been many previous efforts in creating hardware to accelerate NNs. The
problem designers face is that optimal NN models typically have significant irregularities,
making them hardware unfriendly. One commonly used approach is to train NN models to
follow regular computation and data patterns. This approach, however, can hurt the
models’ accuracy or lead to models with non-negligible redundancies. This dissertation takes
a different approach. Instead of regularizing the model, we create architectures friendly to
irregular models. Our thesis is that high-accuracy and high-performance NN inference
and training can be achieved by creating a series of novel irregularity-aware
architectures for Field-Programmable Gate Arrays (FPGAs). In four different studies on
four different NN types we find that this approach results in speedups of 2.1× to 3255×
compared with carefully selected prior art; for inference, there is no change in accuracy.
The bulk of this dissertation revolves around these studies, the various workload
balancing techniques, and the resulting NN acceleration architectures. In particular, we propose
four different architectures to handle, respectively, data structure level, operation level, bit
level, and model level irregularities.
At the data structure level, we propose AWB-GCN, which uses runtime workload
rebalancing to handle Sparse Matrix Multiplication (SpMM) on extremely sparse and
unbalanced input. With GNN inference as a case study, AWB-GCN achieves over 90%
system efficiency, guarantees efficient off-chip memory access, and provides considerable
speedups over CPUs (3255×), GPUs (80×), and a prior ASIC accelerator (5.1×).
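To make the imbalance problem concrete, the following Python sketch compares two ways of assigning sparse-matrix rows to processing elements (PEs). It is an illustration only and is not the AWB-GCN mechanism: the row non-zero counts are synthetic, the function names are hypothetical, and the greedy least-loaded assignment is a simple offline stand-in for the runtime rebalancing described above.

```python
import heapq


def utilization(loads):
    """Average PE load divided by the maximum PE load (1.0 = perfectly balanced)."""
    return sum(loads) / (len(loads) * max(loads))


def static_partition(row_nnz, num_pes):
    """Assign contiguous, equal-sized blocks of rows to PEs (no balancing)."""
    rows_per_pe = -(-len(row_nnz) // num_pes)  # ceiling division
    loads = [0] * num_pes
    for row, nnz in enumerate(row_nnz):
        loads[row // rows_per_pe] += nnz
    return loads


def greedy_rebalance(row_nnz, num_pes):
    """Assign rows, heaviest first, to whichever PE currently has the least work.

    Assumption: this offline heuristic only illustrates the benefit of
    balancing; it is not the hardware technique used in the dissertation.
    """
    heap = [(0, pe) for pe in range(num_pes)]
    for nnz in sorted(row_nnz, reverse=True):
        load, pe = heapq.heappop(heap)
        heapq.heappush(heap, (load + nnz, pe))
    loads = [0] * num_pes
    for load, pe in heap:
        loads[pe] = load
    return loads


# Synthetic power-law-like input: a few very dense rows, many nearly empty ones.
row_nnz = [64, 32, 16, 8] + [1] * 60
static_loads = static_partition(row_nnz, 4)
balanced_loads = greedy_rebalance(row_nnz, 4)
print(static_loads, round(utilization(static_loads), 2))
print(balanced_loads, round(utilization(balanced_loads), 2))
```

On this synthetic input the static partition places all dense rows on one PE, giving a utilization of about 0.34, while the greedy assignment reaches about 0.70. The single 64-element row still bounds the result, which suggests why row-granular assignment alone may not be enough for extremely skewed inputs.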
At the operation level, we propose O3BNN-R, which can detect redundant operations
and prune them at run time. This works even for those that are highly data-dependent and
unpredictable. With Binarized NNs (BNNs) as a case study, O3BNN-R can prune over
30% of the operations, without any accuracy loss, yielding speedups over state-of-the-art
implementations on CPUs (1122×), GPUs (2.3×), and FPGAs (2.1×).
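The idea of pruning operations at run time without changing the result can be sketched for a single binarized neuron. The Python example below is a minimal illustration under stated assumptions: the neuron outputs 1 when the number of matching input/weight bits (XNOR followed by popcount) reaches a threshold, and accumulation stops as soon as the comparison can no longer change. The function names and the exact stopping conditions are hypothetical and are not taken from the O3BNN-R design.

```python
def binarized_neuron(inputs, weights, threshold):
    """Reference: full XNOR-popcount over all edges, then threshold compare."""
    matches = sum(1 for x, w in zip(inputs, weights) if x == w)
    return int(matches >= threshold)


def binarized_neuron_pruned(inputs, weights, threshold):
    """Same output as the reference, but skips edges that cannot matter.

    Returns (output, number_of_edges_evaluated).
    """
    n = len(inputs)
    matches = 0
    for evaluated, (x, w) in enumerate(zip(inputs, weights), start=1):
        matches += int(x == w)  # XNOR of two bits
        remaining = n - evaluated
        if matches >= threshold:
            return 1, evaluated  # threshold already reached
        if matches + remaining < threshold:
            return 0, evaluated  # threshold can no longer be reached
    return int(matches >= threshold), n


inputs = [1, 0, 1, 1, 0, 0, 1, 0]
weights = [1, 0, 1, 1, 1, 1, 0, 0]
print(binarized_neuron_pruned(inputs, weights, 4))  # stops after 4 of 8 edges
print(binarized_neuron_pruned(inputs, weights, 7))  # stops after 6 of 8 edges
```

Because the loop exits only when the final comparison is already decided, the output always equals that of the unpruned reference. How many edges are skipped depends on the data, which is why such savings are hard to predict ahead of time.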
At the bit level, we propose CQNN. CQNN embeds a Coarse-Grained Reconfigurable
Architecture (CGRA) which can be programmed at runtime to support NN functions with
various data-width requirements. Results show that CQNN can deliver µs-level Quantized
NN (QNN) inference.
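One way to see how binary hardware components can serve layers with different data-widths is to decompose a multi-bit multiplication into single-bit operations. The Python sketch below assumes unsigned operands and hypothetical function names; it shows only the arithmetic identity and is not a description of the CQNN hardware.

```python
def bits(value, width):
    """Little-endian list of the lowest `width` bits of an unsigned integer."""
    return [(value >> i) & 1 for i in range(width)]


def bitwise_multiply(a, a_width, b, b_width):
    """Multiply two unsigned integers using only 1-bit ANDs, shifts, and adds.

    Returns (product, number_of_single_bit_partial_products).
    """
    total = 0
    partial_products = 0
    for i, a_bit in enumerate(bits(a, a_width)):
        for j, b_bit in enumerate(bits(b, b_width)):
            total += (a_bit & b_bit) << (i + j)  # 1-bit product, weighted by position
            partial_products += 1
    return total, partial_products


# A 2-bit value times a 3-bit value needs 2 * 3 = 6 single-bit products.
print(bitwise_multiply(3, 2, 5, 3))
```

The number of single-bit products grows as the product of the two data-widths, so an engine built from binary components needs more of them, or more cycles, as precision increases.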
At the model level, we propose FPDeep especially for training. In order to address
model-level irregularity, FPDeep uses novel model partitioning schemes to balance
workload and storage among nodes. By using a hybrid of model and layer parallelism to train
DNNs, FPDeep avoids the large gap that commonly occurs between training and testing
accuracy due to the improper convergence to sharp minimizers (caused by large training
batches). Results show that FPDeep provides scalable, fast, and accurate training and leads





2.1 GCN Structure and Characteristics . . . . . . . . . . . . . . . . . . . . . . 12
2.1.1 Graph Convolutional Network Structure . . . . . . . . . . . . . . . 12
2.1.2 Characteristics of Power-Law Graphs . . . . . . . . . . . . . . . . 14
2.2 Binarized Neural Network and Quantized Neural Network Structure . . . . 15
2.3 Distributive CNN Training . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4 FPGAs In Neural Network Acceleration . . . . . . . . . . . . . . . . . . . 20
2.4.1 FPGA Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4.2 FPGA in Neural Network Acceleration . . . . . . . . . . . . . . . 22
3 AWB-GCN: A Graph Convolutional Network Accelerator with Runtime
Workload Autotuning 24
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2 GCN Baseline Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2.1 Matrix Computation Order . . . . . . . . . . . . . . . . . . . . . . 29
3.2.2 SpMM Execution Order and Mapping . . . . . . . . . . . . . . . . 30
3.2.3 Design of Baseline Architecture . . . . . . . . . . . . . . . . . . . 31
3.2.4 Pipelining SpMM Chains . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.5 The Workload Balance Problem . . . . . . . . . . . . . . . . . . . 36
3.3 AWB-GCN Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3.1 Distribution Smoothing . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3.2 Remote Switching . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3.3 Evil Row Remapping . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4.1 Evaluation Configuration . . . . . . . . . . . . . . . . . . . . . . . 46
3.4.2 AWB-GCN Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 47
3.4.3 Scalability of AWB-GCN . . . . . . . . . . . . . . . . . . . . . . 53
3.4.4 Cross-platform Comparison . . . . . . . . . . . . . . . . . . . . . 53
3.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4 LP-BNN: Ultra-low-Latency BNN Inference with Layer Parallelism 61
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.3 Network Structure Optimization . . . . . . . . . . . . . . . . . . . . . . . 66
4.4 Design Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.4.1 Layer Fusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.4.2 FPGA Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5 O3BNN-R: An Out-of-Order Architecture for High-Performance BNN Inference
with redundant operation pruning 82
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.2 Motivation for Redundant Edge Pruning . . . . . . . . . . . . . . . . . . . 86
5.3 Pruning Opportunities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.3.1 Basic BCONV design . . . . . . . . . . . . . . . . . . . . . . . . 87
5.3.2 Threshold-based Edge pruning . . . . . . . . . . . . . . . . . . . . 89
5.3.3 Pooling-based edge pruning . . . . . . . . . . . . . . . . . . . . . 90
5.4 Out-of-Order BNN Pruning Design . . . . . . . . . . . . . . . . . . . . . . 91
5.4.1 Parallelization Strategy . . . . . . . . . . . . . . . . . . . . . . . . 91
5.4.2 Rotative Workload Scheduling . . . . . . . . . . . . . . . . . . . . 93
5.4.3 O3BNN-R Architecture . . . . . . . . . . . . . . . . . . . . . . . 94
5.4.4 Design Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.5 Regularized Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.5.1 Regularization for Threshold-based Pruning . . . . . . . . . . . . . 101
5.5.2 Regularization for Pooling-based Pruning . . . . . . . . . . . . . . 102
5.6 Evaluation and Experimental Results . . . . . . . . . . . . . . . . . . . . . 104
5.6.1 Ideal Pruning Rate vs Network Accuracy . . . . . . . . . . . . . . 105
5.6.2 Hardware Demand versus Performance . . . . . . . . . . . . . . . 109
5.6.3 Cross-platform Evaluation . . . . . . . . . . . . . . . . . . . . . . 113
5.7 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6 CQNN: a CGRA-based architecture for Mixed-precision DNNs 119
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.3 BNN Module Integration for QNN . . . . . . . . . . . . . . . . . . . . . . 122
6.3.1 Q-CONV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
6.3.2 QT-BN Module . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.3.3 Q-POOL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
6.4 Design of CQNN Framework . . . . . . . . . . . . . . . . . . . . . . . . . 125
6.4.1 Framework Overview . . . . . . . . . . . . . . . . . . . . . . . . . 126
6.4.2 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
6.4.3 Compiler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
6.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
6.5.1 Performance with different data-widths . . . . . . . . . . . . . . . 132
6.5.2 Cross-platform Comparison . . . . . . . . . . . . . . . . . . . . . 133
6.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
7 FPDeep: Scalable Acceleration of CNN Training on Deeply-Pipelined FPGA
Clusters 135
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
7.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
7.3 FPDeep Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
7.3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
7.3.2 Operator Graph Partitioning Methodology . . . . . . . . . . . . . . 143
7.3.3 Design Choices in Operator Graph Partitioning . . . . . . . . . . . 145
7.3.4 Mathematical Model of FPDeep . . . . . . . . . . . . . . . . . . . 148
7.4 System Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
7.4.1 Input/output channel partition implementation . . . . . . . . . . . . 153
7.4.2 Dataflow Analysis and Interconnection Topology . . . . . . . . . . 154
7.4.3 Deep Fine-Grained Pipeline . . . . . . . . . . . . . . . . . . . . . 157
7.4.4 Parameter Alignment . . . . . . . . . . . . . . . . . . . . . . . . . 159
7.5 Hardware Accelerator Architecture . . . . . . . . . . . . . . . . . . . . . . 160
7.5.1 Overall Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 160
7.5.2 Single-FPGA Architecture . . . . . . . . . . . . . . . . . . . . . . 163
7.5.3 Forward Propagation (FP) . . . . . . . . . . . . . . . . . . . . . . 165
7.5.4 Error Back-Propagation (EB) . . . . . . . . . . . . . . . . . . . . 166
7.5.5 Parameter Gradient Calculation (PG) . . . . . . . . . . . . . . . . 166
7.6 Experiments and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 167
7.6.1 Small Scale Cluster Experiments . . . . . . . . . . . . . . . . . . . 167
7.6.2 Large Scale Cluster Experiments . . . . . . . . . . . . . . . . . . . 167
7.6.3 Utilization and Performance . . . . . . . . . . . . . . . . . . . . . 169
7.6.4 DNN Model Convergence . . . . . . . . . . . . . . . . . . . . . . 173
7.7 Discussion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . 173
8 CSB-RNN: A Faster-than-Realtime RNN Acceleration Framework 176
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
8.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
8.2.1 Temporal Sequence Processing with RNN . . . . . . . . . . . . . . 178
8.2.2 RNN Weight Pruning Techniques . . . . . . . . . . . . . . . . . . 180
8.3 CSB Pruning Technique . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
8.3.1 A Novel Structured Sparse Weight Format . . . . . . . . . . . . . . 183
8.3.2 CSB Pruning Flow with ADMM . . . . . . . . . . . . . . . . . . . 185
8.4 Unified Architecture for CSB-RNN . . . . . . . . . . . . . . . . . . . . . 187
8.4.1 Overview of Acceleration Framework . . . . . . . . . . . . . . . . 187
8.4.2 Programmable RNN Dataflow Architecture . . . . . . . . . . . . . 188
8.4.3 CSB-Engine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
8.5 Compilation for CSB Pruned Model . . . . . . . . . . . . . . . . . . . . . 193
8.5.1 RNN Cell to Dataflow Architecture . . . . . . . . . . . . . . . . . 194
8.5.2 Workload Scheduling on CSB-Engine . . . . . . . . . . . . . . . . 196
8.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
8.6.1 Implementation and Experiments Setup . . . . . . . . . . . . . . . 199
8.6.2 Evaluation of CSB pruning Rate . . . . . . . . . . . . . . . . . . . 200
8.6.3 Evaluation of RNN dataflow Architecture with CSB Pruned Model 204
8.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
9 Conclusions and Future Work 210
9.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210





2.1 Matrix density and dimensions of 5 widely-used GCN datasets. . . . . . . . 15
2.2 Definitions of symbols used in the equations . . . . . . . . . . . . . . . . . 20
3.1 Operations required under different exec orders. . . . . . . . . . . . . . . . 29
3.2 Comparison with CPU, GPU and Baseline processing Standard_networks.
OoM: Out of Memory. Units of Latency and Energy efficiency are ms and
graph/kJ. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.3 Comparison with the prior art, HyGCN, processing HyGCN_networks customized
in HyGCN paper [Yan et al., 2020]. Units of Latency and Energy
efficiency are ms and graph/kJ. . . . . . . . . . . . . . . . . . . . . . . . 56
4.1 Resource usage of LUTs, FFs, BRAMs, and DSPs for VGG inference with/without
workload balancing (same latency) . . . . . . . . . . . . . . . . . 76
4.2 Latency, Performance, Energy Efficiency comparison using different templates,
GPUs (Tesla K40 [Liang et al., 2018], V100 [Li et al., 2019a],
and GTX 1080 [Hu et al., 2018]), FPGAs (Stratix V [Liang et al., 2018],
VCU108 [Ghasemzadeh et al., 2018]), CPUs (Xeon E5-2640 [Liang et al.,
2018], Phi 7210 [Hu et al., 2018], and i7-7700 [Hu et al., 2018]) to execute
inference of 4 Networks: Cifar-10 VGG-like [Courbariaux et al., 2015],
ImageNet AlexNet [Krizhevsky et al., 2012], VGGNet-16 [Simonyan and
Zisserman, 2015] and ResNet-18 [Andri et al., 2018]. . . . . . . . . . . . . 79
5.1 Structures of the Networks used to evaluate O3BNN-R. 512FC refers to a
fully-connected layer with 512 neurons. 2x128C3 refers to 2 convolution
layers with 128 output channels and 3x3 filters. MP2 refers to a 2x2
max-pooling layer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.2 Latency, hardware demand, and accuracy of baseline and O3BNN-Rs with
different OoO capabilities (1, 2, 3) and with lossless or lossy pruning for
non-regularized models. Both the baseline design and O3BNN-R implementations
are equipped with 512 PEs. For lossy pruning the relaxing
factors used in VGG-Like, AlexNet, and VGG-16 are 0.7, 0.85, and 0.9,
respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.3 Cross-platform evaluation of Latency, Energy Efficiency, and Accuracy:
VGG-like-FINN [Blott et al., 2018] for Cifar-10. O3BNN refers to the
design without regularization techniques; O3BNN-R refers to the design
with both regularization techniques. . . . . . . . . . . . . . . . . . . . . . 110
5.4 Cross-platform evaluation of Latency, Energy Efficiency, and Accuracy:
VGG-like [Courbariaux et al., 2015] for Cifar-10. O3BNN refers to the
design without regularization techniques; O3BNN-R refers to the design
with both regularization techniques. . . . . . . . . . . . . . . . . . . . . . 111
5.5 Cross-platform evaluation of Latency, Energy Efficiency, and Accuracy:
AlexNet [Krizhevsky et al., 2012] for ImageNet. O3BNN refers to the design
without regularization techniques; O3BNN-R refers to the design with
both regularization techniques. Our results are compared with 3 existing
works with GPU V100 [Li et al., 2019a], FPGA VCU108 [Ghasemzadeh
et al., 2018], and FPGA Stratix-V [Liang et al., 2018]. . . . . . . . . . . . . 111
5.6 Cross-platform evaluation of Latency, Energy Efficiency, and Accuracy:
VGGNet-16 [Simonyan and Zisserman, 2015] for ImageNet. O3BNN
refers to the design without regularization techniques; O3BNN-R refers
to the design with both regularization techniques. . . . . . . . . . . . . . . 112
6.1 Execution latency (µs) and BCONV utilization of QNN layers with different
numbers of input channels (NIC) and data-widths (DW) of features and
parameters. All layers have 2×2 MAX-Pooling and 128 output channels.
Image size is 128 × 128. . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
6.2 Cross-platform comparison and evaluation: inference latency in ms, energy
efficiency in img/kJ. CQNNs are compared to existing FPGA QNN works
(e.g. Stratix-V [Liang et al., 2018], VCU108 [Ghasemzadeh et al., 2018],
ZC706 [Geng et al., 2021], and KCU1500 [Geng et al., 2019a]) and a GPU
TensorFlow-based implementation [Li et al., 2019a]. . . . . . . . . . . . . 133
6.3 Structures of the networks used to evaluate CQNN. . . . . . . . . . . . . . 134
7.1 Performance of small-batch (SB) and large-batch (LB). Note that LB does
not decrease training accuracy, but reduces the test accuracy [Keskar et al.,
2016] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
7.2 Hardware Constraint Parameters . . . . . . . . . . . . . . . . . . . . . . . 143
7.3 Operator Partition Choice. Different operator graph partition design
choices make possible different parallelizability methods. . . . . . . . . . . 147
7.4 Qualitative comparison, from 1 (worst) to 4 (best), of different partition
methods with respect to various parameters . . . . . . . . . . . . . . . . . 150
7.5 Cluster-level experimental results. All CPU, GPU, and FPGA implementations
use single precision floating point. [Zhang et al., 2016a], [LeaderGPU, 2018a],
and [LeaderGPU, 2018b] do not give experiment results of
training time per epoch . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
8.1 Benchmark Models in CSB-RNN Evaluation. MT: Machine Translation;
SR: Speech Recognition; SPP: Stock Price Prediction; SC: Sentiment Classification;
QA: Question Answering; App: Application; #L: Layer; LI:
Layer Index; IN: Input Neuron; HN: Hidden Neuron; EM: Evaluation Metric;
PPL: Perplexity; PER: Phoneme Error Rate; Acc: Accuracy; NPD:
Normalized Price Dist. Datasets used are PTB [Marcus et al., 1993],
TIMIT [Garofolo et al., 1993], TDIGIT [Leonard et al., 1993], S&P500,
IMDB [Maas et al., 2011], MR [Pang and Lee, 2005], and BABI [Weston
et al., 2015]. RNN cell types used are LSTM [Hochreiter and Schmidhuber,
1997], LSTMP [Sak et al., 2014], GRU [Cho et al., 2014], and
Li-GRU [Ravanelli et al., 2018] . . . . . . . . . . . . . . . . . . . . . . 201
8.2 Pruning Rate Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
8.3 Latency and Power Efficiency Comparison . . . . . . . . . . . . . . . . . . 208
List of Figures
2·1 Illustration of a GCONV layer in GCNs. . . . . . . . . . . . . . . . . . . . 14
2·2 Non-zero distribution imbalance of Adjacency matrices in Cora, Citeseer,
Pubmed, Nell and Reddit datasets. . . . . . . . . . . . . . . . . . . . . . . 15
2·3 The structures of BNNs with original BN in [Rastegari et al., 2016] . . . . . 16
2·4 A simple 3-CONV-1-FC BNN Network structure. It is similar to DNN,
except that Activation acts as QUANT. QNN model structure can be further
optimized. Floating-point BN and QUANT functions can be merged into
QT-BN with multiple thresholds. . . . . . . . . . . . . . . . . . . . . . . . 18
2·5 Illustration of computations involved in CNN training including datapaths
for forward and backward propagation and parameter gradient update. . . . 19
2·6 Execution details of forward and backward propagation with zoom-in on
adjacent CONV layers, operations, and data dependencies. . . . . . . . . . 19
3·1 Histograms show ordered non-zero per-row density. Left: Adjacency matrix
of the NELL graph (avg. density: 0.0073%) has most of its non-zeros
clustered in 70/66k rows. Right: Unstructured compressed AlexNet weight
matrix (avg. density: 27%) has workload roughly balanced across 384 rows. 26
3·2 Adjacency matrix of NELL following power-law distribution: elements are
clustered regionally and in a few rows/cols. The matrix density is 0.0073%.
For better visualization, non-zero dots are enlarged. . . . . . . . . . . . . . 27
3·3 AWB-GCN utilization improvement per round. . . . . . . . . . . . . . . . 28
3·4 (A) SpMM computation order: Column-wise-product; (B) Matrix partitioning
& mapping among PEs. . . . . . . . . . . . . . . . . . . . . . . . . 31
3·5 Architecture of the proposed baseline SpMM engine. . . . . . . . . . . . . 32
3·6 Pipelined SpMMs: data production and consumption rates match across
consecutive SpMMs by allocating PEs in proportion to workload sizes. . . . 35
3·7 Matrix Blocking Optimization to reduce the off-chip bandwidth requirement.
The sub-SpMM of each pair of blocks is performed in column-wise-product
order. The numbers represent execution orders. . . . . . . . . . . . 36
3·8 PE utilization waves of 256-PE Baseline SpMM engine processing A × (XW)
of Nell and Citeseer. . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3·9 Rebalancing process: distribution smoothing, remote switching and row
remapping per round. (a)&(b): 1st round; (c)&(d): 2nd round. . . . . . . . 39
3·10 Simplified architecture of distribution smoothing. . . . . . . . . . . . . . . 41
3·11 Overall architecture of SpMM engine in AWB-GCN with three rebalancing
techniques: distribution smoothing, remote switching (red bordered) and
evil row remapping (purple bordered). Here every 128 PEs have one Super-PE
and four Labor-PEs. These numbers can be customized. . . . . . . . . 42
3·12 Overall performance and PE utilization of 1K-PE AWB-GCN with five
design choices. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3·13 Per-SpMM performance and PE utilization of 1K-PE AWB-GCN with five
design choices. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3·14 (A) Hardware resource consumption normalized to the number of ALMs
and (B) On-chip storage demand of 1K-PE AWB-GCN. . . . . . . . . . . . 50
3·15 AWB-GCN PE (1K) average utilization per round of workload autotuning. . 51
3·16 Scalability evaluation: PE utilization and overall performance of Baseline,
Design(B) and Design(D) of AWB-GCNs with 512, 1K, 2K and 4K PEs. . . 52
4·1 The original and simplified structures of Binary-ResNet. . . . . . . . . . . 67
4·2 Portion of execution time for different functions. 4 pairs of adjacent CONV
layers in ImageNet ResNet-18, 6 CONV layers and 2 FC layers in Cifar
VGG, 13 CONV layers and 3 FC layers in ImageNet VGG are measured.
For each layer, a pair of bars is given in this chart. The left and right ones
are for original and optimized networks. The execution time of each function
is divided by the execution time of the whole layer with the original
network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4·3 Data dependency of BNN and the proposed fine-grained inter-layer pipelining
for layer fusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4·4 Latencies of executing each layer of VGGNet inference and the overall
latency of the whole VGGNet when using inter-layer fusion . . . . . . . . . 72
4·5 Parameterized architecture of single CONV layer: SOC, SIC, POC, PIC
can be tuned for workload balancing and to degrade the parallelism . . . . . 73
4·6 DSE flow and FF utilizations with different combinations of PIC and POC
when fixed parallelisms are used at the 7th CONV layer of VGGNet . . . . 78
5·1 Three types of pruning: (A) & (B) Threshold-based edge pruning; by accumulating
the inputs (ix) and comparing the accumulation result to a threshold,
the value of a neuron (Out) is calculated (binary 1/0). (C) Pooling-based
edge pruning. Out from pooling is binary (1/0). . . . . . . . . . . . . . 84
5·2 A typical 3-CONV-1-FC BNN Network structure. It is similar to DNN,
except that Activation acts as BIN, Multiplication acts as XNOR, Accumulation
acts as POPCOUNT. . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5·3 Pseudo code of a traditional BCONV/BFC without pruning and the symbols
of edge and curve. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5·4 (A): illustration of the evaluation process of an output neuron using
threshold-based BN function; (B)&(C): Conditions of threshold-based
edge pruning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5·5 (A): BNN structure used in this work: threshold-based BN followed by
POOLING; (B): The condition of Pooling-based Edge Pruning. . . . . . . . 91
5·6 Three methods of workload scheduling as described in the text. . . . . . . . 92
5·7 Overall architecture of O3BNN-R; architectures of PE array, Scoreboard
and DFS are shown in Figures 5·8, 5·9 and 5·10. . . . . . . . . . . . . . . 95
5·8 Architecture of PE array . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5·9 O3BNN-R Scoreboard . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5·10 Architecture of DFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5·11 Workload execution of O3BNN-R with different OoO capabilities. Four
PEs are shown. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5·12 Regularization during Training . . . . . . . . . . . . . . . . . . . . . . . . 103
5·13 Pruning rate vs Accuracy trade-off with different relaxing factors. When
the relaxing factor is 1, pruning is lossless. Green lines and bars are for
original models; pink, orange and blue lines and bars are for models trained
with different combinations of regularization techniques. . . . . . . . . . . 104
5·14 Pruning rates at different layers of the non-regularized models with dif-
ferent relaxing factors. When relaxing factor is 1, threshold relaxing is
disabled and pruning is lossless. conv-l-p refers to the l-th CONV layer
followed by max-pooling. fc-l refers to the l-th fully connected layer. . . . 106
5·15 Performance and hardware consumption of O3BNN-Rs with different
OoO capabilities and with lossless (without threshold relaxing) or lossy
(with threshold relaxing) pruning for the non-regularized models. 512-PE
O3BNN-Rs are compared to 512-PE baseline without pruning and ideal
design with theoretically perfect pruning. The relaxing factors for lossy
pruning used in VGG-Like, AlexNet and VGG-16 are 0.7, 0.85, and 0.9. . . 108
6·1 QCONV for a QNN layer with 2-bit features and 3-bit parameters and the
transformation from a 2-bit×3-bit multiplication into 6 bit-add operations. . 124
6·2 QT-BN architecture for a QNN layer with 3-bit features. . . . . . . . . . . 125
6·3 Q-POOL architecture for a QNN layer with 3-bit features. . . . . . . . . . 126
6·4 Framework of CQNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
6·5 The overall architecture of CQNN including CP and CGRA array. Black
blocks are switches. Each Switch connected to a red block is equipped
with an accumulator and a left shift logic. The blocks covered by the gray
window can be integrated as an engine for 2-bit×5-bit QNN layers. . . . . 128
6·6 CGRA Array details and 3 types of integration for a QNN engine for a layer
with 32 input channels, 3-bit features, and 2-bit parameters. (A) Default
configuration implements a QNN engine by grouping binary components in
a 6×6 raw CGRA array. (B) & (C) Implements QNN engines by grouping
binary components in a 3×12 & 12×3 raw CGRA arrays. All components
are pipelined. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
6·7 Engine mapping with 3 types of configuration . . . . . . . . . . . . . . . . 130
6·8 Instruction structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
7·1 An operator graph is used to represent DNN training. Nodes are operators
and edges are tensors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
7·2 (A) Overview of the FPDeep Framework. The operator graph and hardware
constraints are input parameters. (B) FPDeep contains two phases: map-
ping and implementation. (C) The proposed DNN operation graph partition
methodology with ResNets and Inception. . . . . . . . . . . . . . . . . . . 144
7·3 Illustration of operator graph partition design choices: (A) Data paral-
lelism, (B) Layer parallelism, (C) Model parallelism, and (D) Hybrid par-
allelism (Layer + Model) . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
7·4 Comparison of different operator graph partition methods accounting for
four different metrics: FLOP utilization, storage requirement, communica-
tion footprint, and average communication bandwidth . . . . . . . . . . . . 149
7·5 Parameters and activations for VGG-16 . . . . . . . . . . . . . . . . . . . 152
7·6 Partitioning image input/output channels ic/oc . . . . . . . . . . . . . . . . 154
7·7 Data flow analysis of CNN training. FPDeep pipelines the reduction oper-
ations and maps them to multiple FPGAs. . . . . . . . . . . . . . . . . . . 155
7·8 1D-2D topology design choice: while 2D seems the obvious choice, clearly
1D has better performance . . . . . . . . . . . . . . . . . . . . . . . . . . 156
7·9 1D-2D topology performance comparison . . . . . . . . . . . . . . . . . . 158
7·10 FPDeep’s fine-grained pipeline design showing (A) data dependencies of
CNN training; (B) traditional data parallelism’s coarse-grained pipeline;
(C) FPDeep’s fine-grained pipeline. . . . . . . . . . . . . . . . . . . . . . 161
7·11 Overall architecture of FPDeep accelerator and block design of each FPGA
illustrating (A) the overall architecture of FPDeep; FPGAs can work coop-
eratively on the same layer; also, multiple layers can be mapped on the
same FPGA; (B) architecture of FPGA m, which is allocated to both layer1
and layer 2; (C) architecture of FPGA n+ 1, which is fully allocated to
layer 3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
7·12 Hardware evaluation and MPI-based software simulator . . . . . . . . . . . 168
7·13 Experimental results and utilization report when mapping AlexNet and
VGG-16/19 to a cluster with 15 FPGAs . . . . . . . . . . . . . . . . . . . 169
7·14 Roofline models, percent idle stages, and epochs per hour of AlexNet,
VGGNet-16, and VGGNet-19 . . . . . . . . . . . . . . . . . . . . . . . . 171
7·15 FPDeep’s performance scalability and convergence rate. . . . . . . . . . . 172
8·1 Computation flow of RNN inference. Note that there are multiple RNN
cell types. The main workload is matrix-vector multiplication (MVM). . . . 179
8·2 CSB pruning takes advantage of both non-structured (random) pruning (a)
and coarse-grained structured (row/column) pruning (b). . . . . . . . . . . 180
8·3 A novel structured sparse matrix (CSB) with its dedicated storage format,
which benefits both the pruning flexibility and hardware parallelism. . . . . 183
8·4 Overview of CSB RNN acceleration framework, including (i) CSB pruning
algorithm, (ii) unified RNN dataflow architecture, (iii) workload compila-
tion with CSB pruned model. . . . . . . . . . . . . . . . . . . . . . . . . . 188
8·5 RNN dataflow architecture. Operation units serve the RNN arithmetic
primitives; The programmable datapaths construct the proper dataflow for
target RNN cell via instructions. . . . . . . . . . . . . . . . . . . . . . . . 189
8·6 Two-level hierarchical organization of CSB-Engine for the main workload
(CSB-MVM) computation. . . . . . . . . . . . . . . . . . . . . . . . . . . 191
8·7 Inter-block workload imbalance issue occurs when mapping the CSB
pruned matrix (a) to the vanilla (basic) CSB-Engine (b), which results in a
low hardware utilization. We propose the workload sharing technique that
significantly increases the utilization and reduces the time consumption, as
demonstrated in (c). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
8·8 Macro-instruction set (VLIW-like) for RNN dataflow architecture that con-
structs proper arithmetic dataflow for different RNN cell types. . . . . . . . 195
8·9 Micro-instruction indicates the kernel matrix workload and the scheduling
of partition for workload balancing. . . . . . . . . . . . . . . . . . . . . . 195
8·10 (a) shows the pruning rate comparison between non-structured pruning (op-
timum) and CSB pruning in different block sizes. (b) shows the normalized
index overhead (NIO). Comparing (a) and (b), we gain the insight that CSB
pruning dramatically reduces the NIO while maintaining a high pruning rate. . 202
8·11 Hardware resource consumption with multi CSB-Engine configs. . . . . . . 205
8·12 The efficiency (utilization) of the proposed architecture with different shar-
ing strategies. The novel workload sharing technique significantly im-
proves the average efficiency from 42% (no-sharing) to 94% (2D-sharing).
This improvement fully exploits the benefits of fine-grained CSB pruning. . 207
9·1 FPGA-based acceleration framework for irregular models and the integra-
tion of the architectures of AWB-GCN, FPDeep, O3BNN-R and CQNN. . . 213
List of Abbreviations
ACC . . . . . . . . . . . . . Accumulation operation
ADD . . . . . . . . . . . . . Addition operation
ADMM . . . . . . . . . . . . . Alternating Direction Method of Multipliers
AGU . . . . . . . . . . . . . Address Generation Unit
ASIC . . . . . . . . . . . . . Application-Specific Integrated Circuit
AVX . . . . . . . . . . . . . Advanced Vector Extensions
BIN . . . . . . . . . . . . . Binarization
BN . . . . . . . . . . . . . Batch Normalization
BNN . . . . . . . . . . . . . Binarized Neural Network
BRAM . . . . . . . . . . . . . Block Random Access Memory
BWN . . . . . . . . . . . . . Binarized Weight Network
CGRA . . . . . . . . . . . . . Coarse-Grained Reconfigurable Architecture
CNN . . . . . . . . . . . . . Convolutional Neural Network
CONV . . . . . . . . . . . . . Convolution layer
CPU . . . . . . . . . . . . . Central Processing Unit
CSC . . . . . . . . . . . . . Compressed Sparse Column
DDR . . . . . . . . . . . . . Double Data Rate
DIV . . . . . . . . . . . . . Division operation
DNN . . . . . . . . . . . . . Deep Neural Network
DSE . . . . . . . . . . . . . Design Space Exploration
D-SGD . . . . . . . . . . . . . Distributed synchronous Stochastic Gradient Descent
DSL . . . . . . . . . . . . . Data-Structure-Level
FC . . . . . . . . . . . . . Fully Connected
FIFO . . . . . . . . . . . . . First In, First Out
FLOPS . . . . . . . . . . . . . Floating Point Operations Per Second
FP . . . . . . . . . . . . . Floating-Point
FPGA . . . . . . . . . . . . . Field-Programmable Gate Array
GCN . . . . . . . . . . . . . Graph Convolutional Network
GCONV . . . . . . . . . . . . . Graph Convolutional layer
GNN . . . . . . . . . . . . . Graph Neural Network
GPU . . . . . . . . . . . . . Graphics Processing Unit
HBM . . . . . . . . . . . . . High Bandwidth Memory
HDL . . . . . . . . . . . . . Hardware Description Language
HPC . . . . . . . . . . . . . High-Performance Computing
HW . . . . . . . . . . . . . Hardware
LUT . . . . . . . . . . . . . Look Up Table
MAC . . . . . . . . . . . . . Multiply Accumulate Unit
MGT . . . . . . . . . . . . . Multi Gigabit Transceivers
ML . . . . . . . . . . . . . Machine Learning
MUL . . . . . . . . . . . . . Multiplication operation
NL . . . . . . . . . . . . . Normalization Layer
NN . . . . . . . . . . . . . Neural Network
NOC . . . . . . . . . . . . . Network on Chip
NPE . . . . . . . . . . . . . Number of Processing Elements
OoO . . . . . . . . . . . . . Out of Order
PE . . . . . . . . . . . . . Processing Element
PIM . . . . . . . . . . . . . Processing In Memory
POPCOUNT . . . . . . . . . . . . . Population Count
PyG . . . . . . . . . . . . . PyTorch Geometric
QNN . . . . . . . . . . . . . Quantized Neural Network
RaW . . . . . . . . . . . . . Read after Write
ReLU . . . . . . . . . . . . . Rectified Linear Unit
RNN . . . . . . . . . . . . . Recurrent Neural Network
RTL . . . . . . . . . . . . . Register Transfer Language
SCNN . . . . . . . . . . . . . Sparse Convolutional Neural Network
SGD . . . . . . . . . . . . . Stochastic Gradient Descent
SpMM . . . . . . . . . . . . . Sparse-dense Matrix Multiplication
SSE . . . . . . . . . . . . . Streaming SIMD Extensions





The development of Neural Networks (NNs) has achieved significant progress in the past
decade. Deep Neural Networks (DNNs) are in widespread use due to their ability to learn
well enough to achieve high accuracy [Geng et al., 2020, Ji et al., 2013, Geng et al.,
2018, Karpathy et al., 2014, Wang et al., 2020]. Deep learning paradigms such as Convolutional
Neural Networks (CNNs) [Krizhevsky et al., 2012] and Recurrent Neural Networks
(RNNs) [Mikolov et al., 2010] have been applied to a wide range of applications including
image classification, video processing, speech recognition, and natural language processing
[Kwon et al., 2019, Kwon et al., 2020, Geng et al., 2019a, Geng et al., 2018, Li et al.,
2019a]. These paradigms, however, are only able to extract and analyze latent information
from Euclidean data such as images, video, audio, and text [Wu et al., 2020b, Wang
et al., 2020, Geng et al., 2021], while a large (and increasing) number of applications use
non-Euclidean data structures that are modeled as graphs. Therefore, Graph Neural Networks
(GNNs) have been proposed, in various forms, to extend deep learning approaches
to graph data [Gori et al., 2005, Micheli, 2009, Scarselli et al., 2008, Li et al., 2015, Dai et al.,
2018, You et al., 2018, Abu-El-Haija et al., 2018, Wu et al., 2020b, Gao et al., 2018b]. GNNs
have already been investigated for a large number of real-world applications [Wu et al.,
2020b], including electric grid cascading failure analysis [Liu et al., 2020], prediction of
chemical reactivity [Coley et al., 2019], prediction of synthesized material properties [Xie
and Grossman, 2018], polypharmacy side-effect modeling [Zitnik et al., 2018], accurate
advertisement in e-commerce [Yang, 2019], and cybersecurity [Nguyen et al., 2018].
For NNs operating on either Euclidean or non-Euclidean data, most target applications
pose stringent constraints on latency, especially real-time applications, and require
high accuracy, especially mission- and life-critical applications. To promote the
development and widespread adoption of NNs in the real world, real-time and
accurate NN inference is urgently required. Besides inference, efficient training
is also critical for NN development, especially for improving accuracy, as it
lowers the cost of exploring distinct network structures and training methodologies and
of searching hyper-parameters.
The evolution of computer architecture and the significant increase in the computing
capability of hardware platforms make possible faster-than-real-time inference and real-time
online training. To realize the potential brought by powerful hardware platforms,
researchers have devoted a great deal of effort to the acceleration of NNs using various
types of systems and platforms such as FPGAs, GPUs, and clouds, and have even designed
specialized circuits for NN processing.
In order to pursue NN inference with both low latency and high accuracy, it is neces-
sary to first obtain efficient NN models. Ideally, the optimal models have the following 5
characteristics: (1) they provide satisfactory accuracy for target applications; (2) they use
the fewest bits (mixed data-widths at different channels and layers) to represent
features and weights; (3) their feature and parameter matrices have the fewest non-zeros;
(4) they have the fewest layers and channels; and (5) all operations performed during in-
ference are necessary and contribute to correct information analysis. As NNs are still not
comprehensively explainable, the raw NN models normally contain substantial redundancy.
Therefore, to achieve optimal models, it is necessary to first find redundant bits, data, op-
erations, and channels that contribute but slightly (or not at all) to the analysis of the latent
information and then either remove them during training or skip their processing during
inference.
In general, it is challenging to process optimal NNs efficiently. The challenge is largely
due to the various types of irregularity introduced by redundancy elimination, because
the redundancies of NN models are normally unpredictable and irregular. Regardless
of whether these redundancies are eliminated during training or the redundant operations
are skipped during inference, the computation of optimal NN models becomes highly
irregular and hardware-unfriendly. Therefore, general hardware accelerators cannot process
optimal models with high efficiency, and the potential performance improvement brought
by redundancy elimination cannot be fully realized.
One commonly used approach to enhance hardware efficiency is to regularize the pro-
cess of redundancy elimination [Shi et al., 2020, Wang et al., 2018a, Ding et al., 2017]. In
particular, with this approach, the redundancies in the raw models are eliminated following
regular data and computation patterns. As a result, the resulting models become regular
and can be accelerated in hardware with high utilization, so that the potential performance
gain from redundancy elimination can be substantially realized. This approach, however,
has its own problem: accuracy is sacrificed to obtain more hardware-friendly models.
During regularized redundancy elimination, some informative bits, data, and operations
are eliminated along with the redundant ones in order to form the required regular data
and computation patterns, so the resulting model's accuracy decreases.
One typical example is unstructured versus structured compression. Unstructured compression
[Han et al., 2016] is an approach to obtain nearly-optimal models. It can remove
redundancy without any constraint on compression patterns. Therefore, it is easy
to achieve high compression rates and has the potential to remove the redundant compu-
tations without hurting the accuracy. However, the compressed models are dramatically
irregular and therefore hardware-unfriendly. In contrast, structured compression [Wang
et al., 2018a, Ding et al., 2017], which is an example of regularized redundancy elim-
ination, removes redundancy following regular compression patterns. This guarantees
hardware-friendly models after compression. Nevertheless, to achieve compression
rates comparable to those of unstructured compression, structured compression inevitably
removes informative data processing, which leads to loss of accuracy.
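To make the contrast concrete, the following Python sketch prunes a small, made-up weight matrix in both styles at the same compression rate. It is an illustration only: the matrix values, the function names, and the use of weight magnitude as a stand-in for how informative a weight is are all assumptions, not the procedures of the cited works.

```python
def prune_unstructured(matrix, num_to_remove):
    """Zero the num_to_remove smallest-magnitude entries, wherever they are."""
    ranked = sorted(
        (abs(v), r, c) for r, row in enumerate(matrix) for c, v in enumerate(row)
    )
    pruned = [list(row) for row in matrix]
    for _, r, c in ranked[:num_to_remove]:
        pruned[r][c] = 0.0
    return pruned


def prune_rows(matrix, rows_to_remove):
    """Zero whole rows with the smallest L1 norm (a regular, structured pattern)."""
    order = sorted(range(len(matrix)), key=lambda r: sum(abs(v) for v in matrix[r]))
    drop = set(order[:rows_to_remove])
    return [[0.0] * len(row) if r in drop else list(row)
            for r, row in enumerate(matrix)]


def retained_magnitude(matrix):
    """Sum of absolute values that survive pruning (assumed proxy for information)."""
    return sum(abs(v) for row in matrix for v in row)


def count_zeros(matrix):
    return sum(1 for row in matrix for v in row if v == 0.0)


# Hypothetical 4x4 weight matrix: a few large weights scattered among small ones.
weights = [
    [0.90, 0.01, 0.02, 0.80],
    [0.03, 0.70, 0.02, 0.01],
    [0.60, 0.02, 0.50, 0.03],
    [0.01, 0.04, 0.02, 0.90],
]

unstructured = prune_unstructured(weights, 8)  # remove 8 of 16 entries
structured = prune_rows(weights, 2)            # remove 2 of 4 rows (also 8 entries)

print(retained_magnitude(unstructured), retained_magnitude(structured))
```

In this toy case both variants zero 8 of the 16 weights, but the row-wise variant also discards two of the large weights, which mirrors the accuracy loss described above.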
Regularized redundancy elimination is useful for obtaining low-latency NN
processing and for the adoption of NNs in many applications, especially on IoT and
smart-edge devices, where reaching a certain well-defined level of accuracy is often
sufficient [Zhou et al., 2016, Hubara et al., 2016, Wei et al., 2017, Lu et al., 2017]. However,
many important real-world applications of Machine Learning (ML), e.g., safety-critical
applications such as weapon and flight control and mission-critical applications such as
online banking and power grids, have extremely strict accuracy requirements. Therefore,
accelerators that can efficiently process optimal models with irregular computation and data
patterns without hurting the accuracy are highly desired. In this dissertation, we focus on
creating accelerators friendly to optimal and irregular models. As mentioned above, many
different types of platforms can be used to accelerate NNs. We use FPGAs because their
great flexibility is highly desirable for handling irregularities in both computation and
communication.
We now describe common problems in NN processing and classify them into four types
of irregularities:
1. Model-level irregularity is caused by the use of different structures and configurations at
different layers. For most NN models, different layers have different numbers of channels
and sizes of feature maps and filters; they therefore require different numbers of arithmetic
operations. Model-level irregularity matters most in cluster-level NN processing, e.g., in
large-scale NN training.
2. Bit-level irregularity is caused by the use of different data widths at different layers
and even channels. As different channels and layers are in charge of extracting features
with different levels of importance, optimal models can use different numbers of bits to
represent features and parameters at different channels and layers.
3. Operation-level irregularity is caused by the data-dependent pruning opportunities of
superfluous operations in NN processing. Operation-level irregularity frequently occurs
at the activation functions of all types of NNs (which use threshold-based activation).
For most NNs, the value of a certain neuron is partially (for non-quantized NNs) or
completely (for QNNs and BNNs) determined by comparing the accumulation of all
dot-products of the edges linked to this neuron with thresholds that have been determined
during training. For non-quantized NNs, the accumulation results are normally compared to 0
at the Rectified Linear Unit (ReLU) function for feature activation. For QNNs and BNNs,
the accumulation results are activated by being compared to a series of thresholds at
Quantization and Binarization functions. Our observation is as follows: we can immediately
cease further computation of the dot-product and return the proper result as soon as
the outcome of the threshold-based comparison is determined. As the skipped operations do not
contribute to the feature results, they are superfluous and hence can be skipped/pruned
during processing without hurting the accuracy. Note that the existence of these superfluous
operations is data-dependent and irregular; this is what introduces operation-level
irregularity to NN processing. Besides activation functions, operation-level irregularity also
exists at quantized pooling functions; this is discussed in detail in Chapter 5. As operation-level
irregularity is data-dependent and cannot be predicted offline, it must be handled at runtime
during inference.
4. Data-structure-level irregularity is caused by the irregular non-zero distribution of the
sparse feature and parameter matrices and results in workload imbalance in NN processing.
This type of irregularity is well-known and has been well-researched in DNN acceleration.
The emergence of Graph Neural Networks (GNNs) makes it even more imperative to create
new architectures that can efficiently handle data-structure-level irregularity. The reasons
are as follows: (1) the processing of GNNs, due to the inherent characteristics of graph
matrices, has much more drastic data-structure-level irregularity than general DNNs; (2)
unlike the parameter matrices in general DNNs, which have the potential to be regular-
ized during training with acceptable accuracy loss, the regularization of GNNs’ adjacency
matrices normally hurts the accuracy significantly.
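The observation in item 3 can be sketched in Python for a single binarized neuron. This is a minimal software illustration, assuming inputs and weights encoded as 0/1 bits, XNOR as the multiplication, and an output of 1 when the match count reaches the threshold; the function names are hypothetical and the sketch does not model the hardware proposed in this dissertation.

```python
def bnn_neuron_reference(inputs, weights, threshold):
    """Evaluate every edge, then compare the accumulated count to the threshold."""
    count = sum(1 for x, w in zip(inputs, weights) if x == w)  # XNOR + POPCOUNT
    return 1 if count >= threshold else 0


def bnn_neuron_with_pruning(inputs, weights, threshold):
    """Stop as soon as the threshold comparison can no longer change.

    Returns (output_bit, edges_evaluated).
    """
    n = len(inputs)
    count = 0
    for k, (x, w) in enumerate(zip(inputs, weights), start=1):
        count += 1 if x == w else 0   # XNOR of one edge, accumulated
        remaining = n - k             # edges not yet evaluated
        if count >= threshold:
            return 1, k               # already reached: output must be 1
        if count + remaining < threshold:
            return 0, k               # unreachable: output must be 0
    return (1 if count >= threshold else 0), n


# Hypothetical 8-edge neuron.
inputs = [1, 0, 1, 1, 0, 0, 1, 0]
weights = [1, 0, 0, 1, 0, 1, 1, 1]

print(bnn_neuron_with_pruning(inputs, weights, 3))  # low threshold
print(bnn_neuron_with_pruning(inputs, weights, 7))  # high threshold
```

How many edges are skipped depends on the data and on the order in which edges are evaluated, which is why these opportunities cannot be scheduled offline.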
In this dissertation, we address these four types of irregularities with FPGAs by config-
uring their logic to match the irregular problem types. In particular, we create four novel
architectures that can efficiently handle the four types of irregularities respectively, so that
the optimal models with significant irregularities can be accelerated with almost perfect
hardware efficiency without loss of accuracy. Our thesis is that high-accuracy and
high-performance NN inference and training can be achieved by creating a series of novel
irregularity-aware architectures for FPGAs.
The four irregularity-aware architectures created in this dissertation are summarized as
follows:
1. AWB-GCN (Autotuning Workload Balancing Graph Convolutional Network) archi-
tecture supports runtime workload distribution autotuning to handle Data-Structure-Level
(DSL) irregularity.
Sparse Matrix Multiplication (SpMM) and sparse matrices are, respectively, the major
computational kernel and data structure used in the processing of modern NNs. The irregular
distribution of non-zero elements in the weight, feature, and adjacency matrices of NN models
(i.e., DSL irregularity) makes efficient acceleration challenging. DSL irregularity exists widely
in general DNNs such as Sparse CNNs (SCNNs). Many well-known accelerators that
efficiently accelerate SCNNs have been developed in the past decade [Han et al., 2016,
Zhang et al., 2016b, Kim et al., 2017]. GNNs, however, differ fundamentally from SCNNs:
the matrices are normally much larger, much sparser, and follow power-law distributions,
which leads to much more serious DSL irregularity. SCNN accelerators, therefore, are not
efficient when applied to GNNs. More powerful architectures which can handle GNN-level
DSL irregularity are much needed.
In this dissertation, we create the AWB-GCN architecture, which can efficiently process
GNNs and address the issues introduced by GNNs' serious DSL irregularity. We
take the Graph Convolutional Network (GCN) as a case study, as GCN is one of the most
popular GNNs and has the same computational kernels as the other GNNs. AWB-GCN
supports runtime workload autotuning and can process extremely irregular sparse matrices
with over 90% system utilization on FPGAs. Furthermore, AWB-GCN guarantees high
off-chip bandwidth utilization, which is critical for graph processing. AWB-GCN’s ex-
cellent performance on GNN-level DSL irregularity also implies that it can perform even
better when processing other types of NNs having less significant irregularities.
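The effect of DSL irregularity on utilization can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the row lengths are hypothetical, each processing element (PE) is assumed to need time proportional to the non-zeros it is given, and all PEs are assumed to wait for the slowest one. The greedy rebalancing shown for comparison is a generic offline heuristic, not the runtime autotuning of AWB-GCN.

```python
import heapq


def utilization(loads):
    """Fraction of PE time spent on useful work when all PEs wait for the slowest."""
    return sum(loads) / (len(loads) * max(loads))


def static_partition(row_nnz, num_pes):
    """Give each PE an equal-sized contiguous block of rows; return per-PE non-zeros."""
    block = (len(row_nnz) + num_pes - 1) // num_pes
    return [sum(row_nnz[i * block:(i + 1) * block]) for i in range(num_pes)]


def greedy_balance(row_nnz, num_pes):
    """Assign rows, heaviest first, to the currently least-loaded PE."""
    heap = [(0, pe) for pe in range(num_pes)]
    for nnz in sorted(row_nnz, reverse=True):
        load, pe = heapq.heappop(heap)
        heapq.heappush(heap, (load + nnz, pe))
    return [load for load, _ in sorted(heap, key=lambda item: item[1])]


# Hypothetical power-law-like matrix: row i holds about 64 / (i + 1) non-zeros.
row_nnz = [max(1, 64 // (i + 1)) for i in range(16)]

static_loads = static_partition(row_nnz, 4)
balanced_loads = greedy_balance(row_nnz, 4)

print(static_loads, round(utilization(static_loads), 2))
print(balanced_loads, round(utilization(balanced_loads), 2))
```

With equal row counts per PE, the PE that receives the few heavy rows dominates the runtime while the others idle. Even the rebalanced assignment remains capped by the single heaviest row as long as rows are treated as indivisible units.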
2. O3BNN-R (Out-of-Order scheduling-based Binarized Neural Networks) architec-
ture supports runtime detection and efficient pruning of operational redundancy to handle
Operation-Level irregularity.
Operation-level irregularity emerges from the random pruning opportunities of super-
fluous operations, which have to be found at runtime. The efficient harvesting of these
opportunities is challenging, as they are irregular, data-dependent, and strongly dependent
on the specific evaluation order of the edges. Exploiting these opportunities requires the de-
sign to be extremely flexible and dynamic. It is even more difficult to skip these operations
without hurting the overall system efficiency. Many hardware designs have been proposed
in the past decade to handle the operation-level irregularity in non-quantized DNNs [Song
et al., 2018, Akhlaghi et al., 2018] by adopting heavy hardware predictors and dynamic
parameter distribution systems. However, for lightweight NNs, especially BNNs (which
are mainly based on bit-level operations and have the most significant operation-level ir-
regularity), these proposed hardware solutions introduce too much hardware overhead.
We address operation-level irregularity by proposing the O3BNN-R architecture which
can, at runtime, detect and skip most of the superfluous operations with almost perfect
utilization and negligible hardware overhead. We use BNNs as a case study to demonstrate
the efficiency of O3BNN-R, as BNNs have the most fine-grained and irregular operational
redundancy and require the most lightweight hardware to handle the irregularity. O3BNN-
R only skips superfluous operations. Therefore, it will not hurt the models’ accuracy. Note
that the design of O3BNN-R is useful in general to handle operation-level irregularity in
different types of NNs.
3. CQNN (CGRA-based QNN) architecture supports Coarse-Grained Reconfigurable Ar-
chitecture (CGRA)-based run-time data-precision conversion to handle bit-level irregular-
ity.
The use of different data-widths at different layers (and even channels) is helpful in
improving the efficiency of NN models. For instance, this flexibility enables models to
meet the strict latency and accuracy requirements of real-world applications. However,
it also introduces bit-level irregularity to NN processing. It is challenging to design
an accelerator that is able to perform any combination of mixed-data-width (also called
mixed-precision) operations efficiently. Most existing architectures are designed for networks
with specific configurations [Wang et al., 2018]. These architectures have poor flexibility.
Some other accelerators are designed more generally, e.g., to support programming
by users for different NNs [Jouppi et al., 2017]. These designs provide good flexibility, but
often lose efficiency due to their general architectures.
In this dissertation, we propose a novel CGRA-based QNN acceleration framework,
CQNN, to handle bit-level irregularity in mixed-precision QNNs. Taking advantage of
the CGRA architecture, CQNN provides both high performance and good flexibility in
the processing of mixed-precision QNNs and handles the bit-level irregularity well. By
programming CQNN at runtime, the CQNN processing elements can
be dynamically reconfigured to best match the QNN models being processed.
4. FPDeep, an FPGA-cluster-based system, delivers high-performance and high-accuracy
training and can handle model-level irregularity.
Model-level irregularity mainly matters in cluster-level NN processing, most of which
is used for large-scale training. Scaling DNN training to larger clusters is generally done by
distributing tasks in batch mode using methods such as distributed synchronous Stochastic
Gradient Descent (SGD). Among the issues with this approach is that, to make the dis-
tributed cluster work with high utilization, the workload distributed to each node must be
large; this implies nontrivial growth in the SGD mini-batch size, which decreases the con-
vergence rate and test accuracy. Certain methods can somewhat reduce this loss of accuracy
– e.g., using dynamic batch sizes and fine-tuning the learning rate – but they do not solve
the problem [Goyal et al., 2017]. To solve this problem fundamentally, model/layer parallelism
has started to draw considerable attention in both academia and industry [Huang et al.,
2019, Narayanan et al., 2019, Wang et al., 2020, Geng et al., 2018], as it enables small-batch
training on large-scale systems. However, the model-level irregularity of neural networks
makes it challenging to map DNN training logic onto distributed systems with balanced
workloads and further makes it difficult to achieve high system efficiency. A systematic
approach that can handle the model-level irregularity and efficiently train neural networks
in a distributed manner on a large-scale system is highly desired.
We propose a novel FPGA-cluster-based training framework, FPDeep, which handles
the model-level irregularity efficiently and delivers high-accuracy NN training with great
scalability and high throughput. In particular, FPDeep adopts a hybrid of layer and model
parallelism together with a number of new workload/weight balancing strategies. Each
device computes only certain layers or a part of a single layer; each device is optimized
independently with respect to its own computation. The cluster is a single fine-grained
pipeline, so the batch size can be arbitrarily small. All computational nodes run at almost
their peak theoretical performance. The throughput of the proposed design scales linearly.
Because small-batch training does not suffer from test accuracy decay or a decreased
convergence rate, the higher the throughput the system can deliver, the faster training becomes.
The overall contributions of this dissertation are as follows:
• Define Problem: we characterize common problems in NN processing and classify
them into four types of irregularities.
• New Architectures: we propose a set of new HW architectures that improve
performance by handling these four types of irregularities.
• Performance: we demonstrate that the proposed methods improve
performance dramatically (2.1× to 3255× compared with carefully selected prior
art).
• Advance a computing technology for NNs: In some of these applications (large
scale DNN training and floating-point-based GNN inference) we have demonstrated,
where it was not known before, that FPGAs are competitive or superior to other
computing technologies.
The rest of the dissertation is organized as follows. We start with the background and
preliminary work in Chapter 2. After that, the next 5 chapters follow roughly the outline of
the four architectures to address the four levels of irregularity in NN processing. Chapter
3 introduces the architecture of AWB-GCN which handles data-structure-level irregular-
ity. Chapter 4 introduces the architecture of LP-BNN which is a preliminary version of
O3BNN-R. Chapter 5 presents the architecture of O3BNN-R and how it handles operation-
level irregularity. Chapter 6 describes the architecture of CQNN and its solution to bit-level
irregularity. Chapter 7 discusses the design of FPDeep which handles model-level irreg-
ularity in DNN training. Chapter 8 introduces CSB-RNN, an RNN accelerator based on
model regularization and hardware accelerator co-design. By presenting CSB-RNN, we
demonstrate that, although hardware-only solutions that are friendly to irregular models
are promising and can be used in a wide variety of applications, model regularization can
still be a valuable approach in certain scenarios. Finally, Chapter 9 summarizes the entire




In this chapter, we provide background on various types of NNs, on FPGAs, and on the use
of FPGAs in NN acceleration. We first introduce the structure and characteristics of GCNs,
which are used to evaluate the AWB-GCN architecture. We then provide preliminaries on
BNNs and QNNs, which are used to evaluate O3BNN-R and CQNN, respectively. We also
discuss the background of CNN training, which is used in the evaluation of FPDeep. Finally,
we introduce the architecture of FPGAs and the use of FPGAs in NN acceleration.
2.1 GCN Structure and Characteristics
This section briefly introduces the GCN algorithm and discusses the data characteristics of
power-law graphs.
2.1.1 Graph Convolutional Network Structure
Equation 2.1 shows the layer-wise forward propagation of a multi-layer spectral GCN [Wu
et al., 2020b, Kipf and Welling, 2016]:
$$X^{(l+1)} = \sigma\left(A X^{(l)} W^{(l)}\right) \tag{2.1}$$
A is the graph adjacency matrix with each row delineating the connection of a vertex with
all the other vertices in the graph. $X^{(l)}$ is the matrix of input features in layer-l; each column
of $X^{(l)}$ represents a feature while each row denotes a node. $W^{(l)}$ is the weight matrix of layer-l.
σ(·) denotes the non-linear activation function, e.g., ReLU [Krizhevsky et al., 2012]. In
general A needs to be normalized: $\tilde{A} = D^{-\frac{1}{2}} \times (A + I) \times D^{-\frac{1}{2}}$, where I is the identity matrix
and $D_{ii} = \sum_j A_{ij}$. The reason is that, without normalization, multiplying the feature vector
$X^{(l)}$ by A will change its scale: those nodes with more neighbors tend to have larger values
under feature extraction. Note that during both training and inference of GCN, $\tilde{A}$ remains
constant. Since $\tilde{A}$ can be computed offline from A, in the remainder of this dissertation we
use A to denote the normalized $\tilde{A}$. In general A is multiplied only once per layer. However,
when multi-hop neighbor information is to be collected, A can be multiplied twice or more
(i.e., $A^2$, $A^3$, etc.) per layer.
Equation 2.1 is derived from graph signal processing theory: convolutions on a graph
can be converted to a multiplication of a signal $x \in \mathbb{R}^N$ (i.e., a scalar for each node) and a
filter $g \in \mathbb{R}^N$ in the frequency domain via the Fourier transform:
$$\mathrm{CONV}(g, x) = \mathcal{F}^{-1}\left(\mathcal{F}(x) \odot \mathcal{F}(g)\right) = U\left(U^T x \odot U^T g\right) \tag{2.2}$$
where $\odot$ denotes the Hadamard product. U is a collection of eigenvectors for the normalized
graph Laplacian $L = I - D^{-\frac{1}{2}} A D^{-\frac{1}{2}} = U \Lambda U^T$. The diagonal matrix Λ comprises the
eigenvalues. If a frequency domain filter $g_W = \mathrm{diag}(W)$ is defined, then Equation 2.2 can
be simplified [Bruna et al., 2014] as:
$$\mathrm{CONV}(g_W, x) = U g_W U^T x \tag{2.3}$$
Equation 2.3 can be further simplified by defining the filter as the Chebyshev polynomi-
als of the diagonal matrix Λ [Defferrard et al., 2016, Kipf and Welling, 2016] to obtain
Equation 2.1.
Figure 2·1 illustrates the structure of a Graph Convolutional layer (GCONV). Each
GCONV layer encapsulates the hidden features of nodes by aggregating information from
neighbors of nodes. By multiplying A and $X^{(l)}$, information from 1-hop connected neighboring
nodes is aggregated. By multiplying $AX^{(l)}$ with $W^{(l)}$, and going through the non-linear
activation function σ(·), we obtain the output of this layer, which is also the feature
matrix for the next layer $X^{(l+1)}$. The matrix A will normally be the same in different layers.
After multiple layers, the GCN is able to extract very high-level abstracted features for
various learning purposes.
Figure 2·1: Illustration of a GCONV layer in GCNs.
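The layer computation of Equation 2.1 can be illustrated with a minimal NumPy sketch. It is not the implementation used in this dissertation: the function names `normalize_adjacency` and `gconv_layer` and the toy 4-node path graph are illustrative assumptions, and the degree matrix is taken over A + I (the common renormalization choice) so that isolated nodes do not cause a division by zero.

```python
import numpy as np

def normalize_adjacency(adj: np.ndarray) -> np.ndarray:
    """Return D^{-1/2} (A + I) D^{-1/2}.

    Assumption: D holds the row sums of A + I, so every degree is >= 1.
    """
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gconv_layer(a_norm: np.ndarray, x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """One GCONV layer: X^(l+1) = ReLU(A X^(l) W^(l))."""
    return np.maximum(a_norm @ x @ w, 0.0)

# Toy graph (hypothetical): 4 nodes connected in a path 0-1-2-3.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)

rng = np.random.default_rng(0)
x0 = rng.random((4, 3))            # 4 nodes, 3 input features
w0 = rng.standard_normal((3, 2))   # 3 input features -> 2 output features

a_norm = normalize_adjacency(adj)  # computed once, reused by every layer
x1 = gconv_layer(a_norm, x0, w0)
print(x1.shape)
```

Because the normalized adjacency matrix is constant, it is computed once outside the layer function, mirroring the offline normalization described above.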
2.1.2 Characteristics of Power-Law Graphs
Real-world graphs in many critical domains typically follow the power-law distribution
[Xie et al., 2014, Aiello et al., 2001, Chung et al., 2004, Adamic et al., 2001], which states
that the number of nodes y of a given degree x is proportional to x−β for a constant β > 0.
This implies that in the adjacency matrix A, a small number of rows (or columns) include
the majority of non-zeros whereas the majority of the rows (or columns) contain only a few
non-zeros but are not empty. Figure 2·2 shows the distribution of non-zero elements for the
five publicly available datasets that are widely used for GCN evaluation [Kipf and Welling,
2016]. The power-law effect is prominent for Cora, Citeseer, Pubmed and Nell.
Table 2.1 lists the density and dimension of matrices in the five GCN datasets used in
this chapter. Note that adjacency matrix A is always very sparse (≥ 99%). Matrix X is
also usually sparse. For the first layer, the sparsity (X1) is usually larger than 90%. As the
weight matrix W is dense, the output of AXW is also dense. However, because of the ReLU
activation function, the final output X2 (also the input of the next layer) becomes sparse
but with sparsity usually less than 50%. The sizes of the matrices in GCNs depend on the
dataset and can range from thousands to millions or more. A can be extremely large and is
stored in a sparse format.
Figure 2·2: Non-zero distribution imbalance of Adjacency matrices in
Cora, Citeseer, Pubmed, Nell and Reddit datasets.
Table 2.1: Matrix density and dimensions of 5 widely-used GCN datasets.
                     CORA     CITESEER   PUBMED    NELL       REDDIT
Density    A         0.18%    0.11%      0.028%    0.0073%    0.21%
           W         100%     100%       100%      100%       100%
           X1        1.27%    0.85%      10.0%     0.011%     100%
           X2        78.0%    89.1%      77.6%     86.4%      63.9%
Dimension  Node      2708     3327       19717     65755      232965
           Feature   1433     3703       500       61278      602
2.2 Binarized Neural Network and Quantized Neural Network Structure
BNNs are an extreme case of QNNs and evolved from conventional CNNs through Bi-
narized Weight Networks (BWN) [Courbariaux et al., 2015] with the observation that if
the weights were binarized to 1 and −1, expensive FP multiplications could be replaced
with additions and subtractions. It was next observed that if both weights and inputs were
binarized, then even the 32-bit additions and subtractions could be demoted to logical bit
operations; XNOR-Net was proposed and has become one of the most researched BNNs. In
XNOR-Net, both the weights and the inputs of the convolutional and fully connected layers
(except the first layer) are approximated with binary values, allowing efficient implemen-
tation of convolutional operations via exclusive-NOR (XNOR) and bit-counting [Rastegari
et al., 2016,Courbariaux et al., 2016]. In this dissertation, we focus on XNOR-Net and use
the terminology from XNOR-Net [Rastegari et al., 2016].
Figure 2·3: The structures of BNNs with original BN in [Rastegari et al.,
2016]
In their basic structure BNNs have four essential functions in each CONV/FC layer:
XNOR, Population Count (POPCOUNT), Batch Normalization (BN), and Binarization
(BIN) (see Figure 2·3). The weights, inputs, and outputs are binary so multiply-accumulate
in traditional DNNs becomes XNOR and POPCOUNT in BNNs. The output of POP-
COUNT is normalized in BN as this is compulsory for obtaining high accuracy with BNNs.








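$$y_{i,j} = \frac{x_{i,j} - \mathrm{E}[x_{*,j}]}{\sqrt{\mathrm{Var}[x_{*,j}] + \epsilon}} \cdot \gamma_j + \beta_j \tag{2.4}$$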
The normalized outputs from BN (i.e., yi, j), which are FP, are binarized in BIN by
comparing with 0:




$$\mathrm{BIN}(x) = \begin{cases} 1 & \text{if } x \geq 0 \\ -1 & \text{otherwise} \end{cases} \tag{2.5}$$
Here, BIN acts as the nonlinear activation function. Max pooling can be required. Traditionally,
pooling is between BN and BIN. It can be shown, however, that this is equivalent
to placing pooling after BIN; thus the FP operations in pooling become bit-OR operations,
significantly reducing computation complexity. Note that the complex BN and BIN functions
can be merged into a threshold-based comparison operation, which significantly reduces
the computation complexity of BNNs. We will discuss this optimization in detail in Chapters 4
and 5.
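Two of the ideas above lend themselves to a short Python sketch: replacing a ±1 dot product with XNOR and POPCOUNT, and merging BN and BIN into one threshold comparison. The sketch is only illustrative and is not the hardware design of later chapters. The function names, the bit packing (bit 1 encodes +1, bit 0 encodes −1), and the scalar BN parameters are assumptions. For γ > 0 the merged test is x ≥ μ − β·sqrt(σ² + ε)/γ, and the inequality flips for γ < 0.

```python
import math

def pack(vec):
    """Pack a {-1,+1} vector into an integer (bit i set means vec[i] == +1)."""
    bits = 0
    for i, v in enumerate(vec):
        if v > 0:
            bits |= 1 << i
    return bits

def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two packed length-n {-1,+1} vectors via XNOR + POPCOUNT."""
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask      # 1 wherever the two signs agree
    popcount = bin(xnor).count("1")
    return 2 * popcount - n               # agreements minus disagreements

def bn_then_bin(x, mean, var, gamma, beta, eps=1e-5):
    """Reference path: floating-point BN followed by BIN (compare with 0)."""
    y = (x - mean) / math.sqrt(var + eps) * gamma + beta
    return 1 if y >= 0 else -1

def threshold_bin(x, mean, var, gamma, beta, eps=1e-5):
    """Merged path: one comparison against a precomputed threshold (gamma != 0)."""
    tau = mean - beta * math.sqrt(var + eps) / gamma
    if gamma > 0:
        return 1 if x >= tau else -1
    return 1 if x <= tau else -1          # negative scale flips the inequality

a = [1, -1, 1, 1]
b = [1, 1, -1, 1]
dot_ref = sum(p * q for p, q in zip(a, b))
dot_bin = binary_dot(pack(a), pack(b), len(a))
print(dot_ref, dot_bin)
```

Since the threshold depends only on trained constants, it can be computed offline, leaving a single integer comparison per output at inference time.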
The structure of BNNs can be easily extended to QNNs. As QNNs use multiple bits
to represent features and parameters, the XNOR and POPCOUNT functions in BNNs become
low-precision Multiplication (QMUL) and Accumulation (QACC); the BIN function
in BNNs becomes Quantization (QUANT). The functions of QNNs are all based on operations
with a limited number of bits, e.g., 1-8. Figure 2·4 illustrates the QNN structure. Similar
to BNNs, the QUANT and BN functions of QNNs can also be merged into threshold-based
BN (QT-BN). Unlike in BNNs, QT-BN in a q-bit QNN layer contains at least
($2^q − 1$) thresholds per channel and requires at least q comparisons to quantize an
output feature. All these thresholds can be decided during training. Therefore, they can be
treated as constants in the acceleration of inference [Blott et al., 2018, Lam et al., 2020].
As QNNs and BNNs share similar structures, in CQNN (see Chapter 6), we decompose
QNN operations into multiple BNN operations and try to realize QNN hardware modules
by integrating binary components.
Figure 2·4: A simple 3-CONV-1-FC QNN network structure. It is similar
to DNN, except that Activation acts as QUANT. QNN model structure
can be further optimized. Floating-point BN and QUANT functions can be
merged into QT-BN with multiple thresholds.
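The threshold count and comparison count quoted for QT-BN can be made concrete with a small Python sketch. It is an illustration under stated assumptions, not the CQNN design: the thresholds are hypothetical evenly spaced values (real ones would come from training and the BN parameters), the BN scale is assumed positive so that thresholds are increasing, and the helper names are invented here. Binary search over 2^q − 1 sorted thresholds needs at most q comparisons.

```python
from bisect import bisect_right

def make_thresholds(q: int, lo: float, hi: float):
    """Hypothetical, evenly spaced thresholds for one channel: 2**q - 1 of them."""
    levels = 2 ** q
    step = (hi - lo) / levels
    return [lo + step * (i + 1) for i in range(levels - 1)]

def quantize(x: float, thresholds) -> int:
    """q-bit output level of x: the number of thresholds that are <= x.

    bisect_right performs a binary search, i.e. about q comparisons
    for a sorted list of 2**q - 1 thresholds.
    """
    return bisect_right(thresholds, x)

thresholds_2bit = make_thresholds(q=2, lo=0.0, hi=4.0)
levels = [quantize(x, thresholds_2bit) for x in (0.5, 1.0, 2.4, 3.7)]
print(thresholds_2bit, levels)
```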
2.3 Distributive CNN Training
The computations for CNN training are shown in Fig. 2·5. The red datapath shows
Forward-Propagation (FP). It calculates the errors of output features in the final layer. Starting
with an input image (Cat), neurons in each layer are evaluated with parameters $Pa_i$.
Errors are calculated by comparing inference results to the label in the training dataset.
Backward-Propagation (BP) has two sub-steps: Error Back-propagation (EB, green) and
Parameter Gradient (PG, orange). In EB, errors are back-propagated through the network. In PG,
errors of each layer are used to calculate gradients of the weights ($\partial Err / \partial Pa_i$). The convolution
kernels are called parameters; the temporal convolution results are called activations. The
notation used in Chapter 7 is shown in Table 2.2.
Fig. 2·6 zooms in on the FP, EB, and PG operations of two layers. As shown in Tab.
2.2, we use A, Pa, dP, and E to represent Activations, Parameters, Differentials of the
Parameters, and Errors, respectively. The relationship among these in CNN training is
shown in Eqns. 2.6, 2.7, 2.8 and 2.9.
Figure 2·5: Illustration of computations involved in CNN training including















Figure 2·6: Execution details of forward and backward propagation with
zoom-in on adjacent CONV layers, operations, and data dependencies.
1. For FP the activation of layer l’s channel c is generated by summing all related
convolution results of layer l−1’s activations and layer l’s parameters (Eqn. 2.6).
2. For EB the error of layer l’s channel c is generated by summing all related convolu-
tion results of layer l +1’s errors and layer l’s parameters (Eqn. 2.7).
3. For PG the differentials of layer l’s parameters are the convolution results of this
layer’s error and the previous layer’s activation (Eqn. 2.8). The differentials of the param-







$$dP[b][l][p][q] = A[b][l-1][q] * E[b][l][p] \tag{2.8}$$
$$Pa[b+1][l][p][q] = dP[b][l][p][q] + Pa[b][l][p][q] \tag{2.9}$$
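As a hedged illustration of Eqns. 2.8 and 2.9, the NumPy sketch below computes the parameter differentials of one CONV layer and applies an update. Several choices are assumptions rather than statements about FPDeep: the operator ∗ is taken to be a "valid" 2D cross-correlation, the error is the derivative of a squared-error loss against a made-up target, and dP is assumed to already include the negative learning rate so that the update is a plain addition as in Eqn. 2.9. All function names are illustrative.

```python
import numpy as np

def corr2d_valid(image, kernel):
    """'Valid' 2D cross-correlation of one channel with one kernel."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def forward(act_prev, params):
    """FP for one layer: act_prev is (IC, W, W), params is (OC, IC, K, K)."""
    oc, ic = params.shape[:2]
    return np.stack([
        sum(corr2d_valid(act_prev[q], params[p, q]) for q in range(ic))
        for p in range(oc)
    ])

def parameter_gradient(act_prev, err):
    """PG: grad[p][q] = A[l-1][q] * E[l][p], one K x K block per channel pair."""
    return np.stack([
        np.stack([corr2d_valid(act_prev[q], err[p]) for q in range(act_prev.shape[0])])
        for p in range(err.shape[0])
    ])

def loss(act_prev, params, target):
    """Squared-error loss, used here only to define and check the error term."""
    return 0.5 * np.sum((forward(act_prev, params) - target) ** 2)

rng = np.random.default_rng(0)
act_prev = rng.standard_normal((2, 5, 5))    # IC = 2, W = 5
params = rng.standard_normal((3, 2, 3, 3))   # OC = 3, K = 3
target = rng.standard_normal((3, 3, 3))

err = forward(act_prev, params) - target     # E of this layer for the loss above
grad = parameter_gradient(act_prev, err)

learning_rate = 1e-3                         # hypothetical step size
d_p = -learning_rate * grad                  # assumption: dP carries sign and step size
params_next = params + d_p                   # Eqn. 2.9

print(loss(act_prev, params, target), loss(act_prev, params_next, target))
```

The loop-based correlation keeps the example dependency-free; it is meant to show the data dependencies among A, E, dP, and Pa, not to be efficient.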







dP[b][l][p][q]   Differential of the parameter [batch-id][layer-id][output-channel-id][input-channel-id]
E[b][l][c]       Error of the layer [batch-id][layer-id][channel-id]
W                Size of the activation
K                Size of the convolution kernel
IC               Number of the input channels
OC               Number of the output channels
2.4 FPGAs In Neural Network Acceleration
2.4.1 FPGA Architecture
FPGAs are integrated circuits that can be freely configured by users after manufacturing.
They are also the most widely used reconfigurable devices in the world due to their high
performance, low power, and reconfigurability, as well as their established integration in the
communication stack, both for routers [Bolaria and Byrne, 2009] and for network-facing
components [Caulfield et al., 2016, Eran et al., 2019].
An FPGA normally consists of five types of hardware resources for efficient computation
and communication [Hauck and DeHon, 2008].
1. DSP units are generally used to perform high-precision and high-performance multipli-
cation, addition, and multiply-accumulation operations.
2. Look-Up-Tables (LUTs) can be used to provide both flexible computation and high-
concurrency data storage.
3. Flip-Flops (FFs) are used as registers.
4. Block RAMs (BRAMs) provide tens of MByte of on-chip storage.
5. Multi Gigabit Transceivers (MGTs) provide efficient inter-FPGA communication.
Each FPGA chip normally has hundreds of high-bandwidth (over 20Gb/s for each MGT)
and low-latency MGTs as I/Os. These I/Os can be directly connected to other on-chip
resources.
These hardware resources are embedded in a hierarchical and programmable on-chip
interconnect network, so that users can freely integrate these resources to match their target
problems by programming the interconnect network. FPGAs’ flexibility has made them a
competitive platform to be widely used in High-Performance Computing (HPC) and ML
acceleration [Gokhale and Graham, 2005, Herbordt et al., 2007, Herbordt et al., 2008, Van-
Court and Herbordt, 2009, Benkrid and Vanderbauwhede, 2013].
FPGAs are often compared with GPUs, the most widely used acceleration devices.
Although very different conceptually, in practical use FPGAs often resemble GPUs: the
hardware resources of FPGAs are normally integrated into a massive number of parallel
computing units, as are GPUs’ Streaming Multiprocessors. In FPGAs, each computing
unit consists of computation pipelines realized with LUTs, DSPs, and FFs, and local mem-
ories realized with BRAMs and LUTs. BRAMs can also be used as global scratchpad
memory which is shared by all computing units, as are the caches in GPUs. However, from
another perspective, FPGAs are also very different from GPUs and have a number of ad-
vantages. The computing units of FPGAs can be customized to perfectly match the specific
target problems and therefore have the potential to work with almost 100% efficiency; plus,
FPGAs have a flexible and customizable interconnect, so that the computing units can be
freely connected. There is no restriction on the inter-computing unit communication, other
than the physical limitations of the interconnect, which makes FPGAs a good choice for
handling irregularity problems.
Given FPGAs’ unmatched communication support it is natural that they be configured
directly into FPGA-centric clusters. Much research has gone into modeling and building
such clusters, including specifying interconnect types, e.g. whether direct (FPGA-FPGA)
or indirect (through a router) [Sheng et al., 2015, Sheng et al., 2016, Sheng et al., 2017b,
Sheng et al., 2017a, Putnam, 2014, Plessl, 2018, Boku et al., 2019].
2.4.2 FPGA in Neural Network Acceleration
Due to the aforementioned benefits, FPGAs play critical roles in HPC and NN acceler-
ation [VanCourt and Herbordt, 2007, VanCourt and Herbordt, 2006c, VanCourt and Her-
bordt, 2005a, VanCourt and Herbordt, 2005b, VanCourt and Herbordt, 2004, Sanaullah
et al., 2018c, Sanaullah et al., 2018a]. While not yet having the penetration of GPUs in
HPC, FPGAs have shown the potential to be a part of the next-generation HPC systems.
Researchers have successfully demonstrated the efficiency and benefits of FPGAs in many
important scientific computing applications such as Molecular Dynamics [VanCourt et al.,
2004, VanCourt and Herbordt, 2006b, Sukhwani and Herbordt, 2008, Sukhwani and Her-
bordt, 2009,Chiu et al., 2008,Chiu and Herbordt, 2009,Chiu and Herbordt, 2010,Chiu et al.,
2011, Yang et al., 2019a, Yang et al., 2019b, Yang et al., 2017a, Wu et al., 2020a], Adaptive
Mesh Refinement [Wang et al., 2019c, Wang et al., 2019d], and Algebraic Multigrid [VanCourt
and Herbordt, 2006a, Haghi et al., 2020a], as well as security [Wolfe et al., 2020, Patel
et al., 2020]. The programmability of FPGAs is often believed to be an obstacle to the
wide adoption of FPGAs in real HPC systems. However, researchers have demonstrated
that this can be addressed with better design tools [Sanaullah et al., 2018b, Sanaullah and
Herbordt, 2018a, Sanaullah and Herbordt, 2018b, Herbordt, 2019] and middleware [Haghi
et al., 2020c, Haghi et al., 2020b, Xiong et al., 2020].
As for NN acceleration, FPGAs have become the most commonly used device in both
research prototyping and real-world adoption of NN inference. As introduced in Chapter 1,
NN acceleration can be performed with model regularization. Researchers have proposed
various types of model regularization approaches for CNNs and RNNs and have efficiently
accelerated the regularized models with FPGAs [Ding et al., 2017, Shi et al.,
2020, Wang et al., 2018a, Li et al., 2019b]. As for the highly irregular models without
regularization, many well-known studies rely on FPGAs’ flexibility to efficiently handle
the irregularities and deliver real-time inference [Han et al., 2016, Wang et al., 2018, Geng
et al., 2021, Geng et al., 2019b, Geng et al., 2020].
In contrast with NN inference, NN training is mainly conducted on GPU and CPU
clusters, in part due to their better programmability [Huang et al., 2019, Narayanan et al.,
2019]. However, many efforts have been devoted to training NN models with FPGA clus-
ters and have demonstrated that FPGAs are competitive or superior to other computing
techniques [Geng et al., 2018, Wang et al., 2020].
Chapter 3
AWB-GCN: A Graph Convolutional Network
Accelerator with Runtime Workload Autotuning
This chapter introduces the architecture of AWB-GCN (a GCN accelerator with
Autotuning-based Workload Balance). AWB-GCN is able to handle GNN-level DSL irreg-
ularity as it is equipped with a novel runtime workload distribution autotuning technique.
This chapter is based on the work published in the 53rd Annual IEEE/ACM International
Symposium on Microarchitecture (MICRO) ©2020 IEEE [Geng et al., 2020].
3.1 Introduction
As mentioned in Chapter 1, classical deep learning paradigms such as CNNs and RNNs
[Krizhevsky et al., 2012, Mikolov et al., 2010] do not have the capability to analyze latent
information from non-Euclidean data structures. As a result, the adoption of neural net-
works is greatly limited in fields with complex relationships among objects. However, a
large and increasing number of real-world applications use non-Euclidean data structures
that are modeled as graphs. Nodes and edges represent objects and relationships between
those objects, respectively, as appropriate for the application. Most of these graphs have
a tremendously large number of nodes; moreover, the node degree generally varies dramatically,
often following a power-law distribution [Gonzalez et al., 2012, Abou-Rjeili and
Karypis, 2006,Latapy, 2008,Xie et al., 2014,Aiello et al., 2001,Chung et al., 2004,Adamic
et al., 2001].
The irregularity of the graph data makes most of the existing NN algorithms ill-suited;
critical feature extraction operations, such as convolutions, are no longer applicable. To
tackle this issue, GNNs have been proposed to extend deep learning approaches to graph
data [Gori et al., 2005, Micheli, 2009, Scarselli et al., 2008, Li et al., 2015, Dai et al., 2018,
You et al., 2018, Abu-El-Haija et al., 2018, Wu et al., 2020b, Gao et al., 2018b]. Among
various GNNs, the Graph Convolutional Network (GCN), an approach that marries some
ideas of CNNs to the distinct needs of graph data processing, has demonstrated significant
potential and become one of the most important topics in NN-based graph research [Henaff
et al., 2015, Bruna et al., 2014, Defferrard et al., 2016, Kipf and Welling, 2016, Yun et al.,
2019].
With the rapid development of GCNs, designing dedicated hardware accelerators has
become an urgent issue [Yan et al., 2020]. GCNs have already been investigated in a large
number of real-world applications [Wu et al., 2020b], including electric grid cascading
failure analysis [Liu et al., 2020], prediction of chemical reactivity [Coley et al., 2019], and
cybersecurity [Nguyen et al., 2018]. Many of these applications pose stringent constraints
on latency and throughput.
As discussed in Chapter 1, accelerators developed for other domains, such as the SCNN
accelerator [Han et al., 2016, Zhang et al., 2016b, Kim et al., 2017], are not likely to be
optimal as GCN accelerators. There are several reasons. (i) GCN matrices have highly
unbalanced non-zero distributions. Non-zeros can be regionally clustered, which leads to
computing challenges and low system utilization [Gonzalez et al., 2012, Abou-Rjeili and
Karypis, 2006, Latapy, 2008] due to workload imbalance [Xie et al., 2014]. Figure 3·1
compares the distribution of non-zeros between a typical adjacency matrix in a GCN and
a typical sparse-weight matrix in a CNN: the distribution of non-zeros is much more bal-
anced in the SCNN. (ii) GCN matrices have extremely high sparsity. The sparsity of the
adjacency matrices can be over 99.9%, while the sparsity of SCNNs is normally below
50%. Therefore, in GCNs the indices of consecutive non-zeros are often highly scattered,
which makes it very challenging to multiplex enough valid pairs of non-zeros in
each cycle to keep a parallel compute system busy, especially when the system is large (see
Section 3.5). (iii) Large matrix size. Real-world graphs can be very large. For example,
the Reddit graph has 233K nodes and 115M edges. Its 233K×233K adjacency matrix requires
1.7Tb of storage in dense format or 11.0Gb in sparse format, which requires the use of
off-chip memory. Although classical neural networks also have large models, the matrix of
a particular layer is much smaller and often can fit easily into on-chip memory. Figure 3·1
compares the matrix sizes of CNNs and GCNs. Furthermore, it is easier to compress and
optimize the matrices of classical neural networks during the training phase than those of
GCNs.
Figure 3·1: Histograms show ordered non-zero per-row density. Left: Adjacency
matrix of the NELL graph (avg. density: 0.0073%) has most of its non-zeros
clustered in 70/66k rows. Right: Unstructured compressed AlexNet
weight matrix (avg. density: 27%) has workload roughly balanced across
384 rows.
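The Reddit storage figures quoted above can be reproduced with a few lines of arithmetic. The encoding is an assumption, since the text does not state it: 32-bit values, and for the sparse case one 32-bit row index plus one 32-bit column index per non-zero (a coordinate-style triple). Under those assumptions the results round to the quoted 1.7 Tb and 11.0 Gb.

```python
NODES = 233_000          # "233K nodes"
EDGES = 115_000_000      # "115M edges"
BITS_PER_VALUE = 32      # assumption: 32-bit values
BITS_PER_INDEX = 32      # assumption: 32-bit row and column indices

# Dense format stores every entry of the NODES x NODES adjacency matrix.
dense_bits = NODES * NODES * BITS_PER_VALUE

# Sparse format (assumed coordinate-style): value + row index + column index.
sparse_bits = EDGES * (BITS_PER_VALUE + 2 * BITS_PER_INDEX)

dense_tb = dense_bits / 1e12
sparse_gb = sparse_bits / 1e9
print(f"dense:  {dense_tb:.2f} Tb")
print(f"sparse: {sparse_gb:.2f} Gb")
```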
For these reasons, novel and efficient accelerator designs are urgently required to accel-
erate GCN workloads. We therefore propose AWB-GCN, a hardware accelerator for GCN
inference with workload auto-tuning. It monitors the workload distribution at three levels
at runtime and, accordingly, rebalances the distribution per round. In the AWB-GCN design,
three techniques are proposed: distribution smoothing, remote switching, and evil row
remapping. Distribution smoothing balances the workload among neighbors. In matrices
following the power-law distribution, non-zero elements are usually clustered, and, in some
cases, appear in just a few rows/columns (Figure 3·2). Given only distribution smoothing,
it would be slow and difficult for an autotuner to converge and achieve good load balance.
We solve this problem with remote switching and evil row remapping. Remote switching
shuffles workloads of regions with the most and least clustered non-zero elements, making
efficient distribution smoothing possible. If a row is observed to still contain too many
elements to be smoothed or balanced by remote switching, it is designated as an evil row.
AWB-GCN partitions that row and remaps its non-zero elements to multiple regions (with
least clustered elements). Figure 3·3 shows the resulting per-round improvement in hardware
utilization as these methods are applied (Nell GCN).
Figure 3·2: Adjacency matrix of NELL following power-law distribution:
elements are clustered regionally and in a few rows/cols. The matrix density
is 0.0073%. For better visualization, non-zero dots are enlarged.
Figure 3·3: AWB-GCN utilization improvement per round.
This chapter makes the following contributions:
• We propose a novel and efficient architecture for accelerating GCNs and SpMM
kernels for matrices with a power-law distribution.
• To handle the extreme workload imbalance, we propose a hardware-based workload
distribution autotuning framework, which includes an efficient online workload pro-
filer and three workload rebalancing techniques.
• We evaluate AWB-GCN using an Intel D5005 FPGA Acceleration Card with five of
the most widely used GCN datasets. Results show that a 4K-Processing-Element (PE)
AWB-GCN improves the PE utilization on average by 7.7× as compared with the
baseline without workload rebalancing. Compared with Central Processing Units
(CPUs) (Intel Xeon E5-2680v3 + PyTorch Geometric (PyG)), Graphics Process-
ing Units (GPUs) (NVIDIA Quadro RTX 8000 + PyG), and prior art, AWB-GCN
achieves average speedups of 3255×, 80.3×, and 5.1×, respectively.
The organization of this chapter is as follows. In Section 3.2, we introduce the baseline
architecture of AWB-GCN. Section 3.3 presents the overall architecture of AWB-GCN with
run-time workload auto-tuning. The efficiency of AWB-GCN is evaluated in Section 3.4.
Section 3.5 discusses related work. Section 3.6 concludes this work. Note that the
background of GCNs is given in Chapter 2.
3.2 GCN Baseline Architecture
This section introduces the multi-core baseline architecture for GCN acceleration. This
baseline supports efficient processing of power-law graphs with ultra-high sparsity and
large sizes. This design alone cannot address the workload imbalance issue of power-law
graphs, but builds a foundation for its further augmentation, described in the next section,
which achieves near-optimal workload balancing.
3.2.1 Matrix Computation Order
As presented in Chapter 2, the computation of each GCN layer consists of at least 2 consec-
utive SpMMs: A×X×W (see Equation 2.1). To compute A×X×W at each GraphCONV
layer, there are two alternative computation orders: (A×X)×W and A× (X ×W ). The
choice is significant as it dictates the volume of non-zero multiplications. Based on pro-
filing, A is ultra sparse and large, X is generally sparse and usually has a large number of
columns, and W is small and dense. For (A×X)×W , since multiplying A and X requires
complex sparse-sparse-matrix-multiplication and produces a very large dense matrix, mul-
tiplying their product by another dense matrix W leads to significant computation workload
and long delay. Alternatively, for A× (X ×W ), both are sparse-dense matrix multiplica-
tions (SpMM) and the scale of computation is drastically smaller. Table 3.1 lists the amount
of computation for the five datasets following the two approaches. Since the difference is
quite obvious, in this design we first perform X×W and then multiply with A.
Table 3.1: Operations required under different execution orders.
Order (operations)   CORA     CITESEER   PUBMED    NELL    REDDIT
(A×X)×W              62.8M    198.0M     165.5M    258G    83.3G
A×(X×W)              1.33M    2.23M      18.6M     782M    21.4G
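The effect of the computation order can be reproduced on small synthetic matrices. The NumPy sketch below counts multiplications between non-zero operand pairs for both orders. This is one plausible way of counting and is not claimed to be the method behind Table 3.1. The matrix sizes, densities, and helper names are illustrative assumptions.

```python
import numpy as np

def nnz_mults(left: np.ndarray, right: np.ndarray) -> int:
    """Multiplications needed for left @ right when zero operands are skipped.

    Column j of `left` meets row j of `right`, so the count is the sum over j
    of nnz(left[:, j]) * nnz(right[j, :]).
    """
    left_col_nnz = np.count_nonzero(left, axis=0)
    right_row_nnz = np.count_nonzero(right, axis=1)
    return int(left_col_nnz @ right_row_nnz)

def random_sparse(rng, shape, density):
    """Random matrix whose entries are non-zero with probability `density`."""
    return rng.random(shape) * (rng.random(shape) < density)

rng = np.random.default_rng(0)
nodes, features, hidden = 300, 200, 16       # hypothetical sizes
A = random_sparse(rng, (nodes, nodes), 0.02)     # very sparse adjacency
X = random_sparse(rng, (nodes, features), 0.10)  # sparse features
W = rng.standard_normal((features, hidden))      # small, dense weights

ops_ax_then_w = nnz_mults(A, X) + nnz_mults(A @ X, W)   # (A x X) x W
ops_a_then_xw = nnz_mults(X, W) + nnz_mults(A, X @ W)   # A x (X x W)
print(ops_ax_then_w, ops_a_then_xw)
```

Both orders give the same result matrix; only the amount of work differs, because A×X fills in many non-zeros before the dense multiplication by W.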
3.2.2 SpMM Execution Order and Mapping
We perform column-wise-product-based SpMM [Gao et al., 2020, Deveci et al., 2017, Chen
et al., 2019] as follows. Given S×B = C, if S is (m×n), B is (n×k), and C is
(m×k), then C can be written as:
$$C = \left[\sum_{j=0}^{n-1} S_j b_{(j,0)},\ \sum_{j=0}^{n-1} S_j b_{(j,1)},\ \ldots,\ \sum_{j=0}^{n-1} S_j b_{(j,k-1)}\right] \tag{3.1}$$
where $S_j$ is the jth column of S and $b_{(j,k)}$ is an element of B at row-j and column-k. In other
words, by broadcasting the jth element from column-k of B to the entire column-j of S, we
can obtain a partial column of C. Essentially, B is processed in a streaming fashion: each
element $b_{(j,k)}$ finishes all computation it involves at once and is then evicted. In this way,
we reuse the entire sparse matrix S for each column of C (k times in total). To reduce off-chip
memory access for matrix S, we apply inter-layer data forwarding and matrix blocking
techniques (discussed in Section 3.2.4).
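The column-wise product of Equation 3.1 can be sketched in software as follows. This is a functional model only, assuming S is held in CSC form as three arrays (`values`, `row_idx`, `col_ptr`); the helper names are illustrative and nothing here models the hardware datapath or its timing.

```python
import numpy as np

def dense_to_csc(mat: np.ndarray):
    """Return (values, row_idx, col_ptr) describing `mat` in CSC form."""
    values, row_idx, col_ptr = [], [], [0]
    for j in range(mat.shape[1]):
        rows = np.nonzero(mat[:, j])[0]
        values.extend(mat[rows, j])
        row_idx.extend(rows)
        col_ptr.append(len(values))
    return np.array(values, dtype=float), np.array(row_idx, dtype=int), np.array(col_ptr)

def spmm_column_wise(values, row_idx, col_ptr, b: np.ndarray, m: int) -> np.ndarray:
    """Compute C = S @ B, where S (m x n) is in CSC form and B (n x k) is dense."""
    n, k = b.shape
    c = np.zeros((m, k))
    for col in range(k):                      # one output column of C at a time
        for j in range(n):                    # stream the elements b[j, col]
            scalar = b[j, col]
            if scalar == 0.0:
                continue                      # skip zero operands
            start, end = col_ptr[j], col_ptr[j + 1]
            # Broadcast b[j, col] to every non-zero of column j of S and
            # accumulate into the rows of C that those non-zeros belong to.
            c[row_idx[start:end], col] += values[start:end] * scalar
    return c

rng = np.random.default_rng(0)
s_dense = rng.random((6, 5)) * (rng.random((6, 5)) < 0.3)
b_dense = rng.random((5, 4))
c_out = spmm_column_wise(*dense_to_csc(s_dense), b_dense, m=s_dense.shape[0])
print(np.allclose(c_out, s_dense @ b_dense))
```

Each element of B is used once and then discarded, while the CSC arrays of S are traversed once per output column, which matches the reuse pattern described above.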
This design has additional advantages when S and C are stored in Compressed-Sparse-
Column (CSC) format. Furthermore, it provides opportunities to pipeline multiple SpMM
operations, as is discussed in Section 3.2.4. Moreover, the column-wise product creates massive
opportunities for workload distribution autotuning, which is key to achieving high performance.
Figure 3·4(A) shows the column-wise order for calculating C. The columns of
S and elements of B in the same color are multiplied and stored as partial results in C with
the same color.
In the baseline design, with the assumption that non-zeros are evenly distributed among
the rows, we use a direct and static mapping from matrix rows to PEs to avoid expensive
parallel reduction in hardware as illustrated in Figure 3·4(B).
Figure 3·4: (A) SpMM computation order: Column-wise-product; (B) Ma-
trix partitioning & mapping among PEs.
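To see why a static row-to-PE mapping depends on evenly distributed non-zeros, the sketch below compares PE utilization for an even and a skewed row distribution. It is a simplified model with several assumptions: consecutive blocks of rows are assigned to each PE, every PE must wait for the most loaded one, and the skewed distribution is a synthetic rank-size curve rather than a real dataset.

```python
import numpy as np

def pe_workloads(row_nnz: np.ndarray, num_pe: int) -> np.ndarray:
    """Non-zeros handled by each PE when consecutive row blocks map to PEs."""
    return np.array([block.sum() for block in np.array_split(row_nnz, num_pe)])

def utilization(workloads: np.ndarray) -> float:
    """Average busy fraction if all PEs wait for the most loaded PE."""
    return float(workloads.mean() / workloads.max())

rows, num_pe = 4096, 64

# Even case: every row holds the same number of non-zeros.
even_rows = np.full(rows, 8)

# Skewed case (synthetic): the k-th heaviest row holds about rows / k non-zeros,
# and heavy rows sit next to each other, i.e. they are regionally clustered.
skewed_rows = (rows / np.arange(1, rows + 1)).astype(int) + 1

util_even = utilization(pe_workloads(even_rows, num_pe))
util_skewed = utilization(pe_workloads(skewed_rows, num_pe))
print(f"even: {util_even:.2f}, skewed: {util_skewed:.3f}")
```

In this toy model the PE that owns the heaviest rows dominates the runtime, which is the imbalance that the autotuning techniques of the next section are meant to remove.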
3.2.3 Design of Baseline Architecture
Figure 3·5 illustrates the baseline design for SpMM calculation with efficient support of
skipping zeros. The architecture comprises the modules sparse-matrix-memory (SpM-
MeM), dense-column-memory (DCM), task-distributor & Queue (TDQ), PE-array, and
accumulation-buffers-array (ACC Buffer). SpMMeM buffers the input sparse matrix S
(from off-chip) and feeds non-zeros and their indices to TDQ. DCM buffers the input dense
matrix B and broadcasts its elements to TDQ. TDQ distributes tasks to the PEs. The PE-
array performs concurrent multiplication of non-zero pairs, partial result accumulation, and
data exchange with the ACC Buffers. Finally, the ACC Buffers cache the partial results of
the resulting matrix C for accumulation and send them to the next SpMM engine at the
completion of a whole column calculation. Depending on the sparsity and storage format
of S (dense or CSC), we have two alternative designs for TDQ:
TDQ-1 (Figure 3·5-left) is used when S is generally sparse (sparsity < 75%) and stored in
dense format. We perform the direct row partition as discussed and map non-zeros to the
input buffer of the corresponding PEs (Figure 3·4(B)). In each cycle, $N_{PE}/(1 − Sparsity)$
elements are forwarded to the PE array. Only non-zeros are kept in the queues. Here $N_{PE}$
denotes the number of parallel PEs. Given evenly distributed non-zeros, each PE receives
one non-zero per cycle to calculate. In practice, however, the distribution can be very
imbalanced and each PE may receive up to 1/(1 − Sparsity) non-zeros in one cycle.
Therefore, each PE is equipped with multiple Task Queues (TQs) guaranteeing enough
concurrency to cache all valid data. As shown in Figure 3·5-(left), in each cycle a PE can
receive up to 4 non-zero elements (sparsity < 75%). Each PE has four task queues to buffer
them.
Figure 3·5: Architecture of the proposed baseline SpMM engine.
In each cycle, the Read-after-Write (RaW) checker checks RaW hazards for the elements
in the TQs; the arbiter selects a non-empty queue, pops an element without hazard, and
forwards it to the PE for processing. Since the computations are all floating-point, the
pipelined MAC unit has a relatively long latency, making RaW hazards
a problem. To address this problem, each PE is equipped with a RaW-check-unit and a stall
buffer. If a RaW hazard occurs, the new job is cached in the stall buffer until the hazard is
resolved.
TDQ-2 (Figure 3·5-right) is used when S is ultra-sparse and stored in CSC format. Since
in CSC the non-zeros are contiguous in a dense array, if we can directly process the dense
array, we gain from avoiding all the zeros. However, we suffer from the overhead of navi-
gating to the correct PE as the indices of neighboring elements are highly scattered. We use
a multi-stage Omega-network for routing the non-zeros to the correct PE according to their
row indices. Each router in the Omega-network has a local buffer in case the buffer of the
next stage is saturated. This design attempts to balance the data forwarding rate and the pro-
cessing capability of the PEs by sending NPE non-zeros per cycle. This is achieved when
non-zero elements are distributed evenly among rows. Compared with a global crossbar
network, the Omega-network design scales better and incurs lower hardware complexity.
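The routing step can be illustrated with the textbook destination-tag scheme for an Omega network. Whether AWB-GCN uses exactly this scheme is not stated in the text, so the sketch is an assumption; it also ignores buffering and contention and only shows that each 2×2 switch can steer a non-zero using one bit of the destination PE index.

```python
def omega_route(src: int, dst: int, stages: int):
    """Port positions after each stage of a 2**stages-port Omega network.

    Destination-tag routing (assumed here): each stage applies a perfect
    shuffle, then the 2x2 switch sets the lowest position bit to the next
    bit of the destination index, most significant bit first.
    """
    size = 1 << stages
    pos, path = src, []
    for stage in range(stages):
        # Perfect shuffle: rotate the position bits left by one.
        pos = ((pos << 1) | (pos >> (stages - 1))) & (size - 1)
        # Switch setting: upper output for bit 0, lower output for bit 1.
        dst_bit = (dst >> (stages - 1 - stage)) & 1
        pos = (pos & ~1) | dst_bit
        path.append(pos)
    return path

# Hypothetical example: a non-zero entering at port 5 whose row maps to PE 2.
example_path = omega_route(src=5, dst=2, stages=3)
print(example_path)
```

With N ports the network needs log2(N) stages of N/2 switches each, which is the basis for the scaling advantage over a global crossbar mentioned above.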
When a PE receives a new non-zero pair [d1,d2] from TDQ, it (1) performs the new
multiplication task with d1,d2, (2) fetches the corresponding partial results [d_acc] of
output matrix C from the ACC buffers or the accumulation registers attached to Multiply-
Accumulate-Units (MACs) according to the newly received row index, (3) accumulates the
multiplication result and d_acc, and (4) updates the ACC buffers with the new accumula-
tion result. Each PE is coupled with an ACC buffer to store the rows of C it accounts for.
A PE has two units: an Address-Generation-Unit (AGU) for result address generation and
forwarding, and a MAC attached with multiple accumulation registers. Note that this extra
layer of storage can not only relieve the RaW hazard but also reduce the communication
demand between PEs and ACC Buffers. More details are omitted due to space limitations.
Since C is a dense matrix and stored in dense format, the rows of C are statically parti-
tioned among ACC buffers. Synchronization is only needed when an entire column of the
resulting matrix C is completely calculated.
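The four PE steps and the static partitioning of C can be sketched in Python as follows. This is a behavioral model only: the ACC buffer is a dictionary, rows are assumed to be interleaved across PEs (`row % n_pe`), and the names `PE` and `spmm_column` are hypothetical.

```python
class PE:
    """One processing element with its own ACC buffer (a dict here)."""

    def __init__(self, rows):
        # The ACC buffer holds one partial result per row of C owned by this PE.
        self.acc = {row: 0.0 for row in rows}

    def process(self, d1, d2, row):
        product = d1 * d2                # (1) multiply
        d_acc = self.acc[row]            # (2) fetch the partial result
        self.acc[row] = d_acc + product  # (3) accumulate, (4) write back


def spmm_column(sparse_cols, dense_col, n_rows, n_pe):
    """Compute one column of C = S x B with the rows of C split across PEs.

    sparse_cols[k] lists the (row, value) non-zeros of column k of S;
    dense_col[k] is element k of the current column of B.
    """
    pes = [PE(range(p, n_rows, n_pe)) for p in range(n_pe)]
    for k, b in enumerate(dense_col):
        for row, s in sparse_cols[k]:
            pes[row % n_pe].process(s, b, row)
    return [pes[r % n_pe].acc[r] for r in range(n_rows)]
```

For S = [[1, 0], [2, 3]] and the column b = [4, 5], `spmm_column([[(0, 1.0), (1, 2.0)], [(1, 3.0)]], [4.0, 5.0], 2, 2)` yields the column [4.0, 23.0]; the PEs only need to synchronize once that column is complete.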
Overall, for each layer of GCN, we first execute SpMM on X × W. Since X is generally
sparse (except the first layer) and stored in dense format, we use TDQ-1. The result of XW
is dense. We then compute A × (XW), which again is SpMM. However, as A is ultra-sparse
and stored in CSC format, we use TDQ-2. The result is dense, but after ReLU, a large
fraction of the entries become zero, and we again have a sparse matrix as the input feature
matrix for the next layer.
3.2.4 Pipelining SpMM Chains
Intra-Layer SpMM Pipelining: One can exploit the parallelism between consecutive Sp-
MMs (i.e., X ×W and A× (XW )) in a layer through fine-grained pipelining. This is based
on the observation that A is constant for the inference of a certain graph. Once a column
of (XW ) is calculated, we can start the multiplication of this column with A immediately
without waiting for the entire XW (see Figure 3·6). This design has two major benefits: (i)
we gain extra parallelism and reduce the overall latency through this fine-grained pipelin-
ing, and (ii) instead of requiring off-chip storage to cache the big resulting XW matrix, we
only need to buffer a single column of XW ; this can be done on-chip. This method can be
reused within a GCONV layer if (AXW ) is left-multiplied by any other sparse matrices. For
example, some GCNs collect information from 2-hop neighbors so the layer formulation
becomes A× (A× (X ×W )) and the three multiplications can be pipelined and processed
in parallel.
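The column-level dataflow can be sketched with Python generators. The sketch is sequential software, so it models only the buffering behavior (one column of XW is live at a time), not the concurrency of the hardware pipeline; dense nested lists and the function names are assumptions made for brevity.

```python
def xw_columns(X, W):
    """Yield the columns of X @ W one at a time (producer stage)."""
    for j in range(len(W[0])):
        yield [sum(X[i][k] * W[k][j] for k in range(len(W)))
               for i in range(len(X))]


def a_times_columns(A, columns):
    """Consume columns as they arrive and yield A @ column (consumer stage)."""
    for col in columns:
        yield [sum(a * c for a, c in zip(row, col)) for row in A]


def gcn_layer_pipelined(A, X, W):
    """Return A @ (X @ W) row-major while buffering one column of XW."""
    out_cols = list(a_times_columns(A, xw_columns(X, W)))
    return [list(row) for row in zip(*out_cols)]
```

A 2-hop layer would simply chain a second `a_times_columns` stage onto the first, mirroring the three-stage pipeline described above.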
Inter-Layer SpMM Pipelining: SpMMs from different layers can also be pipelined. To
avoid pipeline bubbles and large intermediate buffers, we allocate hardware resources (PEs)
in proportion to the workload of each layer (Figure 3·6). In this way, the output generation
of the previous layer matches the data consumption of the current layer, so that the execu-
tion time of different layers is similar, given optimal workload balance and PE utilization.
Pipelining SpMMs from different layers has two benefits. First, it exploits inter-layer par-
allelism. Second, since A is shared for all GCONV layers in the inference of a particular
graph, it can be reused by SpMM engines across the layers, so off-chip accesses of A are
only required by the first layer. This is done by forwarding elements of A through the
layers.

Figure 3·6: Pipelined SpMMs: data production and consumption rates match across consecutive SpMMs by allocating PEs in proportion to workload sizes.
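As an illustration of allocating PEs in proportion to workload, the following Python sketch splits a PE budget across pipeline stages. The largest-remainder rounding and the workload figures in the usage note are assumptions chosen for the example; the text above does not specify how fractional shares are resolved.

```python
def allocate_pes(workloads, total_pes):
    """Split total_pes across stages in proportion to their workloads.

    Uses the largest-remainder method so the shares sum to total_pes.
    """
    total = sum(workloads)
    exact = [w * total_pes / total for w in workloads]
    shares = [int(x) for x in exact]
    # Hand the PEs lost to truncation to the stages with the largest remainders.
    leftovers = sorted(range(len(workloads)),
                       key=lambda i: exact[i] - shares[i], reverse=True)
    for i in leftovers[:total_pes - sum(shares)]:
        shares[i] += 1
    return shares
```

With hypothetical workloads of 300 and 100 operations per column and 1024 PEs, `allocate_pes([300, 100], 1024)` gives [768, 256], so both stages need the same number of cycles per column under perfect utilization.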
Bandwidth Analysis: Off-chip data access of the big Adjacency matrix A can be a concern.
However, as AWB-GCN always requests and consumes data with continuous addresses, the
off-chip memory bandwidth and the burst mode access can be efficiently utilized. Also, we
use three extra methods to reduce the off-chip bandwidth requirement: (1) as mentioned
above, A is reused across layers; (2) matrix blocking is used to improve the data locality
and reuse of matrix A. Figure 3·7 illustrates how the proposed matrix blocking works
without affecting the efficiency of the rebalancing techniques which are discussed in the
next section. The numbers in the figure are execution orders. A is partitioned into multiple
blocks. Instead of calculating each column of A(XW ) by multiplying all blocks of A and
the corresponding column of (XW ), we calculate t columns of A(XW ) in parallel. The
calculation of a certain block of A will not start until the previous block is reused t times
and finishes calculating its intermediate results of all t columns of the resulting matrix. By
doing so, the data reuse of matrix A is improved by t times. Note that this optimization will
not hurt the efficiency of the autotuning rebalancing of AWB-GCN, as the sub-SpMM of
each block of A is still following column-wise product order. (3) AWB-GCN is equipped
with a scratchpad memory to cache parts of A on-chip as much as possible. For example,
the A and X1 of Cora can be entirely stored on-chip.
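The effect of matrix blocking on the reuse of A can be sketched as follows. Each block is modeled as a list of (row, column, value) non-zeros, and every time a block is brought in a counter is incremented; the data layout and the name `spmm_blocked` are assumptions for illustration, and on-chip caching of A is not modeled.

```python
def spmm_blocked(blocks, b_cols, n_rows, t):
    """Compute C = A x B, processing t columns of B per pass over the blocks.

    blocks is a list of blocks of A, each a list of (row, col, value).
    Returns the columns of C and the number of block fetches.
    """
    c_cols = [[0.0] * n_rows for _ in b_cols]
    block_loads = 0
    for start in range(0, len(b_cols), t):
        group = range(start, min(start + t, len(b_cols)))
        for block in blocks:
            block_loads += 1      # one off-chip fetch of this block
            for j in group:       # reuse the fetched block for up to t columns
                for r, c, v in block:
                    c_cols[j][r] += v * b_cols[j][c]
    return c_cols, block_loads
```

The result is independent of t, while the number of block fetches falls from (number of columns) × (number of blocks) at t = 1 to roughly 1/t of that, which is the t-fold reuse improvement described above.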
Based on our experiments, with the proposed optimizations, the AWB-GCN accelerator
requires at most 503 Gbps off-chip bandwidth to keep the hardware busy with 1024 PEs
for the 5 datasets evaluated. This bandwidth demand can generally be satisfied by current
platforms (e.g., the Intel D5005 FPGA board provides 614 Gbps DDR bandwidth; the VCU-128
FPGA provides 3680 Gbps HBM bandwidth; the NVIDIA V100 provides 7176 Gbps HBM
bandwidth).

Figure 3·7: Matrix Blocking Optimization to reduce the off-chip bandwidth requirement. The sub-SpMM of each pair of blocks is performed in column-wise-product order. The numbers represent execution orders.
3.2.5 The Workload Balance Problem
The baseline architecture works well when non-zeros are evenly distributed among the
rows of A. However, when this assumption does not hold, the performance of the baseline
architecture can degrade considerably due to workload imbalance among PEs. Figure 3·8
illustrates the utilization of 256 PEs processing SpMMs with Adjacency matrices of the
Citeseer and Nell datasets. As mentioned in Section 3.1, evil rows and regionally clustered
non-zeros in power-law graph matrices cause this inefficiency. The existence of evil
rows keeps only a few PEs busy while all others idle most of the time, resulting in significant
major crests in the utilization waves; the regionally clustered non-zero elements result
in the minor crests; the differences in the numbers of non-zeros in neighboring rows result
in other fluctuations.
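A simple model makes the cost of imbalance concrete. The Python sketch below assumes one cycle per non-zero, contiguous blocks of rows per PE, and a barrier at the end of each round, so the round lasts as long as the busiest PE; these are simplifying assumptions rather than a model of the actual engine.

```python
def pe_utilization(nnz_per_row, n_pe):
    """Estimate PE utilization for one SpMM round under static row mapping.

    Assumes one non-zero costs one cycle and that all PEs synchronize at the
    end of the round. The matrix must contain at least one non-zero.
    """
    rows_per_pe = -(-len(nnz_per_row) // n_pe)  # ceiling division
    load = [sum(nnz_per_row[p * rows_per_pe:(p + 1) * rows_per_pe])
            for p in range(n_pe)]
    return sum(load) / (n_pe * max(load))
```

With four PEs, an evenly distributed matrix such as `[1, 1, 1, 1]` gives a utilization of 1.0, whereas a single evil row, as in `[13, 1, 1, 1]`, drops it to 16/52, about 31%, because three PEs wait for the fourth.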
A common software approach for dealing with sparse data structures is to profile the
structure, e.g., with symbolic analysis, and then use that information to guide the “real”
processing. For GCNs, however, it has been demonstrated that the preprocessing stage
can take 10× more time than the inference itself [Yan et al., 2020]. In this work, we
dynamically adjust hardware configurations for workload rebalancing. This design can be
applied to a variety of specialized accelerators for processing sparse data structures.

Figure 3·8: PE utilization waves of 256-PE Baseline SpMM engine processing A × (XW) of Nell and Citeseer.
3.3 AWB-GCN Architecture
In this section, we describe the AWB-GCN architecture. The core is the handling of load
balancing at three levels of granularity: distribution smoothing for local utilization fluctua-
tions among PEs, remote switching for the minor crests, and row remapping for the major
crests.
Figure 3·9 illustrates autotuning with 24 PEs performing SpMM on a power-law matrix.
The gray bars at the top show the execution time of parallel PEs; their lengths change dynamically
through the process. The narrower bars at the bottom show the static density-per-row
of the matrix. Ideally, at the end of autotuning, all bars at the top become short and have
the same length. Each round of autotuning includes two phases: data processing and
distribution smoothing in phase 1, then remote switching and row remapping in phase 2.

Figure 3·9 (a)&(b) illustrate the first round of autotuning. The progression from Figure
(a) to (b) shows the first phase. Figure (a) gives the estimated execution time without distribution
smoothing; Figure (b) shows the actual execution time with distribution smoothing
applied. During phase 1, PEs keep offloading workloads to their less busy neighbors,
resulting in a flatter and smoother execution time wave (shown in (b)). Meanwhile, the
execution time of PEs at the wave crests and troughs is recorded by the Autotuner.
After all the PEs have finished, phase 2 starts. The Autotuner partitions and remaps
evil rows to PEs at troughs and switches workloads of the PEs at the minor crests with the
ones at the troughs. The green and blue arrows in (b) show evil row remapping and remote
switching decisions, respectively. After these decisions are made, the second round of
autotuning starts (Figures (c)&(d)). With remote switching and row remapping determined
in the first round, the initial workload distribution among PEs at the start of the second
round (shown in (c)) can be more efficiently balanced by distribution smoothing (shown in
(d)). The blue arrows in (d) show that remote switching not only finds new pairs of PEs to
switch workloads, but also adjusts the switch fractions determined in the previous round.
After several rounds, the system converges to an optimally balanced state; this is then used for
the remainder of the computation.
All profiling and adjustment are performed at runtime. We now present design details.
3.3.1 Distribution Smoothing
At the start of processing, rows are evenly distributed among PEs as introduced in Section
3.2.2 (as shown in Figure 3·4(B)). During the calculation of each round, we employ distribution
smoothing by averaging out the workloads among neighbors. The architecture
monitors the runtime PE utilization by tracking the number of pending
tasks in TQs and keeps offloading the work of PEs with more pending tasks to their less
busy neighbors. However, the offloaded work needs to be sent back to the ACC buffers of
its original PE. PEs may offload workloads among direct neighbors, 2-hop neighbors, or even
3-hop neighbors, but not farther ones.

Figure 3·9: Rebalancing process: distribution smoothing, remote switching
and row remapping per round. (a)&(b): 1st round; (c)&(d): 2nd round.
Figure 3·10 illustrates the hardware design of 1-hop distribution smoothing for TDQ-1
and TDQ-2.
TDQ-1: Before a new task is pushed into the TQ of a PE, the PE compares the number of
pending tasks with those in the neighboring TQs. The task is then forwarded to the TQ with
the fewest pending tasks. If forwarded to a neighbor, the result needs to be returned to the
ACC buffers of its original PE after accumulation (see Figure 3·10-(B)). Note that in order
to match the computation rates and the data access concurrency, the rows assigned to each
PE are distributively mapped and stored in the ACC buffers of this PE and its neighbors in
an interleaved manner. The calculation of valid return address and accumulation of partial
results are done in the neighbor PE.
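The TDQ-1 forwarding decision can be sketched in Python as below. The tie-breaking rule (prefer the home PE, then the nearer neighbor) and the fact that queues never drain in this model are assumptions made to keep the example small; the hardware comparison logic may differ.

```python
def choose_queue(pending, home, hops=1):
    """Pick the task queue for a task whose home PE is `home`.

    pending[p] is the number of tasks waiting in the TQ of PE p. The task
    goes to the least-loaded queue within `hops` neighbors; ties favor the
    home PE, then the nearer neighbor.
    """
    lo = max(0, home - hops)
    hi = min(len(pending) - 1, home + hops)
    return min(range(lo, hi + 1),
               key=lambda p: (pending[p], abs(p - home), p))


def dispatch(tasks, n_pe, hops=1):
    """Push (home_pe, payload) tasks; return queues of (home_pe, payload)."""
    queues = [[] for _ in range(n_pe)]
    for home, payload in tasks:
        target = choose_queue([len(q) for q in queues], home, hops)
        # The home PE travels with the task so the result can be returned
        # to the ACC buffer of its original PE.
        queues[target].append((home, payload))
    return queues
```

Dispatching three tasks that all belong to PE 0 on a three-PE system with 1-hop smoothing places two in queue 0 and one in queue 1, while PE 2 stays out of reach; raising `hops` to 2 would let it help at the cost of wider comparison logic.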
TDQ-2: The final layer of the multi-stage Omega network handles neighbor task forward-
ing. As shown in Figure 3·10-(C) (also in Figure 3·11), multiple PEs share the same final-
layer switch; we refer to these PEs as a group. AWB-GCN keeps tracking the usage of TQs
of the final layer. Once a new task is forwarded to the final-layer switch, the TQ usages
among neighbors are compared and then the task is routed to the PE with the lowest TQ
usage. To enable PEs on the group edge (i.e., the leftmost or rightmost PEs per group)
to communicate with their out-of-group neighbors, we augment the Omega-network by
adding 2 extra links per switch in the final layer, as shown in Figure 3·10-(D). Note that
Figure 3·10-(D) shows sharing only among 1-hop neighbors. By considering more distant
hop neighbors, a more balanced design is obtained at the cost of higher hardware complex-
ity and area. This is discussed in the evaluation section.
Distribution smoothing helps remove local utilization fluctuations (Figures 3·9(a) to
(b)), but is not sufficient when (1) non-zeros are clustered in a region across many PEs, so
that neighbors are mostly busy and have no chance to help each other, resulting in minor
utilization crests (PE20,21,22 in Figure 3·9(b)); or (2) most non-zeros are clustered in only
a few rows so that the major crests cannot be eliminated even if all neighboring PEs help
(PE9 in Figure 3·9(b)).
3.3.2 Remote Switching
To address regional clustering, we propose remote switching. This process partially or
completely exchanges the workloads between under- and overloaded PEs, i.e., at centers
of utilization wave troughs and crests, respectively. The switch fraction is determined at
runtime by an autotuner and is based on per-round PE utilization. As the sparse matrix A
is reused during the processing per round, the switch strategy generated in prior rounds is
valuable in the processing of later rounds. The accelerator remembers the switch strategies
used in the current round and incrementally optimizes them based on the utilization
information obtained in the next round. In this way, remote switching is able to flatten the
crests and troughs; after several rounds of autotuning, the switch strategy best matching the
sparse structure of A is obtained, and is used for the remaining rounds for almost perfect
PE utilization.

Figure 3·10: Simplified architecture of distribution smoothing.
The hardware design is shown in Figure 3·11. The over-loaded and under-loaded PEs
are identified by using the PE Status Monitor (PESM) of the Autotuner during phase 1 of
autotuning. Recall that each TQ has a counter to track the number of pending tasks; these
can trigger an empty signal when reaching zero. These empty signals are connected to the
PESM. At each cycle, the updated empty signals are XORed with their values recorded on the
previous cycle. The XOR results indicate which PEs are newly finished; this information
is stored in the Switch Candidate Buffer.

Figure 3·11: Overall architecture of SpMM engine in AWB-GCN with
three rebalancing techniques: distribution smoothing, remote switching (red
bordered) and evil row remapping (purple bordered). Here every 128 PEs
have one Super-PE and four Labor-PEs. These numbers can be customized.
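The empty-signal bookkeeping can be sketched with integers used as bit vectors. The function names and the bit-vector encoding (bit p set when the TQ of PE p is empty) are assumptions for illustration.

```python
def newly_finished(prev_empty, curr_empty):
    """Return IDs of PEs whose empty signal changed since the last cycle.

    Both arguments are integers used as bit vectors: bit p is 1 when the
    task queue of PE p is empty.
    """
    changed = prev_empty ^ curr_empty
    return [p for p in range(changed.bit_length()) if (changed >> p) & 1]


def all_idle(curr_empty, n_pe):
    """Completion signal: true when every empty signal is asserted."""
    return curr_empty == (1 << n_pe) - 1
```

For example, if only PE 1 was idle in the previous cycle (`0b0010`) and PEs 1 and 3 are idle now (`0b1010`), the XOR isolates PE 3 as newly finished. The PEs recorded earliest are the most under-loaded ones, and those recorded just before `all_idle` becomes true are the most over-loaded.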
At the start of each round (after A is totally sent to TDQs), the arbiter scans the buffer
and records IDs of newly done PEs until enough under-loaded PEs have been found. The
number of PE tuples for switching at each round can be customized. In Figure 3·11, four
tuples of the most over- and under-loaded PEs are selected for remote switching. After
the arbiter finds the first 4 idle PEs, it stops scanning the buffer and instead waits for the
completion signal (the bitwise AND of all empty signals) from the system, which implies all PEs
have become idle. Meanwhile, the Switch Candidate Buffer caches the newly idle info of
the most recent cycles. Whenever the arbiter receives the completion signal it starts to scan
the buffer and continues until the four most over-loaded PEs have been found. Note that the
arbiter does not select neighbor PEs continuously; this guarantees that PE tuples selected
by PESM are at different crests and troughs of the utilization wave.
To avoid thrashing, we only exchange a portion of the workload between PEs. We use
the following equation to calculate the number of jobs (i.e., rows of A) to be switched in
the i-th round (i.e., a column of B), $N_{i\_init}$:
$$N_{i\_init} = G_i/G_1 \times (R/2) \tag{3.2}$$
where Gi is the workload gap of the selected PE tuple at the i-th round, and R is the number
of rows per PE under equal mapping. Here, workload gap is approximated as the difference
of execution cycles to finish all tasks.
In the (i+1)-th round, new PE-tuples are selected and their switch fractions are calculated.
Meanwhile, the autotuner also tracks the post-switching utilization gaps of PE-tuples
selected in the prior rounds and uses them as feedback to adjust the switch fraction
Ni_init; this minimizes the utilization gaps further. The workload switching fraction for
each tracked PE-tuple is adjusted for two or more rounds and is highly likely to converge:

$$N_{i,j} = \begin{cases} G_i/G_1 \times (R/2) & \text{if } j = 0 \\ N_{(i-1),(j-1)} + G_i/G_1 \times (R/2) & \text{if } j > 0 \end{cases} \tag{3.3}$$

where j denotes the number of rounds of fraction update. N_{i,j} indicates that the current
PE-tuple is in its j-th update and its initial fraction to switch was calculated in the (i−j)-th
round. The number of rounds tracked simultaneously can be customized and depends
on the size of the tracking window in the PESM; this is an area/performance tradeoff. In
Figure 3·11, two consecutive rounds are tracked.
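Equations 3.2 and 3.3 can be written directly as two small Python functions. The sketch keeps the result as a real number; how the hardware rounds to whole rows, and the gap values used in the usage note, are assumptions for illustration.

```python
def initial_switch_rows(gap_i, gap_1, rows_per_pe):
    """Equation 3.2: rows to switch for a newly selected PE tuple."""
    return gap_i / gap_1 * (rows_per_pe / 2)


def updated_switch_rows(prev_rows, gap_i, gap_1, rows_per_pe):
    """Equation 3.3 for j > 0: add the correction from the current gap."""
    return prev_rows + initial_switch_rows(gap_i, gap_1, rows_per_pe)
```

Suppose R = 64 rows per PE and a tuple is selected in the first round with a gap of G_1 = 1000 cycles: `initial_switch_rows(1000, 1000, 64)` gives 32 rows. If the gap measured for that tuple in the following round has shrunk to 250 cycles, `updated_switch_rows(32.0, 250, 1000, 64)` raises the switch count to 40 rows, a correction that shrinks as the gap closes.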
Calculation of Equation 3.3 is done in the Utilization Gap Tracker (UGT in Figure 3·11).
To reduce the hardware cost of calculating Gi/G1 × (R/2), we use a hardware-friendly
approximation with threshold-based counting and table lookup: when the most
under-loaded PE is found, the left CNTs in UGT start counting. The execution cycle gap
(G1) at the first round is right-shifted by g bits (the granularity for division approximation).
The result is used as a threshold. Whenever the left CNTs reach the threshold, they reset
to 0 and the right CNT adds 1. When the most over-loaded PE is found, the counting
stops. Assuming the right CNT counts to q, we know the execution time gap at the current
round is approximately q × G1/2^g.
Using q as the address to access the Table for Switch Fraction Lookup (TfSFL), we
obtain the approximate number of rows that need to be switched. Once the number of rows
to be switched is known, it is forwarded to the Workload Distribution Controller (WDC)
together with the corresponding PE IDs. At the start of the next round, the destination PE
of these rows is updated in the Shuffle Switches (SS). By doing so, the non-zeros in these
rows will be forwarded to the post-switching PEs in the coming rounds.
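The counting approximation and the table lookup can be modeled in a few lines of Python. The contents of the lookup table are not given in the text, so `build_lookup` fills it with (q / 2^g) × (R / 2), which follows from the gap estimate q × G1/2^g and Equation 3.2; that choice, the rounding, and the function names are assumptions.

```python
def count_gap_units(gap_cycles, gap_1, g):
    """Model the UGT counters: q increments once every (G_1 >> g) cycles."""
    threshold = gap_1 >> g
    left, q = 0, 0
    for _ in range(gap_cycles):
        left += 1
        if left == threshold:  # left counter resets, right counter adds 1
            left = 0
            q += 1
    return q


def build_lookup(rows_per_pe, g, max_q):
    """Table indexed by q giving rows to switch: (q / 2^g) * (R / 2)."""
    return [round(q * rows_per_pe / (2 << g)) for q in range(max_q + 1)]
```

With G_1 = 1000 cycles and g = 3, the threshold is 125 cycles, so a 250-cycle gap yields q = 2 and an estimated gap of 2 × 1000/8 = 250 cycles. For R = 64, entry 2 of the table is 8 rows, matching G_i/G_1 × (R/2) = 0.25 × 32 without any divider in the datapath.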
Furthermore, in order to reduce the workloads of overloaded PEs more efficiently, all
operations related to the main-diagonal elements at the rows assigned to these PEs are
skipped during processing. Instead of performing these operations, when the required ele-
ments of the dense matrix reach TDQs, they will be directly forwarded to the ACC Buffers
of the post-switching PEs and be accumulated just before the final accumulation results are
sent to the next kernel.
Remote switching followed by distribution smoothing is effective at removing most
crests of the utilization waves. However, the major crests resulting from evil rows, which
have too many non-zeros to be shared only by neighbors, require extra effort.
3.3.3 Evil Row Remapping
We address evil-row clustering by building row remapping support into the remote switch-
ing hardware. With row remapping, the evil row is distributed to the most under-loaded PEs
in troughs; in this way the neighbors of these PEs can help. Row remapping is triggered
based on demand at the end of each round. The autotuner calculates the utilization gaps
between the most over- and under-loaded PEs and determines whether their gaps are too
big for remote switching to handle. If yes, row remapping is performed. The workloads
of the PE overloaded in the current round are switched (temporarily) with a Super-PE in
the next round. During processing of the next round, the Super-PE counts the numbers
of non-zeros per row and finds the evil rows containing the most non-zeros. In the round
after, the workloads of each evil row are partitioned and distributed to a set of Labor-PEs
controlled by the Super-PE.
After evil rows are remapped to Labor-PEs, the original workloads of the Labor-PEs
can still be swapped with the most under-loaded PEs via remote switching; this ensures
that even if the Labor-PEs are overloaded originally, they do not become new crests after
row remapping. If a Labor-PE itself is found to have an evil row, evil row remapping will
first map its workload to the Super-PE, and then distributively remap the evil row back
to Labor-PEs, including the one that originally had the evil row. By remapping evil rows
statically to certain PEs instead of dynamically to random ones, the aggregation of partial
results becomes hardware efficient. If row remapping is not triggered, Super- and Labor-PEs
serve as regular PEs.
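The two steps of evil row remapping, profiling by the Super-PE and partitioning across Labor-PEs, can be sketched in Python. Round-robin partitioning, the tie-break toward the lower row index, and the function names are assumptions; the text does not specify how an evil row is divided.

```python
from collections import Counter


def find_evil_row(nonzero_rows):
    """Super-PE profiling: count non-zeros per row, return the densest row."""
    counts = Counter(nonzero_rows)
    row, _ = max(counts.items(), key=lambda kv: (kv[1], -kv[0]))
    return row


def split_evil_row(products, n_labor):
    """Deal the products of one evil row round-robin to Labor-PEs.

    Each Labor-PE accumulates its share; the partial sums are then added
    together to form the final result for the row.
    """
    partial = [0.0] * n_labor
    for idx, value in enumerate(products):
        partial[idx % n_labor] += value
    return partial, sum(partial)
```

Because the set of Labor-PEs is fixed, the final reduction always combines the same `n_labor` partial sums, which is what keeps the aggregation simple compared with remapping to arbitrary PEs.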
The existence of evil rows is generally the most critical bottleneck, especially when
utilization is lower than 50%. The proposed row remapping technique makes it possible
for the autotuner to find the optimal workload distributions and achieve high utilization. As
evil row remapping is normally triggered during the first few rounds, the utilization of the
system increases rapidly right at the start and the autotuner generally converges quickly.
Figure 3·11 illustrates the hardware support of row remapping. For clarity, only one
Super-PE and its four Labor-PEs are shown. Labor-PEs have an architecture similar to
the normal PE, but are equipped with an extra register array to store the accumulation
results of evil rows and are connected to an adder tree for result aggregation. The aggregated
results of evil rows are cached in a small separate ACC buffer. The Super-PE is much
bigger than other PEs, as it serves as a profiler to find the evil rows. It is equipped with two
extra modules: a parallel sorting circuit that tracks the rows with the most non-zeros; and
a non-zero counter (including a local buffer) that records the number of non-zeros per row.
Workload remapping between Super-PE & Labor-PEs and workload switching between
Super-PE & the PE-with-evil-rows are handled by augmenting the Autotuner as follows.
First, the UGT module is equipped with a comparator to identify whether evil row remapping
is required; if so, the UGT sends the information to the WDC. The WDC
knows the IDs of the Super-PE and Labor-PEs. If row remapping is triggered or an evil row
is found, the entries of the Super- and Labor-PEs in the Distribution Switch Table in the WDC
are updated. This enables workload switching and remapping in the coming round.
3.4 Evaluation
In this section, we evaluate AWB-GCNs with different design choices and compare them
with other platforms processing the same networks.
3.4.1 Evaluation Configuration
We implement AWB-GCNs in Verilog HDL and measure PE utilization, performance,
energy efficiency, and hardware resource consumption on an Intel D5005 acceleration card,
which is equipped with a Stratix 10 SX FPGA. Note that the FPGA is only used as an
evaluation platform to demonstrate the performance of AWB-GCN. The design is a general
architecture that does not leverage any FPGA-specific features.
To measure utilization, we add a counter to each PE to track the number of idle cycles.
The number of operating cycles (latency) is measured by running GCN inference
continuously 1000 times and averaging the results. The hardware consumption
and operating frequency are reported by Quartus Pro 19.4 after synthesis and implementation.
To perform fair cross-platform comparisons, we implement GCNs with PyG [Fey and
Lenssen, 2019], and run them on an Intel Xeon E5-2680-V3 CPU and an NVIDIA RTX 8000
GPU. We also compare AWB-GCN with prior work on GCNs such as HyGCN [Yan et al.,
2020].
The datasets used for evaluation are Cora, Citeseer, Pubmed, Nell and Reddit; these are
the five most widely used publicly available datasets in GCN research.
3.4.2 AWB-GCN Evaluation
Design efficiency is evaluated by comparing the performance, hardware resource consump-
tion, and PE utilization of the 1K-PE baseline design without any rebalancing techniques
(i.e., Baseline) with the four different design choices of 1K-PE AWB-GCNs: (i) 1-hop dis-
tribution smoothing (i.e., Design(A)), (ii) 2-hop distribution smoothing (i.e., Design(B)),
(iii) 1-hop distribution smoothing plus remote switching and row remapping (i.e., De-
sign(C)), and (iv) 2-hop distribution smoothing plus remote switching and row remapping
(i.e., Design(D)). The only exception is for Nell where we use 2-hop and 3-hop distribution
smoothing (rather than 1-hop and 2-hop) due to its extremely clustered distribution.
Figure 3·12 compares the average GCN inference latency and utilization of PEs for the
five designs over the five datasets. The lines show the overall PE utilization. The bars show
the breakdown of execution cycles of different GCN layers. The latency of ReLU is too
low to show in the figure. The off-chip memory access latency is overlapped with computation.
We also mark the latency lower bound assuming theoretically ideal PE utilization. For
Cora, Citeseer, Pubmed, Nell and Reddit, compared with Baseline, Design(B) improves
PE utilization from 38%, 56%, 44%, 7.1% and 82%, to 79%, 77%, 86%, 39%, and 99%,
respectively, leading to 1.94×, 1.25×, 1.56×, 5.93×, and 1.19× performance improvement.
Enabling remote switching further improves PE utilization to 88%, 88%, 93%,
88%, and 99%, bringing the performance gain to 2.11×, 1.41×, 1.62×, 8.75×, and 1.20×.
The results show that AWB-GCN always provides high utilization and close to theoretical
peak performance for datasets with various levels of power-law distribution.

Figure 3·12: Overall performance and PE utilization of 1K-PE AWB-GCN
with five design choices.
In AWB-GCN, hardware resources allocated to different layers are in proportion to their
volume of operations. Thus, when perfect utilization is achieved, the same execution delay
is observed for all layers. As shown in Figure 3·12, the green and red bars have similar
lengths at Design(D), while their lengths vary significantly for the Baseline.

Figure 3·13: Per-SpMM performance and PE utilization of 1K-PE AWB-GCN
with five design choices.
The shaded area in Figure 3·12 represents the performance overhead of the proposed
rebalancing techniques. Distribution smoothing is performed during the processing of PEs,
incurring no overhead, so Designs(A)&(B) are not shaded. For Designs(C)&(D), most of
the tasks for remote switching and row remapping are also performed in parallel with the
processing of PEs, e.g., all tasks at PESM and the utilization gap calculation at UGT. However,
the table lookup for switch fraction at UGT and data update at WDC must be done sequentially
between the processing of two consecutive iterations (columns). They introduce
negligible overheads (shaded area of bars for Design(C)&(D)) before the system converges
to optimal balanced status. The shaded areas are only visible for Cora and Citeseer, whose
workloads are relatively light.

Figure 3·14: (A) Hardware resource consumption normalized to the number
of ALMs and (B) On-chip storage demand of 1K-PE AWB-GCN.
Legend: Remote Switching + Row Remapping; Distribution Smoothing; Others; Omega/Shuffle network; Task Queue; MAC in PE.
Figure 3·13 further breaks down the numbers of execution cycles and shows results for
every SpMM kernel; this demonstrates the benefits of AWB-GCN on kernels with various
sparsity, size and distributions. The shaded area of the bars represents the Sync cycles due
to workload imbalance; the unshaded area represents the Ideal cycles assuming perfect
workload balance. The bars in different colors represent the execution cycles of the four
SpMM kernels in the two-layer GCNs [Kipf and Welling, 2016, Wu et al., 2020b]: A ×
(XW) and X × W at Layer 1 and 2. The lines show the corresponding PE utilizations. As
shown in Figure 3·13, Design(D) significantly reduces the synchronization overheads
for all kernels of the 5 GCN models.

Figure 3·15: AWB-GCN PE (1K) average utilization per round of workload
autotuning.
Comparing SpMM kernels, utilization improves significantly for A× (XW ) at both lay-
ers and X×W at Layer-1. As for X×W at Layer-2, although X is also sparse after activation
is performed, its sparsity is much lower than that of the X at Layer-1 and its non-zero dis-
tribution does not follow the power-law (similar to that of the sparse matrices in SCNNs);
utilization is thus high even with the baseline design.
Figure 3·14(A) compares the hardware resource usage of the five designs over the five
datasets. To show comparable breakdowns to ASIC implementations, the results of hardware
resource usage are normalized to the number of Adaptive Logic Modules (ALMs).
ALM is the basic component of Intel FPGAs. The blue segments represent the resource
usage for the modules of the baseline design including MAC units in PEs, task queue control
logic, omega/shuffle networks, and other modules in the baseline design. The green
and red segments refer to the hardware overheads for the support of distribution smoothing
and remote switching + row remapping. Note that in practical FPGA implementations,
MAC units in PEs are instantiated with floating-point DSP slices. In order to show the area
breakdown more clearly, we normalize the DSP slices to ALMs. As shown in Figure 3·14(A),
the overheads of 1-hop and 2-hop distribution smoothing are on average 3.3% and 7.0%,
respectively; the overhead of remote switching and row remapping is, on average, 1.5%.

Figure 3·16: Scalability evaluation: PE utilization and overall performance
of Baseline, Design(B) and Design(D) of AWB-GCNs with 512, 1K, 2K
and 4K PEs.
Figure 3·14(B) compares on-chip storage demand. That of Task Queues in the Omega-Network
is in blue; the buffers for remote switching + row remapping are in red; the others
are in green. As shown in Figure 3·14(B), the overall storage demands of AWB-GCN with
Design(D) are even lower than the baseline. This is largely due to dramatically reduced
per-PE Task Queue size under more balanced workloads. With much more balanced workload
distributions in Design(D), the congestion and backpressure in Omega-Network are
significantly relieved, making the TQs narrower and shallower.
Finally, Figure 3·15 shows the utilization improvement due to iterative workload auto-
tuning. Rebalancing can be accomplished within 10 iterations. This means that most of the
iterations can benefit from operating under the converged optimal strategy. Note that the
utilization of Nell has a sharp improvement in round 3 due to effective evil row remapping.
3.4.3 Scalability of AWB-GCN
We evaluate the scalability of AWB-GCN by running GCN inference of the five datasets
on the baseline as well as Designs (B) and (D) of AWB-GCN while varying the number of
PEs over 512, 1024, 2048, and 4096. In Figure 3·16, the bars represent the performance
speedup compared with the baseline design with 512 PEs. The lines represent average PE
utilizations.
As shown in Figure 3·16, the PE utilization of the baseline design drops dramatically
with increasing number of PEs. This is because more PEs means fewer rows per PE,
highlighting the imbalance among PEs: they have fewer opportunities to absorb inter-row
imbalance. Due to the dropping PE utilization, the performance speedup shows poor scala-
bility. For AWB-GCN with only distribution smoothing, PE utilization also drops but more
slowly than baseline. Nell is an outlier as the utilization of baseline with 512 PEs is too low
to drop. In contrast, the PE utilization of the complete version of AWB-GCN, Design(D),
is high and stable. The performance scales almost linearly with increasing number of PEs.
3.4.4 Cross-platform Comparison
We evaluate five scenarios: (i) AWB-GCN Design-(D) with 4096 PEs, (ii) PyG-based implementation
on an Intel Xeon E5-2680v3 CPU (PyG-CPU), (iii) PyG-based implementation
on an NVIDIA RTX 8000 GPU (PyG-GPU), (iv) 4096-PE baseline AWB-GCN without
workload rebalancing, and (v) SCNN [Parashar et al., 2017] reproduction with 4096 multipliers
(we build a SystemC-based cycle-accurate simulator for SCNN). We use the five
datasets: Cora, Citeseer, Pubmed, Nell, and Reddit for the evaluation. The GCN model
configuration follows the original GCN algorithm papers [Kipf and Welling, 2016, Zhuang
and Ma, 2018, Chen et al., 2018]. Note that we use half-precision floating point for Reddit.
We label these GCNs “Standard_networks”.
As shown in Table 3.2, despite running at a relatively low frequency, AWB-GCN
achieves, on average, speedups of 2622× and 136× over the well-optimized PyG
implementations on high-end CPUs and GPUs, respectively. It achieves a speedup of 6.1×
over the baseline design without workload rebalancing. For the Nell dataset, the speedup
over the baseline is 18.8×, which demonstrates the impact of workload balancing particularly
clearly. Reddit fails on the GPU because it runs out of memory. AWB-GCN achieves speedups
from 2.3× to 19.7× over SCNN. SCNN is inefficient when working with GCNs because it uses
Cartesian-product-based SpMM, which requires massive and highly irregular reduction of
intermediate results, especially when the matrices are very large, sparse, and follow a
power-law distribution. SCNN also requires very high off-chip bandwidth for the reduction.
In our evaluation, we assume that SCNN is equipped with High Bandwidth Memory (HBM),
which provides sufficient off-chip bandwidth. If DRAM were used instead, the performance
of SCNN would be even lower.
The tremendous speedups over PyG-CPU and PyG-GPU originate from AWB-GCN’s
dedicated architecture which uses features not available on general-purpose CPUs and
GPUs: (a) the dynamic autotuning techniques ensure balanced workload and high PE uti-
lization (Section 3.3); (b) all the SpMM kernels of the GCN layers are deeply pipelined,
leading to reduced on-chip storage demand (Figure 3·6); (c) inter-layer data forwarding and
matrix blocking (with column-wise-product sub-SpMM execution, Figure 3·7) improve data
reuse and guarantee that off-chip memory accesses are to consecutive addresses.
We also compare AWB-GCN with existing GCN accelerators. Prior to this work, to the
best of our knowledge, the design proposed by Yan et al., HyGCN [Yan et al., 2020], is
the only reported accelerator of GCN inference. However, HyGCN customizes the hidden
Table 3.2: Comparison with CPU, GPU and Baseline processing Stan-
dard_networks. OoM: Out of Memory. Units of Latency and Energy ef-
ficiency are ms and graph/kJ.
Platform Standard_networks
[Table body not recoverable from the source; only the final row survives.]
Freq: 330MHz Energy Efficiency 3.08E6 1.93E6 2.48E5 4.12E3 2.09E2
layer of all GCN models to 128 channels, which is distinct from the original settings [Kipf
and Welling, 2016, Zhuang and Ma, 2018, Chen et al., 2018]. We refer to the
HyGCN-customized models as HyGCN_networks. Also, the HyGCN report does not give absolute
performance but, rather, relative speedups over an E5-2680v3 CPU. To compare AWB-GCN
with HyGCN, we realize the HyGCN_networks on the same E5-2680v3 CPU, adopting the
same software framework (PyG [Fey and Lenssen, 2019]; HyGCN also uses PyG for the
testing on CPU). The PyG-CPU result is thus a common baseline for comparing the relative
speedups. Table 3.3 shows the results.
With the HyGCN_networks, AWB-GCN achieves on average 3888×, 25.3×, and 5.1×
speedups over PyG-CPU, PyG-GPU, and the HyGCN design, respectively. The perfor-
mance improvement is attributable, in part, to the features of AWB-GCN as discussed,
and one additional reason: HyGCN scheduling is coarse-grained block-wise, while that of
AWB-GCN is fine-grained element-wise. This avoids redundant data access and results in
more benefit from a balanced workload.
Table 3.3: Comparison with the prior art, HyGCN, processing
HyGCN_networks customized in HyGCN paper [Yan et al., 2020]. Units of
Latency and Energy efficiency are ms and graph/kJ.
Platform HyGCN_networks
[Table body not recoverable from the source; only the final row survives.]
Freq: 330MHz Energy Efficiency 4.39E5 2.71E5 3.17E4 2.28E3 1.45E2
Because this design is implemented on an FPGA, it is difficult to compare its energy
efficiency with that of HyGCN, which is implemented as an ASIC. Compared with an ASIC,
the reconfigurable routing switches on an FPGA chip consume extra energy, which lowers the
energy efficiency of an FPGA design (by approximately 14× according to the numbers reported
by Kuon and Rose [Kuon and Rose, 2007]). To compare performance fairly, we limit the number
of multipliers used so that it is comparable to HyGCN. In particular, we use 4k 32-bit
floating-point multipliers, whereas HyGCN uses 4608 32-bit fixed-point multipliers.
Floating-point multipliers also consume more energy than fixed-point ones.
3.5 Related Work
GNN studies use neural network algorithms to address problems in graph processing. The
first GNN model was proposed by Gori et al. [Gori et al., 2005]. In the past decade, work
has continued on optimizing GNN algorithms and exploring new neural network approaches
[Dai et al., 2018,Wu et al., 2020b,Micheli, 2009,Scarselli et al., 2008,You et al., 2018,
Abu-El-Haija et al., 2018, Gao et al., 2018b]. More recently, inspired by CNNs, which
achieve great success with Euclidean data, GCNs have been proposed for hidden feature
extraction from non-Euclidean data. In 2013, Bruna et al. [Bruna et al., 2014] proposed the
first GCNs based on spectral graph theory; this was developed further in a number of
variants [Henaff et al., 2015, Defferrard et al., 2016, Kipf and Welling, 2016]. GCNs are
at the center of the research on neural-network-based graph processing [Yun et al., 2019].
There have been many efforts on accelerating sparse CNNs [Kim et al., 2017, Zhang
et al., 2016b,Albericio et al., 2016,Han et al., 2016,Parashar et al., 2017,Kung et al., 2019,
Chen et al., 2017, Ding et al., 2017]. We summarize them and explain why they fall short
when applied to GCNs. Kung et al. condense the sparse parameter matrix through column
grouping [Kung et al., 2019]. In case of conflict, only the most significant parameters are
kept and the others are discarded; essentially, some accuracy is sacrificed for performance.
Kim et al. [Kim et al., 2017] address the workload imbalance problem of sparse CNNs, but use
information from design-time profiling and pre-scanning. Han et al. [Han et al., 2016]
propose EIE, an SpMV accelerator that addresses imbalance with row-direction queuing.
The design is not feasible for GCNs because of their large data size and power-law
distribution. In EIE, weight matrices of SCNNs are distributively pre-stored on-chip in
local buffers of PEs. This avoids off-chip non-zero accesses and online workload
distribution, but is not possible for GCNs. Also, single-direction queuing fails to balance
the workload of power-law matrices, which have serious imbalance in both directions. Zhang
et al. [Zhang et al., 2016b] propose Cambricon-S with efficient index matching to identify
and multiplex non-zeros and feed them to massively parallel PEs. Again, these proposed
architectures are not feasible for processing GCNs because of the extreme sparsity of
power-law graphs, which leads to highly scattered indices of neighboring elements. Given the
adjacency matrix of Nell and a 1024-PE Cambricon-S, multiplexing enough non-zero pairs to
feed all PEs per cycle would require 1024 × 13699:1 multiplexers for single-precision
floating point; this is not viable given likely chip technology.
Besides work on sparse CNNs, researchers have also proposed architectures for general
SpMM. Zhuo and Prasanna [Zhuo and Prasanna, 2005] present an SpMV design for FPGAs.
Pal et al. [Pal et al., 2018] propose an outer-product-based SpMM architecture. This
work focuses on reducing redundant memory accesses to non-zeros and does not fundamentally
address the extreme workload imbalance found in GCNs. In their results, load imbalance
during the merge phase and uneven data sharing patterns during the multiply phase lead to
degraded speedup for datasets with a highly unbalanced non-zero element distribution.
SIGMA [Qin et al., 2020] and ALRESCHA [Asgari et al., 2020] are recent high-
performance architectures for SpMM and SpMV. We mainly discuss SIGMA, as SIGMA
focuses on SpMM kernels and is equipped with more efficient optimizations for SpMM,
while ALRESCHA has higher flexibility to support various kernels through switch recon-
figuration. SIGMA uses an element-wise smart global controller to distribute every pair
of non-zeros to the proper PEs dynamically through a Benes network. By doing so, PEs
work with high utilization and the operations are evenly distributed among all multipliers
so that workload imbalance is eliminated. SIGMA is highly efficient for general SpMMs,
but needs some augmentation to work with GCNs. First, for a very large and sparse matrix,
the bitmap compression format introduces significant overhead. Second, similar
to Cambricon-S, the multiplexer required to get source/destination pairs would become
very large, which limits the performance significantly. Third, for the extremely large and
sparse matrices common in GCN usage, the efficiency of the element-wise global controller
decreases significantly when performing tasks such as matrix scanning and element
filtering/counting, which determine the number of Flex-DPEs.
To eliminate workload imbalance without using a global element-wise controller (as
used in SIGMA), AWB-GCN uses auto-tuning-based rebalancing hardware, which is
essentially also a “controller”, to dynamically distribute tasks. In contrast to the
controller used in SIGMA, AWB-GCN’s is more coarse-grained and lighter weight. In particular,
the distribution smoothing function is a local element-wise controller which, similarly to
SIGMA, distributes non-zero pairs to proper PEs. However, in contrast to SIGMA, in
AWB-GCN the destination PEs must be local, meaning that they must be within a few hops
of the PE assigned in the initial mapping. Also, remote switching combined with row
remapping is effectively a global row-wise controller: it can distribute tasks to any proper
PE without range limit, but at the granularity of rows or fractions of rows rather than
elements. To make the proposed hybrid and light-weight controller handle workload imbalance
as well as a global element-wise controller, we use auto-tuning. The proposed
auto-tuning-based controller is designed especially for SpMMs with power-law matrices. For
general SpMMs, a global element-wise controller can be more efficient.
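The difference between local and global distribution can be sketched in software. The snippet below is only an analogy for the local element-wise behavior described above, not the AWB-GCN hardware: it assumes PEs arranged in a one-dimensional array, and the name `smooth_locally` and the example loads are hypothetical.

```python
def smooth_locally(initial_loads, hops):
    """Hand each task to the least-loaded PE within `hops` of its initially assigned PE.

    Assumes a 1-D array of PEs; this is an illustrative model, not the hardware design.
    """
    n = len(initial_loads)
    loads = [0] * n
    for pe, count in enumerate(initial_loads):
        lo, hi = max(0, pe - hops), min(n - 1, pe + hops)
        for _ in range(count):
            target = min(range(lo, hi + 1), key=lambda p: loads[p])
            loads[target] += 1
    return loads


initial = [12, 0, 0, 0, 1, 1, 1, 1]        # hypothetical tasks per PE after the initial mapping
smoothed = smooth_locally(initial, hops=1)  # PE 0 can only share with PE 1
```

With `hops=1` the peak load halves, but it stays well above the ideal of two tasks per PE. This kind of residual imbalance is what a global, row-granularity mechanism is intended to address.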
Another active area of research is graph processing. Song et al. [Song et al., 2018]
and Zhang et al. [Zhang et al., 2018] propose GraphR and GraphP, which are both based
on Processing In Memory (PIM), to accelerate low-precision graph tasks. However, they
do not support complex floating-point operations. Ham et al. [Ham et al., 2016] propose
Graphicionado, a vertex-centric acceleration framework for graph analytics applications.
It focuses on simple graph analysis applications. Ozdal et al. [Ozdal et al., 2016] propose
a SystemC-based template for graph analytics applications. Ozdal’s work and Graphicionado
both use crossbars for data exchange, which limits their scalability. None of these
can directly support GCNs without significant modifications.
Researchers also conduct software optimizations for SpMM on GPUs and general-
purpose multicore CPUs [Greathouse and Daga, 2014, Liu and Vinter, 2014, Ashari et al.,
2014,Bell and Garland, 2008,Bell and Garland, 2009]. These software solutions, however,
do not meet the strict timing requirements of GCNs because of significant overhead in pre-
scanning [Greathouse and Daga, 2014, Liu and Vinter, 2014, Ashari et al., 2014, Yan et al.,
2020] which is avoided in AWB-GCN. Also, adjacency matrices evolve at runtime, making
offline processing even less useful.
3.6 Conclusion
In this chapter, we propose AWB-GCN to accelerate GCN inference. To tackle the major
performance issues derived from DSL irregularity and workload imbalance, we propose
a hardware-based workload distribution autotuning framework including three runtime
workload rebalancing techniques: distribution smoothing, remote switching, and row
remapping. The proposed rebalancing methods rely on hardware flexibility to realize
performance autotuning with negligible area and delay overhead. This is the first architecture
that relies on hardware autotuning to achieve workload rebalancing for sparse matrix
computations. We evaluate AWB-GCN using an Intel FPGA D5005 Accelerator Card with five
widely used GCN datasets. Results show that AWB-GCN can achieve, on average, 3255×,
80.3×, and 5.1× speedups over high-end CPUs, GPUs, and other prior work, respectively.
Although AWB-GCN is designed for GNNs [Hamilton et al., 2017, Xu et al., 2018, Yun
et al., 2019], it can generally handle the DSL irregularities efficiently in all types of NNs
whose major computation kernel is SpMM.
Chapter 4
LP-BNN: Ultra-low-Latency BNN Inference
with Layer Parallelism
This chapter introduces LP-BNN (Layer-Parallelism-based BNN), a precursor to the
O3BNN-R architecture, which is discussed in Chapter 5. LP-BNN is a high-performance
accelerator for BNNs. Unlike O3BNN-R, which is mainly proposed for embedded
FPGAs, LP-BNN can also be deployed efficiently on high-performance, large-scale FPGAs.
However, unlike O3BNN-R, LP-BNN does not support the detection and pruning of
superfluous operations. Our experience with the implementation of LP-BNN lays a good
foundation for the subsequent design of O3BNN-R. This chapter is based on the work
published in the 30th International Conference on Application-specific Systems, Architectures
and Processors (ASAP) ©2019 IEEE [Geng et al., 2019a].
4.1 Introduction
The past decade has witnessed the emergence and widespread adoption of Deep Neural
Networks (DNNs), not only in image processing and speech recognition but also in High
Performance Computing domains such as extreme big-data processing for in-situ analysis
[Geng et al., 2018,Geng et al., 2016,Geng et al., 2018a]. To continue to improve prediction
accuracy, DNNs are becoming ever deeper and more complex, leading to increasingly long
processing latency and high resource demands. A large number of accelerator designs
have been proposed for DNN inference [Geng et al., 2018]. However, these designs have
achieved only limited benefit in real-time domains with strict latency constraints such as
autonomous driving and robotic control.
Since DNNs can often tolerate some inaccuracy, researchers have begun to explore re-
duced bit-width for DNN training and inference [Tang et al., 2017] [Zhou et al., 2016]
[Geng et al., 2019b]; Binarized-Neural-Networks (BNNs) [Rastegari et al., 2016] in par-
ticular have gained much attention. BNNs use a single bit to encode each neuron and pa-
rameter, thus significantly reducing computational complexity and memory demand, and
potentially reducing inference latency by orders-of-magnitude. BNNs map particularly
well to FPGAs whose configurability enables millions of one-bit ALUs to be implemented
on a single device [Wang et al., 2020] [George et al., 2016]. In contrast, a recent report
from Intel [Nurvitadhi et al., 2016] indicates that only 10% and 7% of peak performance can
be achieved when running a BNN on a Xeon CPU and a Titan X GPU, respectively (with a
batch size of 10).
Despite the attractiveness of mapping BNNs to FPGAs, creating an efficient design that
minimizes the inference latency of large networks with big data inputs still poses a great
challenge. To the best of our knowledge, the shortest reported latency of AlexNet
inference is around 1.16ms [Liang et al., 2018], which is still far from ideal for real-time
scenarios. This delay is mainly due to three factors:
1) The critical Normalization Layer (NL) [Rastegari et al., 2016] [Courbariaux et al., 2016]
uses full-precision floating point operations, i.e., two FP MUL/DIV and three FP ADD/SUB.
These operations and their parameters incur significant latency along with large
storage demands compared with the other sub-layers. Although some researchers have tried to
simplify the NL function by using threshold-based comparison, they did not apply their
approach to large-scale datasets such as ImageNet, and their approach does not apply to
networks with more complex structure, e.g., ResNet [Umuroglu et al., 2017].
2) Previous designs accelerate BNNs by exploiting data parallelism with layers processed
sequentially. Hence, the overall latency is the accumulated latency of each layer, plus the
communication and reconfiguration overheads. As a result, latency expands with network
depth. And since a layer cannot start processing until the previous layer has finished, large
on-chip storage is required to buffer all of the intermediate data between layers.
3) To efficiently process a large BNN in a single FPGA, each layer must (1) be optimally
designed and (2) avoid reconfiguration. To the best of our knowledge, no existing design
does both.
Our main contribution is to address these three challenges and demonstrate a
single-FPGA design that guarantees µs-level inference latency for ImageNet. We achieve this
performance via the following innovations. First, since BNNs remain computationally
intensive (because of the floating point in the BNs), we optimize the structure of the BNN
by fusing the BNs and several other sub-functions (Activation and Binarization). Our
optimization can be applied not only to conventional CNNs, e.g., AlexNet and VggNet, but
also to CNNs with shortcuts, e.g., ResNet. Second, we fuse all of the convolution layers and
the first fully-connected (FC) layer. We process them in parallel through fine-grained
inter-layer pipelining. Therefore, the latencies of different layers are overlapped to the
point that the overall latency of the inference is just the latency of a single convolution
layer plus the two FC layers. Note that the dependency pattern of the last two FC layers
prevents them from being fused with the others. Third, unlike the majority of previous
FPGA-based designs, which use data parallelism, we propose a parameterized architecture based
on layer parallelism. This design supports nearly perfect load balancing for various types
of BNNs, leading to nearly 100% pipeline utilization. We also propose a Design Space
Exploration (DSE) method to determine the values of the parameters used in the proposed
architecture. The single-FPGA design achieves 21.5, 335, and 67.8µs inference latency for
Binarized AlexNet, VggNet, and ResNet-18 on ImageNet, respectively.
4.2 Related Work
There is much previous work in building high-performance BNNs to meet the requirements
of real-time inference for delay-critical applications [Nurvitadhi et al., 2016, Umuroglu
et al., 2017, Zhao et al., 2017b, Liang et al., 2018, Hu et al., 2018]. Because of FP-
GAs’ flexibility and powerful bit-manipulation capability, the majority have been FPGA-
based [Nurvitadhi et al., 2016, Umuroglu et al., 2017, Zhao et al., 2017b, Liang et al.,
2018]. Recently, a CPU-based BNN implementation was proposed [Hu et al., 2018] which
uses bit-packing and AVX/SSE vector instructions to achieve good bit computation per-
formance. However, in this design, Binarized CONV (BConv) is not supported directly;
rather, BConv is converted to Binarized Matrix Multiplication (BMM) through the conven-
tional flatten or unfold approach with expensive pre/post-processing for padding.
We now expand on the FPGA/BNN prior art. FINN [Umuroglu et al., 2017] presents a
framework for building fast and flexible FPGA accelerators using a flexible heterogeneous
streaming architecture. This design can perform millions of classifications per second with
sub-microsecond latency, making it ideal for supporting real-time embedded
applications. However, FINN only demonstrates µs-level-latency inference for small-scale
datasets and networks, such as MNIST and CIFAR-10, and only supports networks without
shortcuts, i.e., ResNet is not supported. In this work, our accelerator provides
µs-level-latency inference on large datasets such as ImageNet, which are widely deployed in
real-life applications. Also, our design supports networks with branches, e.g., ResNet. As
mentioned in Chapter 2, the BN function in BNNs uses single-precision FP, while the
Binarization function uses 0/1 to represent the final results, which leads to a significant
precision gap between these two consecutive functions. The authors of FINN observe this
precision gap and eliminate it by simplifying BN without hurting the accuracy of the network.
In FINN, the original expensive BN functions are replaced by threshold-based comparisons.
By doing so, data from the POPCOUNT layer is binarized directly by comparison with
a threshold, leading to potentially much lower latency and less hardware demand. However,
this optimization proposed in FINN can only be applied to older networks
whose topologies are straightforward and have no branches, bypasses, or shortcuts. For
modern and particularly important networks, such as ResNet, where intermediate results
of different layers need to be added up before further batch normalization, FINN’s
optimization does not fit. ReBNet is another well-known BNN implementation [Ghasemzadeh
et al., 2018]. It provides an end-to-end framework for training reconfigurable binary
neural networks in software and developing an efficient accelerator for execution on FPGAs.
ReBNet improves classification accuracy by representing features with multiple levels of
residual binarization. Their design supports ResNet and large-scale datasets; however,
ReBNet keeps the expensive BN function of the original ResNet, which leads to significant
latency overhead. In our design, we remove all BN functions in ResNet, thereby
providing significantly lower-latency inference.
Furthermore, in most previous work, the network is processed layer by layer, which is
similar to how conventional CNNs are processed and leads to large inference latency. The
small model size of BNNs creates the potential to process all layers simultaneously on a
single FPGA, i.e., layer parallelism. However, there is one catch: pipeline utilization is
low unless the workload is carefully balanced across layers, and such balancing has not been
quantitatively addressed in current designs.
Although the majority of recent works focus mainly on the optimization of individual
layers, we believe that inter-layer and whole-network optimization are even more
important. For example, TensorFlow XLA and Tensor Comprehensions [Vasilache et al., 2018]
recently compiled entire neural network graphs at once, performing various
transformations and achieving a 4× speedup over manually tuned individual layers. Our design
follows this emerging trend by merging all the layers, except the last two FC layers, into a
fused layer.
4.3 Network Structure Optimization
This section introduces the optimization approach applied in our work that removes all BN
functions in BNNs. ResNet is used as a motivating example, as its topology includes
not only the shortcuts widely used in modern networks but also the classical topology of
older BNNs, such as AlexNet.
In ResNet, two neighboring CONV layers have different structures, while every pair of
adjacent layers shares the same topology. Within each pair, one layer is the same as the
CONV layer used in AlexNet (layer2 in Figure 4·1), and the other is equipped with
shortcuts (layer1/layer3 in Figure 4·1). Figure 4·1 (A) illustrates 3 adjacent CONV layers
in ResNet. The BN outputs of the 1st CONV layer, NORM1, are (1) binarized and then
fed to the 2nd CONV layer and (2) bypassed to the 3rd layer. The 2nd layer is the same as the
CONV layer in binarized AlexNet. At the 3rd layer, the outputs of POPCOUNT, POP3,
are not directly normalized; instead, they are first summed with the bypassed NORM1.
As mentioned in Section 4.2, using FINN’s optimization, the BN function of layer 2 can
be easily replaced by a threshold-based comparison, while the ones in layers 1 and 3 cannot
be removed. In this work, we propose an optimization that replaces the expensive and
complex BNs at layers 1 and 3, along with the summation in front of them, with a
threshold-based comparison and a Threshold Look-up (TL). As shown in Figure 4·1 (B), the
comparison is performed sequentially after POP3 is calculated, while TL can be performed in
parallel with the inference of layers 2 and 3. By doing so, the serious latency overhead
incurred by the FP-based summation and the operations of the BN functions is significantly
reduced.
For layers without shortcuts (e.g., layer2), each output channel has its own constant
threshold. The thresholds can be calculated based on Equation 4.1. L refers to the length
of the vector counted in POPCOUNT, i.e., K × K × IC (K: filter size; IC: number of
input channels). The POPCOUNT results are binarized by comparing them with the
corresponding threshold.
Figure 4·1: The original and simplified structures of Binary-ResNet.
\mathrm{Threshold}_{i,j,k} = \frac{E_{*,j,k} + L_{j,k}}{2} - \frac{\beta_{j,k} \cdot \sqrt{\mathrm{Var}[x_{*,j,k}] + \varepsilon}}{2 \cdot \gamma_{j,k}}    (4.1)
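As a sanity check on Equation 4.1, the sketch below compares the threshold test with an explicit BN followed by sign binarization for one output channel. It rests on assumptions not stated in the text: the mean E and the variance are taken over the ±1 dot product x = 2·popcount − L, the output bit is 1 when the BN result is non-negative, and γ > 0. The function names and parameter values are hypothetical.

```python
import math


def bn_threshold(mean, var, gamma, beta, length, eps=1e-5):
    """Popcount threshold of Equation 4.1 for one output channel (assumes gamma > 0)."""
    return (mean + length) / 2 - beta * math.sqrt(var + eps) / (2 * gamma)


def bn_then_binarize(popcount, mean, var, gamma, beta, length, eps=1e-5):
    """Reference path: popcount -> +/-1 dot product -> batch normalization -> sign."""
    x = 2 * popcount - length                     # assumed relation between popcount and x
    y = gamma * (x - mean) / math.sqrt(var + eps) + beta
    return 1 if y >= 0 else 0


L = 27                                            # e.g. K = 3, IC = 3 (hypothetical sizes)
params = dict(mean=1.5, var=4.0, gamma=0.8, beta=0.3, length=L)
threshold = bn_threshold(**params)
fast = [1 if p >= threshold else 0 for p in range(L + 1)]
reference = [bn_then_binarize(p, **params) for p in range(L + 1)]
```

Under these assumptions both paths give the same bit for every possible popcount, so the floating-point BN can be folded into one constant per channel that is computed offline.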
For layers with shortcuts (e.g., layer3), the calculation of the threshold is more complex,
as the threshold of each channel depends not only on the layer under calculation but
also on the layer feeding the bypassed data. In our optimization, when POP3 is calculated, it
is binarized by comparing it with a threshold that TL calculates based on Equations 4.2, 4.3,
and 4.4. Here, $x_{i,j,k-2}$ are the popcount results that are the input of TL. Each output
channel has its own constants $\alpha_{j,k}$ and $\theta_{j,k}$, whose values are determined
during the training phase.






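[Equations 4.2 to 4.4 are not recoverable from the source; the only surviving fragment is:]
\frac{E_{*,j,k} + L_{j,k}}{2} - \beta_{j,k} \cdot \sqrt{\mathrm{Var}[x_{*,j,k}] + \varepsilon} \; \ldots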
Figure 4·2: Portion of execution time for different functions. 4 pairs of
adjacent CONV layers in ImageNet ResNet-18, 6 CONV layers and 2 FC
layers in CIFAR VGG, and 13 CONV layers and 3 FC layers in ImageNet VGG
are measured. For each layer, a pair of bars is given in this chart: the left
bar is for the original network and the right bar for the optimized network.
The execution time of each function is divided by the execution time of the
whole layer with the original network.
Using our optimization, every CONV/FC layer in a BNN has the following 3 functions:
XNOR, POPCOUNT, and Threshold-based Comparison. For layers with shortcuts,
one extra function, TL, is added and operates in parallel with the other 3 functions. The
proposed optimization of the network structure provides significant potential to reduce the
inference latency of BNNs.
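The three functions can be written out for a single output value as a short sketch. The bit packing into Python integers, the encoding (bit 1 for +1, bit 0 for −1), and the names `binary_neuron` and `signed_dot` are illustrative assumptions rather than the hardware design.

```python
def binary_neuron(activations, weights, length, threshold):
    """XNOR, POPCOUNT, and threshold comparison on `length` bits packed into ints."""
    mask = (1 << length) - 1
    xnor = ~(activations ^ weights) & mask       # 1 wherever the two bits agree
    popcount = bin(xnor).count("1")              # number of agreeing positions
    return popcount, (1 if popcount >= threshold else 0)


def signed_dot(activations, weights, length):
    """Reference: the same dot product with bits decoded to +1/-1."""
    def decode(value, i):
        return 1 if (value >> i) & 1 else -1
    return sum(decode(activations, i) * decode(weights, i) for i in range(length))


a, w, n = 0b10110010, 0b11100110, 8              # hypothetical 8-bit operands
pop, out = binary_neuron(a, w, n, threshold=5)
```

The popcount relates to the ±1 dot product by `signed_dot = 2 * pop - n`, which is why a comparison on the popcount alone can stand in for the arithmetic that follows it.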
To evaluate the benefits of the proposed optimization, we use ImageNet ResNet-18,
CIFAR-10 VGG-like, and ImageNet VGGNet as motivating examples to show the latency
reductions of the CONV and FC layers in these 3 networks when the proposed
optimization is applied. We also measure the percentage attributed to each function. Here we
assume that all functions in a single layer are processed in a pipelined manner and that data
parallelism is fully exploited. Figure 4·2 demonstrates that, without the proposed
optimization, on average, 60% of the execution time is spent on processing the workload of
BN, while only 40% of the time is for XNOR, POPCOUNT, and the others. With the proposed
optimization, the latency of single-layer processing is reduced, on average, to only 30% of
the original. For ResNet,
the proposed optimization brings even more latency reduction because it gets rid of the




We construct fine-grained intra- and inter-layer pipelined execution by fusing all the CONV
layers and the first FC layer, leading to effective overlapping among layers for latency
reduction. As a result, all layers except the last two FC layers are processed in parallel and
the overall latency is equal to that of a single layer plus time waiting for dependent inputs.
Figure 4·3: Data dependency of BNN and the proposed fine-grained inter-
layer pipelining for layer fusion
Figure 4·3(A) depicts the data dependency of the BNN using a 3-layer network con-
sisting of two 2×2 CONV layers and one FC layer as an example. To calculate each gray
pixel in layer 2, all gray pixels from layer 1 are required. All the gray pixels at layer 2 share
the same data dependency. To calculate each pixel covered by the brown cuboid in layer 3,
all pixels covered by brown cuboids from layers 2 and 1 are required. To calculate a pixel
of the FC output, every pixel in the third layer is weighted and then accumulated; thus,
any pixel in layer 3 can be consumed immediately after it is produced. The latency of
calculating pixels in layer 3 determines the overall latency of this 3-layer BNN. To calculate
pixels in layer 3 with the shortest latency, and to make data propagation faster with fewer
intermediate results buffered between layers, pixels that share the same data dependency
(i.e., the pixels covered by the cuboids with the same color at layers 1 and 2) should
be calculated as soon as all the data they depend on are ready, instead of waiting for the
completion of the previous layer.
The symbols used in the following sections are defined as follows: IC/OC are the
number of input/output channels; K is the filter kernel size; P is the pooling kernel size and
W is the dimension of a feature map.
Intra-layer Pipeline: In order to calculate all the dependent data of an output pixel,
for CONV layers, pixels with the same coordinate and different output channels are ideally
calculated in parallel; pixels with the succeeding coordinates are processed sequentially in
pipeline in the left-right-down direction. To calculate OC pixels with a certain coordinate at
all output channels, IC×OC×K×K XNOR operations are needed. However, in the case
that there are not enough hardware resources to process all these operations in parallel,
maximum possible parallelism should still be obtained. We do this as follows:
(1) All input channels are partitioned into SIC segments, each with PIC input channels. The
input channels in each segment are processed in parallel, while different segments are pro-
cessed sequentially. In each iteration, partial results of all output channels are calculated.
After SIC iterations, complete output features are computed.
(2) All output channels are partitioned into SOC segments, each with POC output channels.
At each iteration, complete outputs of POC output channels are produced. After SOC
iterations, all the output channels are completely processed.
(3) At each iteration, outputs of POC channels are partially calculated using inputs from
PIC input channels. All output channels are completely processed in SIC×SOC iterations.
By tuning these 4 parameters, parallelism can be adjusted not only to fit reduced hardware
resources but also to balance the workload. The workload balancing scheme is discussed
in Section 4.4.2. Pseudo code is given in Listing 4.1.
for ho in Image.Height do
  for wo in Image.width do
    for scin in SIC do
      for scout in SOC do
        for pcout in POC do
          for pcin in PIC do
            for kh in kernel.height do
              [...]
Listing 4.1: Pseudo code for intra-layer pipelining
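The partitioning in steps (1) to (3) can be checked with a small software sketch. It follows the loop order of the listing down to the kernel-height loop; the innermost kernel-width loop and the XNOR accumulation are assumptions added here, as are the function names and tensor sizes. In hardware the `pcout` and `pcin` loops would be unrolled, whereas here they simply run sequentially.

```python
import random


def bconv_reference(inp, wgt, K):
    """Untiled binary convolution: popcount of XNOR over IC x K x K bits per output."""
    IC, H, W = len(inp), len(inp[0]), len(inp[0][0])
    return [[[sum(1 - (inp[ic][ho + kh][wo + kw] ^ wgt[oc][ic][kh][kw])
                  for ic in range(IC) for kh in range(K) for kw in range(K))
              for wo in range(W - K + 1)]
             for ho in range(H - K + 1)]
            for oc in range(len(wgt))]


def bconv_tiled(inp, wgt, K, PIC, POC):
    """Same result, computed segment by segment with SIC x SOC iterations per pixel."""
    IC, H, W = len(inp), len(inp[0]), len(inp[0][0])
    OC = len(wgt)
    assert IC % PIC == 0 and OC % POC == 0       # assumed: segments divide evenly
    SIC, SOC = IC // PIC, OC // POC
    out = [[[0] * (W - K + 1) for _ in range(H - K + 1)] for _ in range(OC)]
    for ho in range(H - K + 1):
        for wo in range(W - K + 1):
            for scin in range(SIC):              # input segments, processed sequentially
                for scout in range(SOC):         # output segments, processed sequentially
                    for pcout in range(POC):     # parallel in hardware
                        oc = scout * POC + pcout
                        for pcin in range(PIC):  # parallel in hardware
                            ic = scin * PIC + pcin
                            for kh in range(K):
                                for kw in range(K):   # assumed innermost loop
                                    a = inp[ic][ho + kh][wo + kw]
                                    b = wgt[oc][ic][kh][kw]
                                    out[oc][ho][wo] += 1 - (a ^ b)   # XNOR, then count
    return out


random.seed(0)
IC, OC, K, H = 4, 6, 3, 5                        # hypothetical layer dimensions
inp = [[[random.randint(0, 1) for _ in range(H)] for _ in range(H)] for _ in range(IC)]
wgt = [[[[random.randint(0, 1) for _ in range(K)] for _ in range(K)]
        for _ in range(IC)] for _ in range(OC)]
result = bconv_tiled(inp, wgt, K, PIC=2, POC=3)
```

Changing `PIC` and `POC` changes only how the work is grouped into iterations, not the popcounts, which is what allows these parameters to be tuned freely for resources and balance.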
Inter-layer Pipeline: Using the proposed intra-layer pipeline, the data on which a given
set of output features depends are produced quickly. To propagate these output
features faster and so reduce overall latency, operations must begin processing as soon as all
the data they depend on are ready. To this end, a fine-grained inter-layer pipelining scheme
is proposed. Using this pipelining, all the CONV layers and the first FC layer are fused and
processed in parallel.
Taking the 3-layer BNN illustrated in Figure 4·3(A) as an example, when all the gray pixels
at the first layer are ready, the gray pixels in the second layer are computed. Afterward, the
kernel window shifts to the right and then down at the first layer. Immediately after the last
data covered by the brown rectangles are calculated at the first layer, the last data covered by
the brown cuboids at layer 2 is calculated. Then, the pixels covered by the brown cuboids
at layer 3 are calculated. At this point, using the data at layer 3, partial results of the FC
layer are computed. The set of cuboids in the same color indicate the data dependency
from the first CONV layer to the first FC layer. When the data at layer 1 covered by the
green cuboid set has been processed completely, the window slides to the right and reaches
the red one. At the same time, the bottom-right features of the green cuboid at layer 2 are
calculated. Then, data covered by the red cuboid at layer 2 starts processing and the pixels
in the green cuboid at layer 3 are calculated. In other words, different layers are fused and
processed in parallel and their latencies are overlapped.
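The latency benefit of starting each pixel as soon as its inputs are ready can be estimated with a simplified timing model. The model is an assumption made for illustration and not a description of the actual hardware: every layer emits at most one pixel per cycle in raster order, feature maps are square, and convolutions use stride 1 without padding or pooling.

```python
def fused_finish_time(width, kernels):
    """Cycle at which the last pixel of the last layer is produced under inter-layer pipelining."""
    # ready[r][c]: cycle at which pixel (r, c) of the current feature map becomes available.
    ready = [[r * width + c + 1 for c in range(width)] for r in range(width)]
    for k in kernels:
        out_w = len(ready) - k + 1
        out = [[0] * out_w for _ in range(out_w)]
        prev = 0
        for r in range(out_w):
            for c in range(out_w):
                # The bottom-right input of the window is the last one to arrive in raster order.
                start = max(prev, ready[r + k - 1][c + k - 1])
                prev = start + 1
                out[r][c] = prev
        ready = out
    return ready[-1][-1]


def sequential_finish_time(width, kernels):
    """Same per-pixel cost, but each layer waits for the previous layer to finish."""
    total = width * width
    for k in kernels:
        width = width - k + 1
        total += width * width
    return total


fused = fused_finish_time(6, [2, 2])             # 6x6 input, two 2x2 CONV layers (hypothetical)
sequential = sequential_finish_time(6, [2, 2])
```

In this toy model the fused pipeline finishes two cycles after the last input pixel arrives, while layer-by-layer processing pays the sum of all layer times.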
Figures 4·3(B) and (C) show, respectively, inter-layer pipelining and the traditional
pipelining used in data parallelism. Our pipelining can significantly reduce the process-
ing latency and the storage demand because the activations are propagated and consumed
quickly between layers. Figure 4·4 shows the latencies of each layer and of the whole
VGGNet. The resulting overall latency is only 1.42x that of a single layer.
Figure 4·4: Latencies of executing each layer of VGGNet inference and the
overall latency of the whole VGGNet when using inter-layer fusion
Using this pipelining approach, high utilization requires that the data
production rate at a layer match the data consumption rate at the following layer. In the
next subsection, we describe our hardware-based workload balancing strategy.
4.4.2 FPGA Architecture
We use layer parallelism to accelerate the inference of BNNs. That is, all layers are config-
ured simultaneously on one FPGA and processed in a pipeline as described in the previous
section. For each layer, the optimal architecture is deployed with no reconfiguration.
The proposed architecture is parameterized. Four main parameters control the architec-
ture configuration: SIC, PIC, SOC and POC. Two goals are achieved by tuning the par-
allelism of each layer. First, modules of all layers are allocated with balanced workloads:
data production and consumption rates between consecutive layers are matched. Second,
the hardware resources can be adjusted according to the available on-chip resources.
Figure 4·5: Parameterized architecture of a single CONV layer: SOC, SIC,
POC, and PIC can be tuned for workload balancing and to reduce the
parallelism
Design of the CONV Layer
Figure 4·5 illustrates the design of a BCONV layer. Data of all input channels from the
previous layer are buffered in the Shared Input Data Shift Register (SIDSR). SIDSR is
composed of K sets of BRAM-based FIFOs. Each set has FIFOs providing data access
capability of PIC bits/cycle. To gain higher concurrency for the BRAM-based FIFOs,
data from 32 different channels are packed and stored as a word. Thus, each set needs
PIC/32 BRAM blocks. The data layout of each FIFO is shown in the upper left corner
of Figure 4·5. Data from SIC segments of input channels that are processed sequentially
are stored in an interleaved order. Each FIFO buffers an entire row of the feature map.
Therefore, W × SIC× 32 input channels are stored in a FIFO. K sets of FIFOs are linked
in a head-to-tail manner to support a sliding kernel window and input data reuse. When
enough data from the output channels of a certain layer are ready in SIDSR, the next layer
starts being processed and PIC×K data are broadcast to POC PEs.
Each PE has PIC XNOR engines. Each engine consists of K² XNOR gates, a POPCOUNT
engine, an accumulator, and a comparator. The comparator is coupled with a
local Threshold Buffer (TB) built with distributed RAMs. All PEs work in lockstep under
a control unit. PEs gather weights from the shared weight memory. To saturate the POC
PEs, POC×PIC×K×K weights must be accessed in parallel. We use a hybrid of BRAMs
and Distributed-RAMs to build the Shared Weight Memory (SWM) for buffering weights.
After weights are read from the head of the SWM FIFOs, they are fed into the PEs
and written back to the SWM at the tail of the FIFOs. As with the SIDSR, to achieve sufficient
concurrency when BRAM is used to build the SWM, weights for 32 channels are packed
and stored as a word in the SWM.
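As a rough software analogue of one XNOR engine described above, the following sketch packs binary channel values into a 32-bit word, applies XNOR, counts the matching bits, accumulates, and compares the result with a threshold. The function names (`pack_bits`, `xnor_popcount`, `pe_output`) and the `>=` comparison are illustrative assumptions; they are not taken from the hardware design.

```python
WORD_BITS = 32  # the text packs data from 32 channels into one word


def pack_bits(bits):
    """Pack up to 32 binary values (0/1) into one integer, bit i = bits[i]."""
    assert len(bits) <= WORD_BITS
    word = 0
    for i, b in enumerate(bits):
        word |= (b & 1) << i
    return word


def xnor_popcount(a_word, w_word, valid_bits=WORD_BITS):
    """Count the bit positions where activation and weight agree (XNOR = 1)."""
    valid_mask = (1 << valid_bits) - 1
    xnor = ~(a_word ^ w_word) & valid_mask
    return bin(xnor).count("1")


def pe_output(activation_words, weight_words, threshold):
    """Accumulate XNOR-popcounts over all words, then binarize by comparison.

    The >= comparison is an assumption made for this illustration.
    """
    acc = 0
    for a, w in zip(activation_words, weight_words):
        acc += xnor_popcount(a, w)
    return 1 if acc >= threshold else 0
```

In hardware the K² XNOR gates of an engine operate in one cycle; the loop here only models the arithmetic, not the timing.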
If there is a pooling layer, the outputs of PEs are directed to two data paths depending on
whether or not this output is the first element in a row. If so, it is buffered in the Horizontal
Pooling Buffer (HPB); otherwise, it is compared with data that are already cached in the
HPB. The comparison result is buffered in the Vertical Pooling Buffer (VPB) if it falls into
the first row of the pooling kernel. Otherwise, it is compared with the data already cached
in the VPB. The comparison outcome is used as the output of the layer. The data layout of
the pooling buffer is similar to SIDSR; the output data from 32 output channels are packed
before being buffered in the HPB. Again, the outputs are stored in an interleaved order.
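The two-stage pooling path can be sketched in software as follows. This is a minimal model that assumes a 2×2 pooling window, a feature map with even height and width, and values arriving in raster order; the variable names `hpb` and `vpb` mirror the buffers in the text, while the function name is hypothetical.

```python
def stream_max_pool_2x2(rows):
    """Max-pool a binary feature map delivered in raster order.

    hpb holds the first element of the current horizontal pair; vpb holds one
    horizontally pooled value per output column for the first row of each
    pooling window.
    """
    width = len(rows[0])
    vpb = [0] * (width // 2)
    out = []
    for r, row in enumerate(rows):
        out_row = []
        hpb = 0
        for c, value in enumerate(row):
            if c % 2 == 0:
                hpb = value  # first element of the window row: buffer it
                continue
            h_max = max(hpb, value)  # horizontal comparison
            if r % 2 == 0:
                vpb[c // 2] = h_max  # first window row: buffer it
            else:
                out_row.append(max(vpb[c // 2], h_max))  # vertical comparison
        if r % 2 == 1:
            out.append(out_row)
    return out
```

For binary data the `max` operations reduce to a logical OR, which is why the hardware needs only a comparison against the cached value.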
The Pooling Buffers, SWM, and SIDSR are implemented with a combination of
BRAMs and distributed RAMs. The choice is determined by comparing the required depth
of each memory bank to a threshold. We propose a DSE strategy to decide the threshold
and the other parameters, i.e., POC and PIC, for optimal resource utilization. This DSE
strategy is discussed at the end of this section.
Threshold Lookup (TL) in CONV layers
For CONV layers of ResNets (layers connected by shortcuts), TL modules are used instead
of the TB. As mentioned in Section 4.3, looking up each threshold requires 2 operations
(a multiplication and an addition). In the TL module, each lookup engine has a multiply-adder. For a
given layer, the number of lookup engines is the same as the number of PEs, i.e., POC.
Design of the FC layer
The design of the FC layer is similar to that of CONV, except that: (1) There is no data reuse
since each input channel has its own weight rather than sharing the same filter window. (2)
No Pooling Engine is required. (3) Weights are accessed from DDR instead of SWM.
Support of Workload Balancing
As all the layers are fully configured on the FPGA chip and layers are processed in a
fine-grained pipeline manner, the workload of different layers must be balanced to achieve
maximum hardware utilization. In other words, the output production rate of a layer must
be the same as the input consumption rate for the next layer. We achieve this goal by
adjusting the parameters PIC and POC (or SIC and PIC) of each layer. The parallelism of
a layer is defined in Equations (4.5) and (4.6). In BNNs, the same padding strategy is adopted
for high accuracy; thus $\frac{(W^{[l-1]})^2}{(W^{[l]})^2}$ is equal to $(P^{[l-1]})^2$.





Compared with the hardware resource utilization without workload balancing, the proposed
strategy requires only 3.8% of the LUTs, 6.0% of the FFs, and 0.19% of the BRAMs to
complete VGGNet inference with the same latency. Table 4.1 lists the utilizations with and
without workload balancing for VGGNet inference using the KCU1500 as the platform.
The KCU1500 is a Xilinx acceleration kit with a 20nm Kintex UltraScale XCKU115
FPGA. The kit has 16 GB of DDR4 DRAM and communicates with the host over PCIe Gen3
x16.
Table 4.1: Resource usage of LUTs, FFs, BRAMs, and DSPs for VGG
inference with/without workload balancing (same latency)




















Without orchestrated resource utilization, a large BNN can hardly fit into a
single FPGA. Figure 4·6 shows the different hardware utilization options resulting from
combining various POC and PIC with fixed parallelism for a CONV layer.
The utilization of the entire system can be described by the vector
$\vec{U} = (LUT, FF, BRAM)$, where
$$\vec{U} = f(K, W, P, SIC, PIC, SOC, POC) \tag{4.7}$$
For a certain BNN layer, some of these parameters are fixed, including K, W, P, IC, and
OC. We set PIC,POC as independent variables, so that:
$$\vec{U} = f(x, y), \quad x = PIC \in [1, IC], \quad y = POC \in [1, OC] \tag{4.8}$$
As a result, the overall utilization, including conv-engine, popcount, comparators, pooling,
line buffer, and weight buffer, becomes:
$$\vec{U} = \vec{U}_{conv} + \vec{U}_{pop\&comp} + \vec{U}_{pool} + \vec{U}_{lb} + \vec{U}_{win} \tag{4.9}$$
Since the binary conv-engine and comparator modules are built with distributed LUTs/FFs,
we have
$$\vec{U}_{conv} + \vec{U}_{pop\&comp} = x \cdot y \cdot \vec{v}_1 + y \cdot \vec{v}_2 \tag{4.10}$$
The pooling, line buffer, and weight buffer modules all have two implementation choices:
distributed LUTs/FFs or hard Block RAMs (BRAMs), depending on whether the depth is
less than a threshold θ:
~Upool =
{










$p_{lut}(x, y)$, if $OC \cdot IC \cdot \frac{1}{x \cdot y} < \theta$
$p_{bram}(x, y)$, otherwise
(4.13)
As a result, we only need to obtain optimal x,y and θ. Figure 4·6(A) shows the workflow.
θ is initialized as θ0.
Step 1: Based on the FPGA resource constraints, calculate the system’s overall parallelism
(i.e., x · y).
Step 2: Compute the input/output channel parallelism for optimal x and y. Figure 4·6(B)
shows the utilization change with respect to x under a fixed x · y for a BConv layer. Note
that the optimal point lies in the middle of the curve.
Step 3: Adjust the ratio between distributed LUTs/FFs and BRAMs for the optimal θ.
Step 4: Check the utilization estimate. If there are unused resources, go back to Step 1 and
repeat.
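Step 2 of the flow above can be illustrated with a small search over the factorizations of a fixed parallelism x · y. This is only a sketch: the cost function `toy_ff_cost` is a placeholder assumption rather than the utilization model of Equations (4.9) to (4.13), and the requirement that x divide IC and y divide OC is likewise assumed for illustration.

```python
def candidate_pairs(ic, oc, parallelism):
    """All (x, y) with x * y == parallelism, x dividing IC and y dividing OC."""
    pairs = []
    for x in range(1, ic + 1):
        if parallelism % x or ic % x:
            continue
        y = parallelism // x
        if y <= oc and oc % y == 0:
            pairs.append((x, y))
    return pairs


def toy_ff_cost(x, y, ic, oc):
    """Placeholder cost model (an assumption, not the model used in the text)."""
    return x * y + 64 * x + 64 * y


def best_pair(ic, oc, parallelism, cost=toy_ff_cost):
    """Pick the (PIC, POC) pair with the lowest estimated cost."""
    pairs = candidate_pairs(ic, oc, parallelism)
    return min(pairs, key=lambda p: cost(p[0], p[1], ic, oc))
```

With this placeholder cost, `best_pair(512, 512, 1024)` selects the balanced pair (32, 32), which is consistent with the remark that the optimal point lies in the middle of the curve. A real flow would replace `toy_ff_cost` with estimates derived from the target device.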
Figure 4·6: DSE flow and FF utilizations with different combinations of
PIC and POC when fixed parallelisms are used at the 7th CONV layer of
VGGNet
4.5 Experimental Results
The benefits of the proposed network structure optimization, layer fusion, and
workload balancing are evaluated in Figure 4·2, Figure 4·4, and Table 4.1, respectively.
This section does not discuss these benefits separately; instead, we
evaluate their overall contribution to accelerating BNN inference.
We use a Kintex KCU1500 FPGA to evaluate performance, energy efficiency, and
latency; the networks used are Cifar-10 VGG-like [Courbariaux et al., 2015], ImageNet
AlexNet [Krizhevsky et al., 2012], VGGNet-16 [Simonyan and Zisserman, 2015] and Im-
ageNet ResNet-18 [Canziani et al., 2016, Andri et al., 2018, Lin et al., 2017].
Our results are compared with existing work on BNN acceleration using GPUs, CPUs,
Table 4.2: Latency, Performance, Energy Efficiency comparison using dif-
ferent templates, GPUs (Tesla K40 [Liang et al., 2018], V100 [Li et al.,
2019a], and GTX 1080 [Hu et al., 2018]), FPGAs (Stratix V [Liang et al.,
2018], VCU108 [Ghasemzadeh et al., 2018]), CPUs (Xeon E5-2640 [Liang
et al., 2018], Phi 7210 [Hu et al., 2018], and i7-7700 [Hu et al., 2018])
to execute inference of 4 Networks: Cifar-10 VGG-like [Courbariaux
et al., 2015], ImageNet AlexNet [Krizhevsky et al., 2012], VGGNet-16 [Si-
monyan and Zisserman, 2015] and ResNet-18 [Andri et al., 2018].
CPU
Platform              Xeon E5-2640   Xeon E5-2640   Phi 7210    i7-7700
Frequency             2.4GHz         2.4GHz         1.3GHz      3.6GHz
Dataset               Cifar          ImageNet       ImageNet    ImageNet
Network               VGG-Like       AlexNet        VGG-16      VGG-16
Latency               1.36s          10.8s          11.8ms      16.1ms
Performance (Img/s)   0.74           0.09           85          62
Energy (Img/KJ)       7.79           0.95           395         954
Accuracy (%)          86.31          66.8           76.8        76.8

GPU
Platform              Tesla K40      V100           V100        GTX 1080
Frequency             745MHz         1.37GHz        1.37GHz     1.61GHz
Dataset               ImageNet       Cifar          ImageNet    ImageNet
Network               AlexNet        VGG-Like       AlexNet     VGG-16
Latency               1.26s          994µs          2.23ms      12.9ms
Performance (Img/s)   0.79           1006           448         78
Energy (Img/KJ)       3.36           5543           2475        433
Accuracy (%)          66.8           89.9           71.2        76.8

FPGA and ASIC
Platform              Stratix V      Stratix V      VCU108      UMC 65-nm ASIC [Andri et al., 2018]
Frequency             150MHz         150MHz         200MHz      450MHz
Dataset               Cifar          ImageNet       ImageNet    ImageNet
Network               VGG-Like       AlexNet        AlexNet     ResNet-18
Latency               130µs          1.16ms         1.92ms      8ms
Performance (Img/s)   7692           862            521         125
Energy (Img/KJ)       2.9E5          3.3E4          2.7E4       2.8E5
Accuracy (%)          86.31          66.8           N/A         N/A

FPGA (this work: KCU1500)
Frequency             200MHz         200MHz         200MHz      200MHz
Dataset               Cifar          ImageNet       ImageNet    ImageNet
Network               VGG-Like       AlexNet        VGG-16      ResNet-18
Latency               8.2µs          21.5µs         335µs       67.8µs
Performance (Img/s)   1.2E5          4.7E4          2817        1.47E4
Energy (Img/KJ)       3.6E6          1.4E6          8.3E4       3.7E5
Accuracy (%)          88.5           72.7           74.3        65.6
and FPGAs. Because GPUs and CPUs are extremely underutilized when executing BNNs in
no-batch mode, we mainly compare our results with a recently published high-end FPGA-
based BNN accelerator. As shown in Table 4.2, our latencies for Cifar-10 VGG-like and
AlexNet are 15.9x and 54.0x lower than those of the state-of-the-art FPGA-based BNN design. We
also measure the inference latency of VGGNet on ImageNet. It takes 335µs to complete the
inference of a 224×224 image. Using our design, 2817 224×224 images can be inferred
per second with an accuracy of 76.8%. The latency of VGGNet is not compared with
existing FPGA accelerators, because our work is the first to accelerate binarized VGGNet
for ImageNet on a single FPGA and no existing FPGA-based work reports the inference
latency of binarized VGGNet. Our latency, performance, and energy efficiency
results are therefore compared with those of CPUs and GPUs. Our latency and performance are
at least 33.2x better than the state-of-the-art CPU results. For ResNet-18, the latency is
67.8µs. Compared with the existing ASIC work, YodaNN (which uses 12-bit fixed-point
intermediate feature maps), LP-BNN is 118× faster. Here, we do not compare our result
with ReBNet [Ghasemzadeh et al., 2018], as the ResNet structure used in that paper is
self-customized.
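The quoted speedups can be reproduced from the latencies in Table 4.2. The short check below assumes that the FPGA baseline for Cifar-10 VGG-like and AlexNet is the Stratix V column and that the ASIC baseline for ResNet-18 is the UMC 65-nm column; the helper name `speedup` is hypothetical.

```python
def speedup(baseline_latency_us, our_latency_us):
    """Ratio of a baseline latency to our latency (both in microseconds)."""
    return baseline_latency_us / our_latency_us


# Latencies in microseconds, taken from Table 4.2.
cifar_vgg_like = speedup(130.0, 8.2)   # Stratix V vs. this work
alexnet = speedup(1160.0, 21.5)        # Stratix V vs. this work
resnet18 = speedup(8000.0, 67.8)       # UMC 65-nm ASIC vs. this work

print(f"{cifar_vgg_like:.1f}x {alexnet:.1f}x {resnet18:.0f}x")
# prints "15.9x 54.0x 118x"
```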
Performance is evaluated in images per second (Img/s), i.e., the number of images that
can be inferred in one second, which shows the processing capability of our design. Because
the I/O ports on the FPGA board can be customized and each FPGA chip provides
hundreds of high-bandwidth I/O resources, we do not take I/O constraints into
consideration in the performance evaluation.
Energy efficiency is evaluated with respect to Images/J. Our result is, on average, 47.3×
better than previous work. The pipeline utilization of our design for VGG-16 is 99.7%, i.e.,
the percentage of idle stages in the pipeline is only 0.3%. The resource usage for the
implementation of VGGNet is listed in Table 4.1. The highest operating frequency that
can be deployed is 310MHz. To provide a fair comparison with existing work and
highlight the benefits of our design rather than those of a delicate implementation, we use 200MHz
as the operating frequency.
4.6 Conclusions
In this chapter, a single-FPGA-based accelerator for ultra-low-latency inference of BNNs,
LP-BNN, is proposed. Our design can complete the inference of binarized ImageNet networks,
such as AlexNet, VGG-16, and ResNet-18, within 21.5µs, 335µs, and 67.8µs, respectively,
with no accuracy loss compared with other BNN works. For small-scale networks, such as
VGG-like in Cifar-10, the latency is only 8.2µs. The proposed accelerator and the resulting
ultra-low latency inference make it possible to deploy deep neural networks in real-time
applications, such as autonomous cars and robotic control.
Chapter 5
O3BNN-R: An Out-of-Order Architecture for
High-Performance BNN Inference with
redundant operation pruning
This chapter introduces the architecture of O3BNN-R (Out-of-Order scheduling-based
BNN with optional Regularization support). O3BNN-R is able to handle operation-level
irregularity by using a novel fine-grained runtime out-of-order scheduler to efficiently de-
tect and skip superfluous operations. BNNs are used as an example of handling operation
irregularity. This chapter is based on the works published in the 33rd International Con-
ference on Supercomputing (ICS) ©2019 ACM [Geng et al., 2019b] and Transactions on
Parallel and Distributed Systems (TPDS) ©2021 IEEE [Geng et al., 2021].
5.1 Introduction
Deep-Neural-Networks (DNNs) are in widespread use due to their ability to learn well
enough to achieve high accuracy [Geng et al., 2020,Ji et al., 2013,Geng et al., 2018,Karpa-
thy et al., 2014, Wang et al., 2020]. However, for many high-volume, but cost- or power-
restricted applications, accuracy is not an absolute requirement [Hubara et al., 2017, Tang
et al., 2017]. Rather, reaching a certain well-defined level of accuracy is often sufficient,
but with low cost and low latency (or even real-time response) being highly desired. This is
especially true for IoT and smart-edge devices [Zhou et al., 2016, Hubara et al., 2016, Wei
et al., 2017, Lu et al., 2017].
As mentioned in Chapter 4, because they satisfy these requirements, Binarized Neural
Networks (BNNs) [Rastegari et al., 2016] have recently received much attention. BNNs
use single bits to represent features and parameters, thus significantly reducing computa-
tion complexity from floating point or integer to Boolean and memory demand from bytes
per datum to bits. This introduces the potential to significantly reduce the inference delay at
the cost of loss in accuracy. A recent study by Bethge et al. [Bethge et al., 2020] shows that
well-designed BNN structures can achieve comparable and even superior accuracy (70%
Top-1 accuracy for ImageNet) compared with state-of-the-art condensed full-precision net-
works such as MobileNet.
Having only two values per neuron, a BNN’s network structure is significantly dif-
ferent from a conventional DNN’s. These differences expose various new optimization
opportunities. For example, Umuroglu, et al., [Umuroglu et al., 2017] show that the Batch-
Normalization (BN) functions in most BNNs can be simplified to a threshold-based com-
pare and thus avoid the floating point (FP) calculation. Fujii, et al., [Fujii et al., 2018] use
neuron pruning, which eliminates neurons in the case where the sum of weights is lower
than a pruning-threshold, and retrains the network for this adjustment. By doing so the
number of neurons, and so the associated computation, is reduced. The accuracy, however,
is compromised.
The work here is motivated by these previous studies together with the following two
observations. First, in a BNN, a neuron’s output is a Boolean whose value is determined by
comparing the accumulation of all dot-products of the edges linked to this neuron with a
fixed threshold that has been determined during the training phase. The idea is that we can
immediately cease further computation of the dot-product and return (a) 1 as soon as the
current accumulation becomes larger than the activation threshold (Figure 5·1-A); or (b) 0
as soon as it is found that the current accumulation has no chance of reaching the threshold
(Figure 5·1-B). This cessation is analogous to breaking out of a loop as soon as the result








[Figure 5·1 panels: (A) i0 + i1 > Threshold, so Out = 1; (B) 1×2 < Threshold − i0 − i1, so Out = 0; inputs i ∈ [0, 1].]










Figure 5·1: Three types of pruning: (A) & (B) Threshold-based edge prun-
ing; by accumulating the inputs (ix) and comparing the accumulation result
to a threshold, the value of a neuron (Out) is calculated (binary 1/0). (C)
Pooling-based edge pruning. Out from pooling is binary (1/0).
is determined. We call this approach threshold-based edge pruning and refer to the two
cases as Condition 1 and Condition 2.
A second observation is about pooling, in particular max-pooling, which for BNNs is
the most widely used. In the case where any one of the n×n inputs (typically 2×2) is 1, the
pooling result must also be 1; thus we can avoid evaluating the remaining entries. For
example, in Figure 5·1-C the second entry is a 1 so we can prune the computation for the
last two neurons. We refer to this approach as pooling-based edge pruning. Analogous
methods work for min- and mean-pooling.
Although both observations are straightforward, the efficient harvesting of these prun-
ing opportunities is challenging. This is because both are irregular, occasional, data-
dependent, run-time, and strongly dependent on specific evaluation order. For threshold-
based edge pruning, it is difficult to decide when the partial accumulation will surpass the
threshold, or when we can assert that it will never reach the threshold. Pooling-based edge
pruning is similarly difficult: it may eventually turn out that all entries are 0.
Exploiting these opportunities requires that the design be extremely flexible and dy-
namic. On the one hand, the control unit must frequently assess the current accumulation
and be capable of immediately terminating the remaining execution of the neuron. This
appears to require that the computation be sequential for the sake of pruning; yet, we still
need parallelism to guarantee high performance. On the other hand, in case the evaluation
of a neuron is terminated early, the execution gap needs to be filled instantly to avoid losing
performance through pipeline bubbles. This combined challenge has been considered to
be very difficult [Fujii et al., 2018]. In this chapter, we address these difficulties with an
out-of-order edge pruning architecture: O3BNN-R.
To further enhance the potential gain, we propose an architecture/algorithm co-design
approach during network training. We add two regularization terms to the loss function,
for threshold-based and pooling-based pruning approaches, respectively. These two terms
create more pruning opportunities without sacrificing accuracy by allowing the respective
decisions to be made earlier: for threshold-based pruning, the regularization term moves
the thresholds closer to either 0 or the maximum value; for pooling-based pruning, the term
moves the 1 elements towards the upper-left corner of the pooling windows.
The main contributions of this work are as follows:
• Two run-time approaches to edge pruning for BNN inference: threshold-based and
pooling-based;
• A 2D-rotative out-of-order (OoO) design for dynamic workload scheduling and bal-
ancing;
• An architecture called O3BNN-R that implements efficient run-time BNN inference
pruning; and
• Regularized training that enhances the pruning rate.
We evaluate the design on an FPGA platform using VGG-16 [Simonyan and Zisserman,
2015], AlexNet [Krizhevsky et al., 2012] for ImageNet, and a VGG-like network [Cour-
bariaux et al., 2015] for Cifar-10. Evaluations demonstrate that the out-of-order approach,
without regularization in training, can prune 27%, 19%, and 42% of the operations for the
three networks, respectively, without any accuracy loss. This brings at least 2.1×, 1.5×,
and 1.7× speedups, respectively, and on average 47×, 23×, and 32× energy-efficiency
improvements, respectively, over state-of-the-art FPGA/GPU/CPU BNN implementations.
With training regularization, the performance of O3BNN-R is further improved, on aver-
age, by 15% and with only 0.5% accuracy loss.
The organization of this chapter is as follows. In Section 5.2, we give the motivation
of edge pruning. In Section 5.3, the edge pruning opportunities are introduced. In Section
5.4, an Out-of-Order BNN pruning design is proposed. In Section 5.5, the regularization
augmentation is described. In Section 5.6, experimental results are presented and analyzed.
In Section 5.7, related work is discussed. Section 5.8 provides a conclusion.
5.2 Motivation for Redundant Edge Pruning
Researchers have observed various opportunities to further optimize the basic BNN struc-
ture. As mentioned in Chapter 4, in FINN [Umuroglu et al., 2017, Blott et al., 2018]
BN and BIN are merged. As shown in Figure 5·2, the original FP-based BN function
in Equation 2.4 and the BIN function in Equation 2.5 are integrated as a threshold-based
comparison. The threshold can be calculated according to the following equation:
$$Threshold_{i,j,k} = \frac{E_{*,j,k} + L_{j,k}}{2} - \beta_{j,k} \cdot \frac{\sqrt{Var[x_{*,j,k}] + \epsilon}}{2 \cdot \gamma_{j,k}} \tag{5.1}$$
where L is the length of the vector K×K×IC, K is the filter size, and IC is the number of
input channels. Note that γ j,k and β j,k are learned in training and fixed in inference. In
this way, the FP operations in BN now become a simple threshold. With a fixed threshold,
our new observation is that we can prune certain computations in CONV/FC when a partial
accumulation is already sufficient to obtain a result.
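To make the merged BN and BIN comparison concrete, the sketch below evaluates Equation (5.1) and checks it against an explicit BN-then-sign computation. Several points are assumptions made for illustration, since Equations 2.4 and 2.5 are not reproduced here: the popcount p is related to the ±1 sum by x = 2p − L, γ is positive, and a BN output of at least zero is binarized to 1. All function names are hypothetical.

```python
import math


def bn_threshold(mean, var, gamma, beta, length, eps=1e-5):
    """Equation (5.1): threshold on the popcount, valid for gamma > 0."""
    return (mean + length) / 2.0 - beta * math.sqrt(var + eps) / (2.0 * gamma)


def bn_then_binarize(popcount, mean, var, gamma, beta, length, eps=1e-5):
    """Reference path: map popcount to a +1/-1 sum, apply BN, take the sign."""
    x = 2 * popcount - length  # sum of +1/-1 products (assumed encoding)
    y = gamma * (x - mean) / math.sqrt(var + eps) + beta
    return 1 if y >= 0 else 0


def threshold_binarize(popcount, threshold):
    """Merged path: a single comparison replaces the floating-point BN."""
    return 1 if popcount >= threshold else 0
```

For example, with mean 1.5, variance 4.0, γ = 0.8, β = −0.3, and L = 9, the threshold is about 5.625, so popcounts of 6 or more produce 1 on both paths.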
Another motivating study relates to neuron pruning [Fujii et al., 2018]. In the FC layers,
when the sum of weights for a neuron’s linking edges is smaller than a threshold, this
Figure 5·2: A typical 3-CONV-1-FC BNN Network structure. It is similar
to DNN, except that Activation acts as BIN, Multiplication acts as XNOR,
Accumulation acts as POPCOUNT.
neuron is noted as inactive and is pruned. Fine-tuning is required and accuracy degrades.
The authors also mention edge pruning, expecting that it can be more beneficial than neuron
pruning, but do not pursue it due to the irregularity of the structures and the difficulty of the
hardware implementation. Here we demonstrate that edge pruning is feasible and propose
a dynamic out-of-order architecture that implements it with little hardware overhead, and
with no (or managed) accuracy loss.
5.3 Pruning Opportunities
In this section, we first introduce the basic design of BNNs and then discuss the pruning
possibilities. BCONV denotes bit convolution. The bit-fully-connected layers are treated
as 1×1 BCONVs.
5.3.1 Basic BCONV design
Figure 5·3 shows pseudocode for a BNN with BCONV layers where K is the convolution
filter size, NIC is the number of input channels, NOC is the number of output channels,
WIDTH and HEIGHT are the width and height of the feature maps, and LAYER is the
Figure 5·3: Pseudo code of a traditional BCONV/BFC without pruning and
the symbols of edge and curve.
number of layers.
There are 7 loops. Each iteration of Loop 7 processes an edge, i.e., XNOR + POP-
COUNT, in the network graph (in red in Figure 5·3). Each iteration of Loop 5 is called a
curve; it processes K×K edges per input and output channel, i.e., a convolution window
(in green in Figure 5·3). The resulting value of a curve is the aggregation of its K×K
edges. We annotate IC & OC along the curve to indicate the index of the curve in Loop 5
and Loop 4. We also annotate H & W at the front of the curve to indicate its index in
Loop 2 and Loop 3. Therefore, a curve indexed by [IC,OC,W,H] represents the work-
load of evaluating a convolution window of K×K neurons for input channel IC and output
channel OC at location [W,H] of the input feature map. The complete calculation of each
output channel requires the accumulation of NIC curves (Loop 5); the entire BCONV layer
requires NIC×NOC×HEIGHT×WIDTH curves. In this design a curve is the basic
granularity for edge pruning. Existing work [Nurvitadhi et al., 2016, Fujii et al., 2018, Zhao
et al., 2017b] generally exploits parallelism in the loop-nest through:
• Loops 6-7: Parallel execution for K×K edges in a curve.
• Loop 5: Parallel evaluation of different input channels IC for the same output channel
OC.
Figure 5·4: (A): illustration of the evaluation process of an output neu-
ron using threshold-based BN function; (B)&(C): Conditions of threshold-
based edge pruning
• Loop 4: Parallel processing of different output channels OC.
• Loops 2-3: Usually not parallelized to ensure data reuse across neighboring [H,W ]
(i.e., neighboring CONV windows overlap).
• Loop 1: Usually not parallelized since a hardware implementation to exploit model
parallelism may suffer from layer-wise workload imbalance and excessive storage
demand for intermediate results.
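The seven-loop nest described above can be written out as a plain reference model. This sketch follows the loop roles given in the text (Loop 1 over layers, Loops 2 and 3 over H and W, Loop 4 over OC, Loop 5 over IC, Loops 6 and 7 over the K×K window); it assumes no padding, a `>=` threshold comparison, and a hypothetical data layout (`weights[oc][ic][ky][kx]`, `feature_map[ic][h][w]`), none of which is taken from Figure 5·3.

```python
def bconv_network(layers, feature_map):
    """Evaluate a stack of BCONV layers without any pruning.

    layers: list of dicts with 'weights'[oc][ic][ky][kx] and 'thresholds'[oc].
    feature_map[ic][h][w] holds 0/1 values.
    """
    for layer in layers:                                   # Loop 1: layers
        weights, thresholds = layer["weights"], layer["thresholds"]
        n_oc, n_ic, k = len(weights), len(weights[0]), len(weights[0][0])
        height = len(feature_map[0]) - k + 1
        width = len(feature_map[0][0]) - k + 1
        out = [[[0] * width for _ in range(height)] for _ in range(n_oc)]
        for h in range(height):                            # Loop 2
            for w in range(width):                         # Loop 3
                for oc in range(n_oc):                     # Loop 4
                    acc = 0
                    for ic in range(n_ic):                 # Loop 5: one curve
                        for ky in range(k):                # Loop 6
                            for kx in range(k):            # Loop 7: one edge
                                a = feature_map[ic][h + ky][w + kx]
                                b = weights[oc][ic][ky][kx]
                                acc += 1 if a == b else 0  # XNOR + POPCOUNT
                    out[oc][h][w] = 1 if acc >= thresholds[oc] else 0
        feature_map = out
    return feature_map
```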
5.3.2 Threshold-based Edge pruning
Figure 5·4-A illustrates threshold-based BN for each output channel (OC in Loop 4): (1)
calculate and accumulate NIC curves for this output channel, i.e., Loop 5; and (2) binarize
via threshold comparison. Since the threshold is fixed in inference and the output is a
binary value, we do not necessarily need to evaluate and accumulate all the curves before
making a comparison. In other words, if the partial results are already sufficient to imply
the output bit, we can avoid the evaluation and accumulation of the remaining curves. In
the following, we use ACC_Cur to denote the number of curves accumulated so far, ACC for the
accumulated result of these ACC_Cur curves, and T for the threshold.
Condition 1: ACC > T implies that the remaining (NIC − ACC_Cur) curves can be
pruned. As both input features and weights in BNNs are binary (1/0), the curve’s value
is always non-negative and accumulation never decreases ACC. Consequently, whenever
ACC exceeds T , the binarization result is 1 and will never flip to 0 during the remaining
accumulation. Therefore (NIC−ACC_Cur) curves can be pruned, as shown in Figure 5·4-
B.
Condition 2: ACC < T − K² × (NIC − ACC_Cur) implies that the remaining (NIC − ACC_Cur) curves
can be pruned. The maximum value of each curve is K² (all XNOR results
are 1), so with ACC_Cur input channels already accumulated in ACC, the maximum possible
contribution of the remaining (NIC − ACC_Cur) curves is K² × (NIC − ACC_Cur).
Therefore, if ACC + K² × (NIC − ACC_Cur) is less than T, then ACC will never reach
T; we can safely prune the remaining (NIC − ACC_Cur) curves and output 0, as shown in
Figure 5·4-C.
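Conditions 1 and 2 can be combined into a single early-exit loop over the curves of one output channel. The following is a minimal sequential sketch; the function name and the treatment of the tie case ACC = T (output 0) are assumptions, since the text only specifies the behavior for ACC > T.

```python
def evaluate_output(curves, threshold, k):
    """Accumulate curve values in order; stop once the output bit is decided.

    curves: per-input-channel curve values, each in [0, k*k].
    Returns (output_bit, number_of_curves_evaluated).
    """
    n_ic = len(curves)
    acc = 0
    for acc_cur, value in enumerate(curves, start=1):
        acc += value
        remaining = n_ic - acc_cur
        if acc > threshold:                       # Condition 1
            return 1, acc_cur
        if acc < threshold - k * k * remaining:   # Condition 2
            return 0, acc_cur
    return (1 if acc > threshold else 0), n_ic
```

For instance, with K = 3 and T = 10, the curve values [9, 9, 0, 0] are decided after two curves (18 > 10). With T = 30, the values [0, 0, 0, 9] are decided after one curve, because the three remaining curves can add at most 27.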
Implementation Challenges: (1) To prune within Loop 5, it must execute sequentially;
this may inhibit parallelism that can otherwise be exploited. (2) Due to pruning, the latency
for each iteration in Loop 4 can differ substantially, which may lead to workload imbal-
ance. (3) Dynamic, asynchronous, and data-dependent slacks must be filled immediately;
otherwise pruning will not result in any performance benefit. (4) The hardware overhead
for verifying pruning conditions, ceasing the present execution, and stealing new jobs for
workload balancing must be limited.
5.3.3 Pooling-based edge pruning
Given a threshold-based BN design, we now consider the pooling function (Figure 5·5-A).
Figure 5·5-B shows how the four entries of a 2×2 pooling window are sub-sampled after
a convolution. As the entries are binary, the max operation is equivalent to a bitwise-OR
among the four entries. Therefore, once an entry is identified as 1 (e.g., the first entry in
Figure 5·5-B), the pooling result must be 1 and we can safely prune the evaluation of the
Figure 5·5: (A): BNN structure used in this work: threshold-based BN
followed by POOLING; (B): The condition of Pooling-based Edge Pruning.
remaining entries. For example, the convolution of three entries in Figure 5·5-B is pruned.
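Because binary max-pooling is a logical OR, pooling-based pruning amounts to short-circuit evaluation. The sketch below models each pooling-window entry as a deferred computation so that pruned entries are never evaluated; the function name and the use of zero-argument callables are illustrative assumptions.

```python
def pooled_output(entry_evaluators):
    """Binary max-pooling as a short-circuit OR over lazily evaluated entries.

    entry_evaluators: zero-argument callables, each computing one window entry.
    Returns (pooled_bit, number_of_entries_evaluated).
    """
    for evaluated, evaluate in enumerate(entry_evaluators, start=1):
        if evaluate() == 1:
            return 1, evaluated  # remaining entries are pruned
    return 0, len(entry_evaluators)
```

If the first entry of a 2×2 window evaluates to 1, the call returns after a single evaluation and the other three convolutions are skipped; if all entries are 0, all four must be evaluated.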
Implementation Challenges: (1) To prune the pooling entries, the computation of these
entries must be processed sequentially, limiting parallelism. (2) Pruning may lead to work-
load imbalance. (3) The dynamic and data-dependent slacks due to pruning must be lever-
aged effectively. (4) Extra delay and hardware overhead must be limited.
5.4 Out-of-Order BNN Pruning Design
A critical question to be addressed is the conflict between the need for sequential execution
to facilitate pruning and parallel execution to obtain performance. In this section we first
present a trade-off strategy and a method to compensate for compromised parallelism. We
then show how to achieve workload balance via rotative workload scheduling. Finally, we
discuss the O3BNN-R hardware implementation.
5.4.1 Parallelization Strategy
To achieve threshold and pooling edge pruning, Loop 5 must be executed sequentially
and Loop 4 partially sequentially. To compensate for this reduced parallelism, we exploit
the inter-layer parallelism (i.e., model parallelism) from Loop 1. Note that data reuse in
Loops 2 and 3 is still critical for performance. We resolve layer-wise workload imbal-
Figure 5·6: Three methods of workload scheduling as described in the text.
ance by allocating computation resources proportional to the per-layer workload. For large
storage demand, we adopt a layer-fusion technique, as referred to in [Alwani et al., 2016].
Overall, parallelism from K (Loops 6-7), OC (Loop 4), and L (Loop 1) are exploited for
parallel execution. The pooling pruning at different output channels is applied in parallel.
In the case that more parallelism is needed, Loops 2 and 3 can be unrolled and processed
partially in parallel. By doing so, pooling pruning at the same output channel can also be
conducted in parallel.
5.4.2 Rotative Workload Scheduling
For clarity and without loss of generality, let us first assume that 4 PEs process a BNN layer
with 8 input and 8 output channels (IC=8, OC=8). We present three approaches to show
the evolution of the design: in-order, 1D Rotative OoO, and 2D Rotative OoO.
In-order Scheduling: All PEs work in lock-step. Whenever ACC (accumulation of
curves) at one PE triggers one of the threshold pruning conditions, this PE aborts and re-
mains idle (Figure 5·6-A). The simple in-order design has two advantages. The first is low
storage demand. Since the NOC output channels can be statically partitioned among PEs
(e.g., PE2 in Figure 5·6-A always processes OC-2 and OC-6), the weights for convolution
can be stored in a distributed fashion, saving memory space. The second advantage is simple
data feeding logic under fixed curve mapping. The drawback is that it does not benefit
significantly from pruning (except when all PEs are pruning, which is rare), while wasting
computation resources and creating pipeline bubbles.
1D Rotative OoO: In the basic OoO design, a new curve is fetched immediately to fill the
gap left by pruning. For example, in PE2, curves with OC=6 (in dark blue) issue after curves
with OC=2 (in green) are pruned in cycle-3. To efficiently address the target curve loca-
tion (using IC and OC) in the input feature map, which is stored sequentially in (possibly
very large) memory, we propose a vertical rotative design. We use a 1-8 counter for IC
while dynamically controlling the value of OC for OoO. As shown in Figure 5·6-B, PEs
execute curves consecutively with IC rotating through 1 to 8. When pruning occurs, OC
is updated following numerical order, i.e, new curves that have the next unprocessed OC
ID are assigned. We refer to the group of curves with the same OC, but different IC, as a
curve-group.
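The difference between in-order and 1D rotative OoO scheduling can be illustrated with a small cycle-count model. This is only a sketch under simplifying assumptions: each curve-group is described by the number of curves executed before it is pruned or completed, one curve takes one cycle, and the function names and the example pruning pattern are hypothetical rather than taken from the O3BNN-R implementation.

```python
import heapq

def in_order_cycles(lengths, num_pes):
    """Lock-step model: PEs take curve-groups in batches of num_pes, and a
    pruned PE idles until the slowest curve-group of its batch finishes."""
    total = 0
    for start in range(0, len(lengths), num_pes):
        total += max(lengths[start:start + num_pes])
    return total

def rotative_ooo_cycles(lengths, num_pes):
    """OoO model: a PE that prunes or completes its curve-group immediately
    takes the next unprocessed OC in numerical order."""
    free_at = [0] * num_pes          # cycle at which each PE becomes free
    heapq.heapify(free_at)
    for n in lengths:                # curve-groups are issued in OC order
        t = heapq.heappop(free_at)   # earliest available PE
        heapq.heappush(free_at, t + n)
    return max(free_at)

# Hypothetical example: 8 curve-groups (OC=8), up to 8 curves each (IC=8), 4 PEs.
lengths = [8, 3, 2, 8, 1, 5, 4, 2]
print(in_order_cycles(lengths, 4), rotative_ooo_cycles(lengths, 4))
```

In this toy pattern the lock-step schedule is bounded by the slowest curve-group of each batch, while the OoO schedule approaches the total work divided by the number of PEs.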
This design executes OoO to exploit pruning opportunities without introducing any
pipeline bubbles. A major shortcoming, however, is storage cost: since it is not known in
advance which OC will be fetched for a particular PE (pruning is data dependent), every PE
can potentially process any OC with any IC. Therefore, every PE must retain a local copy
of the entire set of weights. If the weights were shared globally, a very clever data-feeding
circuit would have to be designed to issue the required weights to corresponding PEs on
the correct cycle. This would introduce delay as well as area and power overhead. Also,
neither approach is scalable to a large number of PEs.
2D Rotative OoO: To resolve the memory issue, we propose 2D rotative OoO. The idea
is to distribute the weights among PEs using a time-sharing approach. Specifically, rather
than partitioning curves among PEs along OCs (as with the in-order design), we partition
along ICs. In other words, each PE statically handles a portion of ICs. For example,
in Figure 5·6-C PE3 only processes IC-5 and IC-6. Consequently, weights can also be
statically partitioned along IC and distributively reside in the PE’s local memory. For the
horizontal rotation, PEs are connected to their right neighbors forming a unidirectional loop
among PEs (horizontally in Figure 5·6-C). When a PE finishes its portion of ICs, it forwards
the unfinished curve-group to the side buffer of its right neighbor. To fetch a curve each
cycle for execution, a PE first checks its left-side buffer and, if an unfinished curve-group
is present, continues it; otherwise, it fetches a new curve-group (the current curve-group
has been either pruned or completed) and starts execution.
To summarize: by simultaneously rotating the curve groups along the vertical dimen-
sion, we dynamically dispatch them with desired pruning capability and with low memory
addressing cost. This is done while distributively sharing weights among PEs in a time-
sharing manner by rotating along the horizontal dimension. Given these advantages, the
2D rotative design is used for the O3BNN-R hardware implementation.
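The storage argument behind the three schemes can be made concrete with a short calculation. The sketch below assumes 1-bit weights, K×K weight bits per curve, and channel counts that divide evenly among the PEs; the function name and the choice K=3 are illustrative assumptions.

```python
def weight_bits_per_pe(scheme, n_ic, n_oc, k, num_pes):
    """Local weight storage (in bits) that one PE needs under each scheme."""
    bits_per_curve = k * k                       # one binary weight per edge
    if scheme == "in_order":                     # static partition along OC
        curves = (n_oc // num_pes) * n_ic
    elif scheme == "1d_rotative":                # any PE may process any OC
        curves = n_oc * n_ic
    elif scheme == "2d_rotative":                # static partition along IC
        curves = (n_ic // num_pes) * n_oc
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return curves * bits_per_curve

# Running example from the text: IC=8, OC=8, 4 PEs; K=3 is assumed here.
for scheme in ("in_order", "1d_rotative", "2d_rotative"):
    print(scheme, weight_bits_per_pe(scheme, n_ic=8, n_oc=8, k=3, num_pes=4))
```

Under these assumptions the 2D rotative scheme matches the per-PE storage of the in-order design while keeping the OoO behavior of the 1D design.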
5.4.3 O3BNN-R Architecture
We introduce the O3BNN-R architecture, shown in Figure 5·7. To achieve workload balancing,
the PEs and other hardware resources are allocated roughly in proportion to the per-layer
workload. In this way, layers linked in a daisy chain can cooperate effectively in a deeply
pipelined manner, exploiting inter-layer parallelism from Loop 1. Each layer contains 3 major
modules: a PE array for workload execution, a Scoreboard for tracking curve execution status
and ensuring in-order commitment, and a Data Feeding System (DFS) for buffering
and feeding the correct input data.
Figure 5·7: Overall architecture of O3BNN-R; the architectures of the PE array,
Scoreboard, and DFS are shown in Figures 5·8, 5·9, and 5·10, respectively.
Figure 5·8: Architecture of PE array
Processing Element Array
Figure 5·8 shows the detailed architecture of the PE array. To realize horizontal rotation,
PEs are linked via a unidirectional circular communication network with two channels:
one for forwarding the unfinished curve-group (red-line) and one for conveying the present
ACC value. Inside a PE, there are three buffers. (1) The buffer at the bottom-left is used
to buffer and reuse input feature at a particular [H,W ] (i.e., reuse input feature data across
OCs), with each curve per time slot. The 2-to-1 multiplexer linked to this buffer is used to
select among reused input features for the next OC (i.e., curve-group), or buffering a new
input feature from the next image pixel [H,W ]. The FIFO loop-back realizes the vertical
rotation by repeatedly reusing the buffered data for different OCs. (2) The middle buffer
is for distributively storing weights with each PE holding NIC/PEs×NOC curve entries.
The input feature and weights for a curve are XNORed and accumulated in parallel (into
ACC). (3) The upper-right buffer is for pending data for inter-PE communication. The 3-
to-1 multiplexer selects from: (a) 0-input for a completely new curve-group; (b) self-input
for continuous accumulation within its curve-group portion; (c) neighbor-input to start its
portion of a curve-group following its left neighbor. The 2-to-1 multiplexer chooses from
continuously processing its curve-group portion or conveying the curve-group to its right
neighbor PE.
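The XNOR-and-accumulate step performed for one curve can be sketched in software as a bitwise XNOR followed by a population count. The packing of the K×K binary inputs and weights into integers, and the function name, are assumptions made for illustration; the hardware performs this in parallel logic.

```python
def curve_acc(feature_bits, weight_bits, k=3):
    """Partial sum contributed by one curve: XNOR the K*K binary inputs with
    the K*K binary weights and count the matching positions."""
    n = k * k
    mask = (1 << n) - 1                              # keep only the K*K valid bits
    xnor = ~(feature_bits ^ weight_bits) & mask
    return bin(xnor).count("1")                      # popcount

# A PE adds this value to its running ACC for the current curve-group.
print(curve_acc(0b111000111, 0b111111111))
```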
Scoreboard
Figure 5·9: O3BNN-R Scoreboard
Analogous to reservation stations in Tomasulo’s algorithm [Tomasulo, 1967, Hennessy
and Patterson, 2011] for exploiting instruction-level-parallelism (ILP) in an OoO CPU, the
Scoreboard here tracks curve-group execution status and enforces commitment of curve-
groups in the right order. As shown in Figure 5·9, each entry tracks a curve-group and has
three basic fields: OC for curve-group ID, status for control, and the 1-bit output for this
OC. Each column with NOC entries in the same color tracks NOC curve-groups for the
pixels with the same [W, H] at all (NOC) output channels. In our design, the 1-bit outputs
of entries in the same column are committed to the succeeding layer simultaneously. Thus
the number of entries in the Scoreboard is an integral multiple of NOC.
We define the OoO capability of an O3BNN-R design as the number of columns of entries
in its Scoreboard. The coordinate field tracks curve-groups with multiple [W,H]s and
is used when a multi-column Scoreboard is needed for stronger OoO capability (discussed
later). The pooling field is used when pooling is required after the convolution,
in which case curve-groups that belong to the same pooling window share the same Scoreboard
entry. When performing convolution for the elements covered by the same pooling window, if
an element returns 1, the status field is marked as “skip” and the output field is set to
1. With skip status, the remaining elements sharing the same Scoreboard entry are no longer
issued to the PE array, i.e., they are pruned.
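The skip behavior of a shared Scoreboard entry amounts to a short-circuit evaluation of the pooling window. The following sketch models only that logic; `evaluate` stands in for the full convolution of one element, and both names are hypothetical.

```python
def pooled_output(window_elements, evaluate):
    """Evaluate the elements of one pooling window in order. Once an element
    returns 1 the entry is marked "skip": the output is 1 and the remaining
    elements are never issued. Returns (output_bit, elements_issued)."""
    issued = 0
    for element in window_elements:
        issued += 1
        if evaluate(element) == 1:
            return 1, issued       # remaining elements are pruned
    return 0, issued               # every element had to be evaluated

# Toy usage: elements are already-known bits, so evaluate is the identity.
print(pooled_output([0, 1, 0, 1], lambda bit: bit))
```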
Data Feeding System
DFS is designed to effectively feed a K ×K window for a curve-group from the input
feature map, which is typically stored sequentially in [H,W, IC] order, while the K ×K
window is from [H,W ]. For efficiency, a simple segmented line-buffer design is proposed
(see Figure 5·10): the entire input feature map flows along each segment of the line-buffer,
with the K vertical segments extracting the required rows of the K×K window and feeding
into the PE array when a new curve-group is requested.
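A software analogue of a line buffer may help clarify the data flow. The sketch below is a simplification that assumes a single-channel stream in row-major order and no padding; it does not model the segmented hardware structure of Figure 5·10, and the function name is hypothetical.

```python
from collections import deque

def kxk_windows(pixel_stream, width, k):
    """Yield K x K windows from a row-major pixel stream while buffering
    only the most recent K rows."""
    rows = deque(maxlen=k)          # the K buffered rows
    current = []
    for pixel in pixel_stream:
        current.append(pixel)
        if len(current) == width:   # one full row has arrived
            rows.append(current)
            current = []
            if len(rows) == k:      # enough rows to form windows
                for x in range(width - k + 1):
                    yield [row[x:x + k] for row in rows]

windows = list(kxk_windows(range(9), width=3, k=2))
print(windows[0], len(windows))
```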
Figure 5·10: Architecture of DFS
5.4.4 Design Extensions
In this section, we sketch extensions to O3BNN-R. First, we describe adding a relaxing
factor to the threshold and how this affects accuracy and performance. We then analyze
how applying different OoO capabilities affects performance.
Relaxing the Threshold
None of the pruning designs described so far lose accuracy. But if accuracy can be compro-
mised slightly, then more benefits can be gained. The idea here is to augment the threshold
with a relaxing factor δ ∈ [0,1].
For Condition 1, we relax T to a lower threshold, δ×T , so that it is triggered earlier and
the calculation of a neuron stops before the accumulation result surpasses T. By doing so
more operations are pruned, but accuracy may decrease: we now assume all curve-groups
which surpass δ× T eventually surpass T, which may not happen. For Condition 2, we
relax T to a higher threshold, (1+(1− δ))×T . Similar to Condition 1, pruning rates are
increased while accuracy may decrease. This trade-off between accuracy and pruning is
discussed in Section 5.6.1.
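The two pruning conditions and their relaxed forms can be sketched for a single neuron as follows. The sketch assumes that the neuron outputs 1 when the accumulation reaches T, that every curve contributes at most `bits_per_curve` to the accumulation, and that Condition 1 is checked before Condition 2; these details and all names are illustrative assumptions.

```python
def evaluate_neuron(curve_accs, threshold, bits_per_curve, delta=1.0):
    """Accumulate per-curve partial sums with threshold-based pruning.
    Returns (output_bit, curves_executed). delta=1.0 is lossless."""
    acc = 0
    remaining = len(curve_accs) * bits_per_curve   # best case still to come
    cond1_bound = delta * threshold                # relaxed (lower) bound
    cond2_bound = (1 + (1 - delta)) * threshold    # relaxed (higher) bound
    for issued, partial in enumerate(curve_accs, start=1):
        acc += partial
        remaining -= bits_per_curve
        if acc >= cond1_bound:                     # Condition 1: output is 1
            return 1, issued
        if acc + remaining < cond2_bound:          # Condition 2: output is 0
            return 0, issued
    return int(acc >= threshold), len(curve_accs)

# With delta=1 the result is exact; with delta=0.8 the same neuron is pruned
# two curves earlier, but its value flips from 0 to 1.
print(evaluate_neuron([9, 4, 0, 0], threshold=15, bits_per_curve=9))
print(evaluate_neuron([9, 4, 0, 0], threshold=15, bits_per_curve=9, delta=0.8))
```

The second call illustrates the source of accuracy loss: the accumulation passes δ×T but never reaches T.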
OoO capability
Because of in-order commitment, when the Scoreboard has only one column of entries
it can only track NOC curve-groups at the same time for the pixels with the same [W,H].
New curve-groups with new coordinates cannot be issued before the present curve-groups
are completely evaluated, which may limit OoO capability. If more hardware resources
are available for the Scoreboard, we can track multiple NOC curve-groups at the same time
(each per-column as shown in Figure 5·7(E)), by assigning different coordinates to different
columns of entries. This is a trade-off between hardware consumption and OoO capability:
a larger Scoreboard provides higher OoO capability.
Figure 5·11 shows how different OoO capabilities affect the execution of BNN infer-
ence. Each block refers to the execution of a curve whose OC and IC IDs are indicated by
the head of its row and column. To illustrate, we use a BNN layer with 8 ICs and 8 OCs.
Each cluster of blocks (8 × 8 blocks in this figure) represents all curves that need to be
evaluated to completely calculate features with the same [W,H]. The number on each block
shows the iteration during which the curve is calculated. Different block colors represent
execution at different PEs.
In Figure 5·11(A), the Scoreboard only has one column of entries, i.e., OoO capability
= 1. After 8 iterations, there are no pending curve-groups with [0,1] in the Scoreboard
waiting to be forwarded to PEs. PE1 becomes idle and could start to work on the curves of
[0,2] at the 9th iteration. However, as the Scoreboard with OoO capability = 1 can only
track NOC curve-groups with the same [W,H] at the same time, the next NOC curve-groups
cannot be forwarded to PEs until the current ones are completely done and flushed from the
Scoreboard. Therefore, synchronization is required between each cluster of blocks. In this
example, synchronization takes 4 iterations out of 12 iterations of execution. At iteration
13, the calculation of curve-groups with [0,2] starts.
Figure 5·11: Workload execution of O3BNN-R with different OoO capabilities. Four PEs are shown.
Figure 5·11(B) illustrates the execution of curve-groups of [0,1] and [0,2] with a Scoreboard
having OoO capability = 2. After iteration 8, the second column of entries in the
scoreboard starts to track the status of curve-group [0,2] immediately and without syn-
chronization. As shown in Figure 5·11(B), at iteration 9, idle PEs start to calculate the
curve-groups of [0,2], 4 iterations earlier than (A). For O3BNN-R with OoO capability =
2, synchronization is required in the following scenario: when all curve-groups of [0,2] are
already distributed to PEs by the Scoreboard and there are still curve-groups of [0,1] being
processed in PE array. To avoid this synchronization, a Scoreboard with OoO capability =
3 is needed. The third column can then be used to track the curve-groups of [0,3]. Overall,
the higher the OoO capability, the lower the probability that synchronization is needed. Based
on our experiments, a 2-column Scoreboard provides an optimal trade-off between performance
and hardware consumption. This is discussed further in Section 5.6.2.
5.5 Regularized Training
So far we have focused on BNN inference; in particular, we have assumed that the weights
for BNN inference are fixed. In this section we propose a training/inference co-design
approach that enhances the utility of both threshold-based and pooling-based edge pruning.
5.5.1 Regularization for Threshold-based Pruning
Threshold-based edge pruning is triggered under two conditions: (1) the accumulation has
already surpassed the threshold; (2) the accumulation cannot reach the threshold even if all
remaining partial results are 1s. It follows that there is a higher chance of pruning when
the accumulation stays away from the threshold, either larger (approaching max bound
K×K×NIC) or smaller (approaching min bound 0), as shown in Figure 5·12-(A). This
can be achieved by adding a regularization term to the loss function during training. This
term increases the loss value when the absolute difference between output and threshold
diminishes, and decreases the loss value when the difference expands. As the SGD-based
optimizer iteratively minimizes the loss function during training, we can obtain a BNN
network model having a higher chance of early threshold-based pruning while suffering
little accuracy loss. This is possible because of the redundancy in network parameters and
input images, and because of the robustness of SGD optimization.






where α is a scalar weighting factor which can be adjusted to find the optimal point in the
trade-off between pruning benefit and accuracy loss.
5.5.2 Regularization for Pooling-based Pruning
As discussed in Section 5.4.1, the features covered by the same pooling window are eval-
uated sequentially. Therefore, in the case that one of these features is determined to be 1,
all operations for evaluating the remaining features are pruned and the pooling output is set
as 1. The evaluation of features covered by a pooling window follows a particular order,
i.e., from top-left to bottom-right. For example, the order for a 2× 2 pooling window is
top-left→top-right→bottom-left→bottom-right. If during training we encourage the first
features in the evaluation chain to be 1, the pooling-based pruning will be triggered earlier
and pruning opportunities will increase (as shown in Figure 5·12-(B)). This is achieved by
adding another regularization term to the loss function.
We can encourage the 1 values to move forward in the evaluation chain by giving dif-
ferently weighted rewards to different features covered by a pooling window. For example,
for a 2×2 pooling window, the reward is (1) 1.0 for the top-left feature; (2) 0.6 for the
top-right one; (3) 0.3 for the bottom-left; and (4) 0 for the bottom-right. That is, the reward
is 1.0 if the top-left entry is 1, regardless of the values of the other entries; the reward is
0.6 if top-left is 0 and top-right is 1, regardless of the values of the remaining two features;
the reward is 0.3 when the first 2 features are both 0 and the bottom-left one is 1; finally,
there is no reward if only the bottom-right feature is 1 or if all
features are 0.
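The reward rule for a 2×2 window described above can be written as a small function. The sketch only restates the rule for already-binarized features; how the reward enters the differentiable loss is not modeled, and the names are hypothetical.

```python
# Rewards in evaluation order: top-left, top-right, bottom-left, bottom-right.
REWARDS_2X2 = (1.0, 0.6, 0.3, 0.0)

def pooling_reward(window_bits, rewards=REWARDS_2X2):
    """Return the reward of the first feature in the evaluation chain that
    equals 1; later features do not matter. No reward if all are 0."""
    for bit, reward in zip(window_bits, rewards):
        if bit == 1:
            return reward
    return 0.0

print(pooling_reward([0, 1, 1, 1]))  # top-left is 0, top-right is 1
```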






where L is the number of pooling layers, H×W is the size of the output feature maps of
the pooling layer, C is the number of channels, reward(xi) is the reward value
determined during the evaluation of a pooling window, and β is a scaling factor which can
be adjusted to find the best point in the trade-off between pruning benefit and accuracy
loss. Since the weights of BNN models are discrete, we adopt the ln function to make the
Figure 5·12: Regularization during Training
Table 5.1: Structures of the networks used to evaluate O3BNN-R. 512FC
refers to a fully-connected layer with 512 neurons. 2x128C3 refers to 2
convolution layers, each with 128 output channels and 3x3 filters. MP2 refers to a
2x2 max-pooling layer.

















regularization function smoother.
Overall, in the O3BNN-R training/inference co-design, we add two regularization terms
to the original loss function during training to introduce more opportunities for threshold-
based pruning and pooling-based pruning:
\[ loss = loss_{network} + R_{Threshold} + R_{Pooling} \]
Figure 5·13: Pruning rate vs Accuracy trade-off with different relaxing fac-
tors. When the relaxing factor is 1, pruning is lossless. Green lines and bars
are for original models; pink, orange and blue lines and bars are for models
trained with different combinations of regularization techniques.
5.6 Evaluation and Experimental Results
In this section, we evaluate the efficiency of O3BNN-R by showing two trade-offs and
comparing O3BNN-R with state-of-the-art FPGA, GPU, and CPU implementations. First,
we discuss the trade-off of pruning rates versus network accuracy by adjusting the relaxing
factor of thresholds, δ (Section 5.4.4.1). Second, we give the trade-off of hardware re-
source demand versus performance by adjusting the OoO capability of O3BNN-R (Section
5.4.4.2). Finally, using the most efficient OoO capability and, in the case where lossy
pruning is used, the optimal relaxing factors obtained from the trade-off analysis, the efficiency
of O3BNN-R is compared with state-of-the-art BNN implementations on FPGAs, GPUs, and CPUs.
We use PyTorch to build and train our BNN model. Each network is trained under 4
different configurations: without any regularization, only with pooling regularization, only
with threshold regularization, and with both.
The trained models are evaluated for profiling the ideal pruning rates and measuring
model accuracy, i.e., the inference accuracy on the testing set. Performance, hardware
demand, and energy efficiency of O3BNN-Rs are evaluated on an embedded-scale FPGA
development kit, Xilinx ZC706, which is one of the most widely used platforms in em-
bedded systems, robotic control, autonomous cars, and research prototyping [Blott et al.,
2018, Umuroglu et al., 2017]. In order to show the scalability of the proposed design,
all O3BNN-Rs used for evaluation are equipped with 512 PEs. The FPGA results are
compared with two Intel CPUs (Xeon-E5 2640 [Liang et al., 2018] and Xeon-Phi 7210
accelerator [Hu et al., 2018]), two NVIDIA GPUs (Tesla-V100 and GTX-1080 [Hu et al.,
2018]), and three FPGA designs: FINN [Umuroglu et al., 2017], ReBNet [Ghasemzadeh
et al., 2018], and FP-BNN [Liang et al., 2018]. As for network models and datasets, we use
the well-known AlexNet and VGG-16 for ImageNet and a widely used VGG-like model
( [Hu et al., 2018, Blott et al., 2018, Umuroglu et al., 2017, Liang et al., 2018]) for Cifar-
10. The network structures are listed in Table 5.1. Since FINN adjusts the structure of
the VGG-like network, to make a fair comparison we use the same network as FINN (i.e.,
VGG-Like-FINN) as listed in the last row of Table 5.1.
5.6.1 Ideal Pruning Rate vs Network Accuracy
Recall there are three types of edge pruning in this work: Condition 1 and Condition 2 of
threshold-based edge pruning and Pooling Pruning. Figure 5·13 shows the overall pruning
rates of networks with and without regularization in training with a breakdown of the three
types of pruning and network accuracy using different relaxing factors. For each network,
5,000 randomly selected images are used to profile the average pruning rates.
For lossless pruning with the non-regularized models (green bars and lines), i.e., δ = 1,
the top-1 accuracy of VGG-Like is 88.5% and its pruning rate is 27%. For AlexNet and
VGG-16 the top-5 accuracies are 72.7% and 74.3%, respectively, while the pruning rates
are 19% and 42%. When the relaxing factor is decreased, (1) the pruning rates increase














































































































Figure 5·14: Pruning rates at different layers of the non-regularized models
with different relaxing factors. When the relaxing factor is 1, threshold relaxing
is disabled and pruning is lossless. conv-l-p refers to the l-th CONV layer
followed by max-pooling. fc-l refers to the l-th fully connected layer.
almost linearly, especially for VGG-16 and VGG-Like; (2) for all networks observed so
far, the accuracy remains nearly unchanged as the relaxing factor is decreased until it reaches
a network-specific value, below which the accuracy suddenly decreases substantially (by many
times the change before that point). We refer to this value as the inflection point.
The reasons for this observation are as follows. (1) When the relaxing factor is larger
than the inflection point, the relaxed threshold causes more curves to be pruned (for
each neuron, compared with lossless pruning), but the threshold is not relaxed enough to
change the values of neurons. Hence the network accuracy is not affected significantly. The
outlier is AlexNet with large relaxing factors (δ > 0.9). When the relaxing factor used in
AlexNet decreases from 1 to 0.9, neurons start to flip from 0 to 1, leading to increased
occurrence of Condition 1 and pooling pruning, but without hurting the accuracy; this does
not happen in the other two networks. This difference comes from the old-fashioned 3×3
max-pooling filter and 11×11 CONV filter used in AlexNet. (2) When the relaxing factor
is smaller than the inflection point, the neurons’ values start to flip, incurring
errors.
The relaxing factors at the inflection points of VGG-Like, AlexNet, and VGG-16 are
0.7, 0.85, and 0.9, respectively. The pruning rates of these three networks at their corre-
sponding inflection points increase to 49%, 46%, and 48%, with only 3.3%, 0.9%, and 2.9%
loss of accuracy, respectively. The networks for ImageNet are more sensitive to lowering
the relaxing factors. The reason is that ImageNet has 1000 classification categories, while
Cifar-10 only has 10. The complexity of the classification task affects the vulnerability of
networks, i.e., the networks’ sensitivity to threshold relaxing. VGG-16 is more sensitive
than AlexNet. A possible reason is that the pruning rate of VGG-16 without threshold
relaxing is already close to that at the inflection point of AlexNet. We also measure the
variance in pruning rates among all test images. Error bars are shown at the tops of the
pruning rate bars. It is observed that for different images the pruning rates are stable.
The red, yellow, and blue bars and lines in the figure show the pruning rates and ac-
curacy for the BNN models trained with only the threshold regularization term, only the
pooling regularization term, and both. For lossless pruning, the three regularization ap-
proaches improve the pruning rates to 32%, 30%, and 50%, respectively, with accuracy
loss of only 0.3%, 0.6%, and 0.6%, respectively. The two regularization terms are mostly
independent; this is because they operate on different components of the BNN network.
As the relaxing factor is decreased further (i.e., the threshold is relaxed more), the pruning
rate increases linearly, but more slowly than for the original model without regularization.
We therefore observe that regularization improves pruning rates more at a relatively larger relaxing factor than at a
smaller one. By combining regularization and the relaxing methods, we observe an inflec-



















































































Figure 5·15: Performance and hardware consumption of O3BNN-Rs with
different OoO capabilities and with lossless (without threshold relaxing) or
lossy (with threshold relaxing) pruning for the non-regularized models. 512-
PE O3BNN-Rs are compared to 512-PE baseline without pruning and ideal
design with theoretically perfect pruning. The relaxing factors for lossy
pruning used in VGG-Like, AlexNet and VGG-16 are 0.7, 0.85, and 0.9.
regularization only incurs a very small accuracy loss. However, when the relaxing factor is
smaller than the inflection point (more relaxed), regularization leads to considerable accuracy
loss. The inflection points of VGG-Like, AlexNet, and VGG-16 are 0.7, 0.85, and 0.9,
respectively; the corresponding pruning rates at these inflection points are 52%, 49%, and
53%.
To isolate the effect of the relaxing factor, all layers share the same relaxing factor in
these experiments. In practice, each layer can use a different relaxing factor. For the first
two CONV layers and the last three FC layers, the relaxing factor should stay relatively large
(closer to 1) because errors in these layers affect the final classification result more
seriously; the CONV layers in the middle can use smaller relaxing factors. By doing so,
O3BNN-R can obtain higher pruning rates with less accuracy loss.
Figure 5·14 is similar to Figure 5·13 but shows the pruning rates of the three types of
edge pruning at each layer for the models trained without regularization. It is observed
that, for all networks, pooling pruning is the most significant pruning type. Condition 2 is
triggered much more frequently than Condition 1 at most of the layers, especially when the
relaxing factor is close to 1. It is also observed that the pruning rates of the FC layers are
Table 5.2: Latency, hardware demand, and accuracy of baseline and
O3BNN-Rs with different OoO capabilities (1, 2, 3) and with lossless or
lossy pruning for non-regularized models. Both the baseline design and
O3BNN-R implementations are equipped with 512 PEs. For lossy pruning




                                            VGG-Like               AlexNet                VGG-16
OoO capability                            1      2      3       1      2      3       1      2      3
Without threshold relaxing
  Latency (µs)                          619    609    608     793    774    772    5779   5626   5603
  Hardware demand (CLBs)              22264  22607  22954   23357  23736  24102   23466  23876  24285
  Accuracy                                   88.5%                  72.7%                  74.3%
With relaxing factor at inflection points
  Latency (µs)                          430    419    418     565    545    541    5186   5080   5059
  Hardware demand (CLBs)              22264  22607  22954   23357  23736  24102   23466  23876  24285
  Accuracy                                   85.2%                  71.8%                  71.4%
Baseline
  Latency (µs)                                809                    899                   9251
  Hardware demand (CLBs)                     21930                  23005                  23056
  Accuracy                                   88.5%                  72.7%                  74.3%
very low when the relaxing factor is close to 1; however, those rates increase much more
rapidly than the ones at the CONV layers when the relaxing factor decreases.
5.6.2 Hardware Demand versus Performance
By pruning the BNN network dynamically, O3BNN-R is expected to provide better perfor-
mance than a traditional accelerator with no pruning. To evaluate the efficiency of O3BNN-
R, we take the classic BNN inference implementation (described in Section 5.3.1) as the
baseline and compare it with three O3BNN-R designs with different OoO capabilities. In
the baseline design, Loops 1, 2, and 3 (in Figure 5·3) are processed sequentially, while
Loops 4, 5, 6, and 7 are processed in parallel. The architecture of our baseline design is
traditional and similar to the ones used in [Liang et al., 2018, Nurvitadhi et al., 2016]. At
each clock cycle, each PE calculates the value of one curve. Assuming there are NOC×NIC
Table 5.3: Cross-platform evaluation of Latency, Energy Efficiency, and
Accuracy: VGG-like-FINN [Blott et al., 2018] for Cifar-10. O3BNN refers
to the design without regularization techniques; O3BNN-R refers to the de-
sign with both regularization techniques.




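Design            [Blott et al., 2018]   O3BNN        O3BNN (lossy)   O3BNN-R      O3BNN-R (lossy)
Platform          FPGA ZC706             FPGA ZC706   FPGA ZC706      FPGA ZC706   FPGA ZC706
Frequency (MHz)   200                    200          200             200          200
Latency (µs)      283                    167          116             153          107
EE (Img/kJ)       3.9E5                  6.65E5       9.58E5          7.16E5       10.38E5
Accuracy          80.1%                  82.6%        79.3%           82.1%        78.5%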
PEs, at each cycle NOC neurons with the same coordinate are completely evaluated. After
Width×Height cycles, a layer is processed completely and processing begins on the next
layer.
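The baseline timing described above can be expressed as a simple cycle count. The case with NOC×NIC PEs follows the text; the extension to fewer PEs by time-multiplexing is an assumption added for illustration, as is the function name.

```python
import math

def baseline_layer_cycles(width, height, n_oc, n_ic, num_pes):
    """Cycles for one layer in the in-order baseline, where each PE computes
    one curve per cycle. With n_oc * n_ic PEs, all curves of one output
    coordinate finish in a single cycle, giving width * height cycles.
    With fewer PEs we assume the curves are time-multiplexed."""
    passes = math.ceil(n_oc * n_ic / num_pes)
    return width * height * passes

# Hypothetical 32x32 output feature map with IC=8 and OC=8.
print(baseline_layer_cycles(32, 32, n_oc=8, n_ic=8, num_pes=64))
print(baseline_layer_cycles(32, 32, n_oc=8, n_ic=8, num_pes=16))
```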
This baseline design is standard and widely used in the DNN literature. For a fair
comparison of hardware consumption and performance, the baseline design is equipped with
the same number of PEs as O3BNN-R. Compared with the O3BNN-R architecture, the
baseline design has a similar DFS and simpler PEs: they do not have logic to support
pruning and OoO processing, e.g., the circular communication network and pending buffer for
horizontal rotation, comparators for redundancy check, and control logic for OoO scheduling
and edge pruning. Also, there is no Scoreboard in the baseline design, which uses static
in-order scheduling.
The O3BNN-Rs are also compared with respect to ideal performance, i.e., the perfor-
mance of an ideal system that is able to exploit all pruning opportunities profiled in Section
5.6.1, and without any bubbles in the pipeline incurred by dynamic scheduling. The per-
formance differences between the ideal performance and the O3BNN-Rs indicate the OoO
processing efficiency of the proposed architecture. For each O3BNN-R design, two perfor-
mance values are given: one for lossless pruning and the other one for lossy pruning using
the relaxing factor at the inflection point in Figure 5·15.
In Figure 5·15, the blue and orange lines indicate the latencies using lossless and lossy
Table 5.4: Cross-platform evaluation of Latency, Energy Efficiency, and
Accuracy: VGG-like [Courbariaux et al., 2015] for Cifar-10. O3BNN refers
to the design without regularization techniques; O3BNN-R refers to the de-
sign with both regularization techniques.




Design            [Liang et al., 2018]   [Li et al., 2019a]   O3BNN        O3BNN (lossy)   O3BNN-R      O3BNN-R (lossy)
Platform          CPU Xeon-E5 2640       GPU V100             FPGA ZC706   FPGA ZC706      FPGA ZC706   FPGA ZC706
Freq (MHz)        2.5K                   1.37K                200          200             200          200
Latency (µs)      1.36E6                 994                  609          419             563          388
Energy (Img/kJ)   7.79                   5543                 1.82E5       2.65E5          1.99E5       2.83E5
Accuracy          86.31%                 89.9%                88.5%        85.2%           88.2%        83.9%
Table 5.5: Cross-platform evaluation of Latency, Energy Efficiency, and
Accuracy: AlexNet [Krizhevsky et al., 2012] for ImageNet. O3BNN
refers to the design without regularization techniques; O3BNN-R refers
to the design with both regularization techniques. Our results are com-
pared with 3 existing works with GPU V100 [Li et al., 2019a], FPGA
VCU108 [Ghasemzadeh et al., 2018], and FPGA Stratix-V [Liang et al.,
2018].










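Design            [Li et al., 2019a]   [Ghasemzadeh et al., 2018]   [Liang et al., 2018]   O3BNN        O3BNN (lossy)   O3BNN-R      O3BNN-R (lossy)
Platform          GPU V100             FPGA VCU108                  FPGA Stratix-V         FPGA ZC706   FPGA ZC706      FPGA ZC706   FPGA ZC706
Freq (MHz)        1.37K                200                          150                    200          200             200          200
Latency (µs)      2226                 1920                         1160                   774          545             661          510
Energy (Img/kJ)   2475                 2.7E4                        3.3E4                  1.44E5       2.04E5          1.67E5       2.13E5
Accuracy          71.2%                N/A                          66.8%                  72.7%        71.8%           72.1%        70.6%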
pruning, respectively. We use non-regularized models as examples to evaluate the pruning
efficiency of the O3BNN-R architectures. As for the regularized models, their hardware
resource demands are the same as those of the original models, and their performance
is listed in Tables 5.3-5.6. Without any pruning, the inference latencies of VGG-Like,
AlexNet, and VGG-16 are 809, 899, and 9251 µs, respectively. The hardware consumption
is 21930, 23005, and 23056 Configurable Logic Blocks (CLBs). Using an O3BNN-R design
whose OoO capability is 1, i.e., the Scoreboard can track the status of 1×NOC curve-groups
at a time, the inference latencies of these three networks are decreased to 619, 793, and
5779 µs when using lossless pruning (δ = 1), and 430, 565, and 5186 µs when relaxing
factors at the inflection points are used. The hardware overheads are only 1.5%, 1.5%, and
Table 5.6: Cross-platform evaluation of Latency, Energy Efficiency, and
Accuracy: VGGNet-16 [Simonyan and Zisserman, 2015] for ImageNet.
O3BNN refers to the design without regularization techniques; O3BNN-R
refers to the design with both regularization techniques.





Design            [Hu et al., 2018]   [Hu et al., 2018]   O3BNN        O3BNN (lossy)   O3BNN-R      O3BNN-R (lossy)
Platform          CPU Xeon-Phi 7210   GPU GTX1080         FPGA ZC706   FPGA ZC706      FPGA ZC706   FPGA ZC706
Frequency (MHz)   1.3K                1.61K               200          200             200          200
Latency (µs)      1.18E4              1.29E4              5626         5080            4877         4603
Energy (Img/kJ)   395                 433                 1.97E4       2.19E4          2.23E4       2.37E4
Accuracy          76.8%               76.8%               74.3%        71.4%           73.7%        70.1%
1.8% compared with the baseline design. The performance of lossless and lossy O3BNN-Rs
whose OoO capability is 1 is, on average, only 4.4% and 6.5% lower than the ideal,
respectively. The gap between ideal performance and the performance of lossy O3BNN-R
is larger than the gap between ideal and lossless O3BNN-R. The reason is that the pruning
rates using lossy pruning are much larger than the ones using lossless pruning, requiring
more OoO capability.
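The overhead figures quoted above can be reproduced from the baseline and OoO-capability-1 numbers reported in this section. The short script below is only a convenience check; the latency reductions relative to the baseline are derived here from the same numbers and are not stated in the text.

```python
# Values reported in this section (baseline vs. O3BNN-R with OoO capability 1,
# lossless pruning).
baseline_clbs = {"VGG-Like": 21930, "AlexNet": 23005, "VGG-16": 23056}
o3bnn_clbs = {"VGG-Like": 22264, "AlexNet": 23357, "VGG-16": 23466}
baseline_latency_us = {"VGG-Like": 809, "AlexNet": 899, "VGG-16": 9251}
o3bnn_latency_us = {"VGG-Like": 619, "AlexNet": 793, "VGG-16": 5779}

def percent_change(new, old):
    """Relative change of new with respect to old, in percent."""
    return 100.0 * (new - old) / old

hw_overhead = {n: round(percent_change(o3bnn_clbs[n], baseline_clbs[n]), 1)
               for n in baseline_clbs}
latency_cut = {n: round(-percent_change(o3bnn_latency_us[n], baseline_latency_us[n]), 1)
               for n in baseline_latency_us}
print(hw_overhead)
print(latency_cut)
```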
When the OoO capability of O3BNN-R increases from 1 to 2, for lossless pruning, the
latencies are reduced by 1.5%, 2.5%, and 2.6%, respectively for the three networks, and
reach 609, 774, and 5626 µs. The hardware demand is slightly increased, i.e., by 1.5%,
1.5%, and 1.7%. For lossy pruning with the relaxing factors at the inflection points, the
latencies are reduced by 2.6%, 3.7%, and 2.1% and reach 419, 545, and 5080 µs. The
performance of O3BNN-R with OoO capability of 2 is, on average, only 5% lower than
the ideal. When the OoO capability is increased from 2 to 3, there is almost no further
performance improvement, but the hardware demand increases on average by 1.6%.
The latency, hardware demand, and accuracy of baseline and O3BNN-R-based BNN
implementations are summarized in Table 5.2. As stated in Section 5.4.4, a larger Score-
board can track the processing status of more curve-groups. The more unbalanced the
pruning timing of different curve-groups, the larger the Scoreboard that is needed to avoid
pipeline bubbles caused by a fully occupied Scoreboard. According to the experimental
results, support for OoO processing of 2×NOC curve-groups is already sufficient to
absorb the imbalance in edge pruning timing.
In addition to the pipeline bubbles incurred by dynamic scheduling, another potential
reason for the performance gap between the theoretical shortest latency (i.e., ideal latency)
and the actual measured latency of O3BNN-R is the difference of the pruning granularity.
The pruning granularity in the profiling of Section 5.6.1 is per-edge, which is finer than the
pruning granularity of the O3BNN-R implementation, which is by curve or K×K edges.
Thus, for each neuron, the real implementation may compute at most K×K − 1 more
edges than the profile, leading to at most 1/NIC extra overhead. Considering that NICs are
usually larger than 100, this overhead is very small. We take this granularity overhead
into consideration when evaluating O3BNN-Rs, but because it is so small it is not
specifically marked in Figure 5·15.
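The 1/NIC bound can be checked with a short calculation. This is only a sketch: it assumes the overhead is measured relative to the K×K×NIC edges of a full, unpruned neuron, and the values K = 3 and NIC = 128 are illustrative rather than taken from a specific layer.

```latex
\[
\frac{\text{extra edges}}{\text{edges per neuron}}
\;\le\; \frac{K \times K - 1}{K \times K \times N_{IC}}
\;<\; \frac{1}{N_{IC}},
\qquad \text{e.g. } K = 3,\ N_{IC} = 128:\quad \frac{8}{1152} \approx 0.69\%.
\]
```

Measured against the pruned workload instead, the ratio would be somewhat larger, but it stays small as long as NIC is in the hundreds.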
5.6.3 Cross-platform Evaluation
In Tables 5.3-5.6, O3BNN-R’s performance, energy efficiency, and accuracy (with lossless
and lossy pruning) are compared with existing and self-implemented systems that use various
CPUs, GPUs, and FPGAs to accelerate BNN inference of VGG-like, VGG-like-FINN,
AlexNet, and VGG-16. Performance is evaluated as the latency of single-image
inference. Energy efficiency is evaluated as image inferences per kilojoule
(Image/kJ). Based on the trade-off analysis of Sections 5.6.1 and 5.6.2, the OoO capability of
O3BNN-Rs is set to 2. For lossy pruning where threshold relaxing is enabled, the relaxing
factors, δ, applied in the inference of VGG-Like, VGG-Like-FINN, AlexNet and VGG-
16 are 0.7, 0.7, 0.85, 0.9, respectively; i.e., the δ at inflection points of accuracy lines in
Figure 5·13. For lossless pruning, threshold relaxing is not enabled (δ = 1).
We compare O3BNN-Rs with some recently published well-known FPGA-based
BNNs. Compared with FINN [Blott et al., 2018, Umuroglu et al., 2017], using the same
network model (i.e., VGG-Like-FINN as shown in the last row of Table 5.1), training strat-
egy (SGD without the proposed regularizations), and FPGA board (i.e., ZC706), lossless
and lossy O3BNN demonstrate 167 µs and 116 µs single-image inference latency, corre-
sponding to speedups of 1.69× and 2.44×. For accuracy, the lossless approach is 2.5%
better than FINN, while the lossy approach is only 0.8% lower than FINN.
AlexNet results of O3BNN-R are compared to FP-BNN [Liang et al., 2018] and
ReBNet [Ghasemzadeh et al., 2018]. The latencies of AlexNet inference accelerated by
O3BNN-R with lossless and lossy pruning are 774 and 545µs. Compared to FP-BNN,
lossless and lossy O3BNN-Rs are 1.50× and 2.13× faster. Compared to ReBNet, lossless
and lossy O3BNN-Rs are 2.48× and 3.52× faster. Note that the FPGA boards used in FP-BNN
(Stratix-V) and ReBNet (VCU108) are high-performance FPGAs with around 4× and 2.5× the
hardware resources of the embedded FPGA (ZC706) used in our evaluation. With regard to
energy efficiency, lossless and lossy O3BNN-Rs are 4.4× and 6.2× better than FP-BNN and
5.3× and 7.6× better than ReBNet. The accuracies of our lossless and lossy AlexNet
implementations are both higher than the one reported in FP-BNN. The accuracy of AlexNet
is not reported in ReBNet.
We also measure the inference latency of VGG-16 for ImageNet, which is 5.6ms and
5.1ms for a single image using lossless and lossy O3BNN-Rs, respectively. The accuracy
is 74.3% and 71.4%. Compared with CPUs and GPUs, the latencies for lossless and lossy
O3BNN-R are 47.7% and 43.1% of the latency for Xeon Phi 7210 [Hu et al., 2018], and
43.6% and 39.4% of the latency for GTX1080 [Hu et al., 2018], with similar accuracy. As
the FPGA board in this work is for embedded applications, the energy advantage is even
more prominent. The energy efficiency of O3BNN-Rs is 50× and 55× better than that
of the Xeon-Phi 7210 and 45× and 51× better than the GTX-1080 for lossless and lossy
pruning designs. The comparisons of other networks and with other CPUs and GPUs are
listed in Table 5.4 and Table 5.5.
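The VGG-16 ratios quoted above follow directly from the latency and energy rows of Table 5.6. The short script below reproduces them as a sanity check; the dictionary keys are illustrative names, and the O3BNN entries are assumed to be the table columns for the designs without regularization.

```python
# Values copied from Table 5.6 (VGG-16 on ImageNet).
# Key names are illustrative; "o3bnn_*" are the designs without regularization.
latency_us = {
    "xeon_phi_7210": 1.18e4,
    "gtx1080": 1.29e4,
    "o3bnn_lossless": 5626,
    "o3bnn_lossy": 5080,
}
energy_img_per_kj = {
    "xeon_phi_7210": 395,
    "gtx1080": 433,
    "o3bnn_lossless": 1.97e4,
    "o3bnn_lossy": 2.19e4,
}


def latency_fraction(design, baseline):
    """Latency of `design` as a percentage of the baseline latency."""
    return 100.0 * latency_us[design] / latency_us[baseline]


def energy_gain(design, baseline):
    """Ratio of energy efficiency (Img/kJ) of `design` over the baseline."""
    return energy_img_per_kj[design] / energy_img_per_kj[baseline]


if __name__ == "__main__":
    for base in ("xeon_phi_7210", "gtx1080"):
        for design in ("o3bnn_lossless", "o3bnn_lossy"):
            print(f"{design} vs {base}: "
                  f"{latency_fraction(design, base):.1f}% of latency, "
                  f"{energy_gain(design, base):.0f}x energy efficiency")
```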
By applying the proposed regularization techniques, the pruning efficiency of O3BNN-
R is further improved with almost negligible loss in accuracy. Compared with existing
work, regularized lossless and lossy O3BNN-R achieve 1.85× and 2.64× speedups over
FINN, 1.75× and 2.27× over FP-BNN [Liang et al., 2018], and 2.90× and 3.76× over
ReBNet [Ghasemzadeh et al., 2018]. Compared with the models without regularization,
the improvements are, on average, 15% for lossless pruning and 8% for lossy pruning.
5.7 Related Work
BNNs have been implemented on a variety of platforms [Nurvitadhi et al., 2016, Umuroglu
et al., 2017, Zhao et al., 2017b, Liang et al., 2018, Hu et al., 2018, Ghasemzadeh et al.,
2018, Geng et al., 2019b, Geng et al., 2019a, Li et al., 2019a]. Because of the flexibility and
direct bit-manipulation capability of FPGAs [Geng et al., 2018, George et al., 2016], most BNN
implementations are FPGA-based [Nurvitadhi et al., 2016, Umuroglu et al., 2017, Zhao
et al., 2017b, Liang et al., 2018, Ghasemzadeh et al., 2018]. We have already discussed
FINN [Umuroglu et al., 2017] in Section 5.2. Zhao et al. proposed the first
high-level-synthesis-based BNN implementation on FPGAs [Zhao et al., 2017b]. Liang et al.
proposed an FPGA-based BNN accelerator that drastically cuts down hardware consumption
by using resource-aware model analysis [Liang et al., 2018]. Recently, a CPU-based
BNN design was proposed [Hu et al., 2018] that relies on bit-packing and AVX/SSE vector
instructions to achieve good bit-processing performance. All of these are static designs, and
none takes advantage of the pruning opportunities of BNNs.
With regard to the pruning of BNNs, multiple studies have described BNN edge and
neuron pruning. We have already discussed the neuron pruning work [Fujii et al., 2018]
in Section 5.2. Khoram and Li proposed a new training method for BNNs in which
bit-level accuracy sensitivity analysis is conducted after initial training [Khoram and Li, 2018].
The channels with low accuracy sensitivity are then pruned. These pruning methods are all
performed offline and before inference. During inference, the designs are entirely static.
Also, the network accuracy is often compromised due to the pruning of neurons or edges.
Our method, in contrast to the static and offline pruning approaches, is dynamic: pruning
is performed online during inference at run-time. Without a relaxing factor, this method can
prune a large number of edges without affecting the accuracy of the networks.
Compared with the studies published on CNN pruning [Yang et al., 2017b, Molchanov
et al., 2016, He et al., 2017], the design here has three distinguishing aspects: (1) run-time
dynamic pruning of post-training network models; (2) no loss of accuracy, with no
fine-tuning during training and no need to retrain models; and (3) a 2D-rotative OoO
architecture to handle the irregular parallelism that results from run-time dynamic pruning.
5.8 Discussion
Generality of O3BNN-R: O3BNN-R is generally useful for any Quantized Neural Net-
works (QNNs) but is especially efficient when the QNN’s feature data-width is ≤ 4-bit.
For a QNN layer with q-bit features, each output channel has at least 2^q − 1 thresholds.
To check the triggering condition of threshold-based pruning, partial accumulation results
need to be compared to q thresholds. Compared with BNNs with 1-bit features, each PE of
O3BNN-R needs to perform q−1 extra comparisons per cycle, leading to increased usage
of computation logic. Furthermore, more thresholds need to be stored, leading to higher
on-chip memory demand. Based on our experiments, this hardware cost overhead is negli-
gible for QNNs with ≤ 4-bit features so O3BNN-Rs can provide significant speedups. For
QNNs with wider data-widths, O3BNN-R needs to be optimized to reduce the hardware
resource overhead; this is future work.
Could this approach work for GPUs? First, most GPUs still follow the SIMT warp- or
wavefront-based execution model and thus cannot dynamically switch-in/out tasks at per-
lane granularity. Second, the out-of-order capability enabled in the O3BNN-R design relies
on the high flexibility of the architecture, while the execution of GPU threads is in-order. It
may therefore be difficult for existing GPUs to effectively leverage dynamic pruning with
its random occurrences.
5.9 Conclusion
We propose O3BNN-R, an OoO high-performance BNN inference architecture with
fine-grained and dynamic pruning. The contributions of O3BNN-R are two-fold. On the
algorithm side, O3BNN-R demonstrates that the highly condensed BNN model can be shrunk
significantly further without loss of accuracy by dynamically pruning irregular redundant
edges at all CONV, FC, and POOLING layers. On the architecture side, O3BNN-R is an
out-of-order architecture which (1) checks the redundancy of operations at run-time and in a
fine-grained manner; (2) avoids these redundant operations and ceases the evaluation of a
neuron once its binary output can be determined early; and (3) schedules the evaluation
workload of neurons to hardware with a 2D-rotative OoO scheduling methodology with almost
perfect utilization. Furthermore, to enhance the pruning rate, we propose two regularization
techniques that steer model training towards more pruning opportunities in our O3BNN-R
architecture. We have evaluated our design on an FPGA platform using VGG-16 and AlexNet
for ImageNet and a VGG-Like network for Cifar-10. Results show that our out-of-order
approach can prune 27%, 19%, and 42% of the redundant operations for the three networks,
respectively, without any accuracy loss, leading to at least 1.7×, 1.5×, and 2.1× inference
speedup over state-of-the-art FPGA/GPU/CPU BNN implementations. With only 3.3%, 0.9%,
and 2.9% accuracy loss, the pruning rate increases to 49%, 43%, and 48%, respectively, with
at least 2.4×, 2.1×, and 2.3× speedup. The results also show that the combination of our
2D-rotative dynamic scheduling technique and the novel OoO architecture can handle
operation-level irregularity efficiently. Although O3BNN-R is designed for BNNs, the
proposed techniques are generally useful for handling operation-level irregularity in any
type of NN.
Chapter 6
CQNN: a CGRA-based architecture for
Mixed-precision DNNs
This chapter introduces the architecture of CQNN (CGRA-based QNN accelerator) which
supports the acceleration of mixed-precision DNNs. CQNN is based on CGRA architecture
and is equipped with various binary function units which can be freely integrated to support
QNN functions at runtime. By so doing, CQNN is able to handle the bit-level irregularity
well. This chapter is based on the work presented in 2020 IEEE High Performance Extreme
Computing Conference (HPEC) ©2020 IEEE [Geng et al., 2020].
6.1 Introduction
DNNs have the potential to be adopted in the real world due to their ability to analyze latent
information from both structured [Ji et al., 2013, Karpathy et al., 2014, Geng et al., 2020]
and unstructured data [Geng et al., 2020] and to achieve high accuracy through learning
[Geng et al., 2018, Wang et al., 2020]. However, many real-world applications, especially
for IoT and smart-edge applications, are highly cost-, power-, and latency-restricted.
DNN applications are both computation- and communication-intensive, making it challenging
to deploy them within these restrictions. One approach is to squeeze the model by
using fewer bits to represent features and parameters. The extreme case is BNNs, which
use only a single bit to represent a feature or parameter [Rastegari et al., 2016]. BNNs
have been demonstrated to have great potential in cost- and power-restricted domains, but
have not been widely adopted in real-world applications due to their significant loss in ac-
curacy. A promising compromise, called QNNs, is to use mixed precision, e.g., from 1-bit
to 8-bits, and to vary precision across layers and channels. This is done in such a way
so as to find the optimal balance between performance and accuracy [Micikevicius et al.,
2017,Hubara et al., 2017]. QNNs have been found to dramatically reduce computation and
communication requirements with negligible loss in accuracy.
To meet requirements of different applications, models are trained with various hyper-
parameters including the number of layers, number of channels per layer, and number of
bits used at each layer and channel. It is challenging to design an accelerator that can
work efficiently with any combination of these hyper-parameters, especially when various
numbers of bits are used. Most existing architectures are designed for networks with specific
configurations. When the model configurations change, the architecture needs to be re-
implemented offline. This kind of accelerator guarantees high efficiency, but has poor
flexibility. Some other accelerators are designed in more general ways, e.g., to support
programming by users for different QNNs. These designs provide good flexibility, but
often lose efficiency due to their generalized architectures.
In this chapter, we propose a novel CGRA-based QNN acceleration framework, CQNN.
Taking advantage of the CGRA architecture, CQNN provides both high performance and
good flexibility in the processing of mixed-precision QNNs. By programming CQNN at
runtime, the architectures of the processing elements of CQNN can be dynamically recon-
figured as the best match to the QNN models under processing. The proposed framework
includes (1) a compiler which generates the instructions to reconfigure the CGRA Network
on Chip (NOC) at runtime, (2) a binary-component-based CGRA architecture which can be
configured to support QNN functions with different hyper-parameters, (3) a cycle-accurate
simulator for quick performance evaluation, and (4) an RTL generator for fast implemen-
tation. CQNN supports mixed-precision QNNs. In CQNN, all basic units are designed
to support binary operations. Multiple binary units are integrated at runtime to support
the processing of QNN operations with various data-widths with negligible hardware over-
heads.
The contributions of this work are summarized as follows:
• We propose CQNN, a CGRA-based QNN inference acceleration framework. The
proposed framework is efficient for QNNs with any model configuration.
• We propose a binary-based CGRA architecture that can be dynamically reconfigured
at runtime to support the kernel execution with various data widths efficiently.
• We evaluate our design on a Xilinx UltraScale+ VCU118 FPGA development board.
With the proposed design, the inference of AlexNet and VGG-16 can be completed
within 0.13ms and 2.63ms, respectively.
The organization of this chapter is as follows. In Section 6.2, the related works about
mixed-precision QNN acceleration are introduced. In Section 6.3, we discuss how QNN
modules in CQNN are realized by integrating BNN components. In Section 6.4, CQNN
architecture, its NOC configuration and compilation support are presented. In Section 6.5,
experimental results are given and discussed. Finally, we conclude and suggest further
work in Section 6.6.
6.2 Related Works
Quantization of DNNs has been well studied. Besides the already mentioned XNOR-Net
[Rastegari et al., 2016] and BWN [Courbariaux et al., 2015], Zhou et al. proposed the
DoReFa network which clips activations to improve the utilization of quantization levels
[Zhou et al., 2016]. The top-1 accuracy loss of AlexNet is only 6% using 1 bit and 2 bits to
represent parameters and features respectively. Miyashita et al. proposed a logarithm-based
quantization and demonstrated that using 4-bit parameters and 5-bit features only incurs
1.7% loss in top-5 accuracy for AlexNet [Miyashita et al., 2016]. Park et al. proposed a
more advanced quantization method demonstrating that ResNet-101 with 5-bit parameters
and 6-bit activations has comparable accuracy to the full-precision network [Park et al.,
2018b].
There has also been work on accelerators for QNNs. Wang et al. proposed a hardware-
aware automated quantization framework, HAQ [Wang et al., 2019a]. Park et al. pro-
posed an energy-efficient QNN accelerator based on outlier-aware low-precision compu-
tation [Park et al., 2018a]. Umuroglu et al. proposed a flexible heterogeneous streaming
architecture for a fast, scalable, and flexible FPGA accelerator for BNNs [Umuroglu et al.,
2017]. Geng et al. proposed an out-of-order architecture, O3BNN, which prunes redundant
operations at runtime during inference [Geng et al., 2019a, Geng et al., 2021]. These de-
signs provide high performance, but require re-implementation to efficiently support QNNs
with various hyper-parameters. Another study that has ideas related to CQNN is Bit Fu-
sion. It does not focus on QNNs, however, and the proposed techniques are not applied to
pooling, activation, or BN kernels.
6.3 BNN Module Integration for QNN
In this section, we discuss how to build QNN modules with multiple binary components.
Three QNN modules are introduced: Quantized CONV (Q-CONV), QT-BN, and Quan-
tized POOLING (Q-POOL).
6.3.1 Q-CONV
The Q-CONV modules perform the QMUL and QACC functions. At each cycle, a Q-
CONV module performs K ×K ×NIC multiplication operations and accumulates their
results in a pipelined manner. The accumulation result is then forwarded to a QT-BN
module for quantization and then a complete output feature is calculated. K is the CONV
window size and NIC the number of input channels.
In CQNN, each Q-CONV module is composed of multiple bit-level CONV (BCONV)
components. Each BCONV is in charge of conducting binary CONV operations of 32 in-
put channels and their POPCOUNTs. At each cycle, one BCONV component performs
K×K× 32 1-bit×1-bit multiplications (i.e. bit-and) and their accumulations (i.e. POP-
COUNT). Figure 6·1 illustrates the design of a Q-CONV module for a 2-bit×3-bit QNN
layer with NIC = 64 and K = 3. This Q-CONV module is made up of 12 BCONV compo-
nents.
To calculate an output feature, 576 2-bit×3-bit multiplications need to be performed.
The calculation of 64 input channels is mapped to 2 groups of BCONV components. Each
group handles 32 input channels, i.e., 288 multiplications. Each multiplication can be further
divided into 6 bit-and operations which are mapped onto 6 BCONV components in the
same BCONV group. Each BCONV component therefore calculates 288 bit-and operations
in each cycle. Figure 6·1 illustrates how to transform a 2-bit×3-bit multiplication into 6
bit-and operations and map them to 6 BCONV components. To produce the output features,
the intermediate results of these 12 BCONV components need to be reduced. In this
design, intra-group result reduction is performed with add-after-shift operations (from the
most significant bits to the least significant bits); inter-group reduction is realized by summing
the intra-group reduction outputs.
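The decomposition of a multi-bit multiply-accumulate into bit-and, POPCOUNT, shift, and add steps can be sketched in software as follows. This is an illustrative model rather than the RTL: it assumes unsigned features and parameters (the actual encoding may differ), the function names are hypothetical, and the shift amounts are applied per bit-plane pair instead of in the MSB-to-LSB add-after-shift order used by the hardware.

```python
def popcount(word: int) -> int:
    """Number of set bits in a non-negative integer."""
    return bin(word).count("1")


def pack_plane(values, bit):
    """Pack bit `bit` of every value into one integer, one bit per position."""
    word = 0
    for pos, value in enumerate(values):
        word |= ((value >> bit) & 1) << pos
    return word


def qconv_dot(features, weights, f_bits=2, w_bits=3):
    """Dot product of unsigned f_bits-bit features and w_bits-bit weights
    using only bit-AND, POPCOUNT, shift, and add.

    Each (i, j) pair of bit planes models the work of one BCONV component;
    a 2-bit x 3-bit layer therefore needs 2 * 3 = 6 of them per group.
    """
    f_planes = [pack_plane(features, i) for i in range(f_bits)]
    w_planes = [pack_plane(weights, j) for j in range(w_bits)]
    total = 0
    for i, f_plane in enumerate(f_planes):
        for j, w_plane in enumerate(w_planes):
            partial = popcount(f_plane & w_plane)  # 1-bit x 1-bit MACs
            total += partial << (i + j)            # weight by 2^(i+j)
    return total


if __name__ == "__main__":
    # One group: K x K x 32 = 3 * 3 * 32 = 288 positions.
    feats = [(7 * n + 3) % 4 for n in range(288)]   # 2-bit values
    wts = [(5 * n + 1) % 8 for n in range(288)]     # 3-bit values
    assert qconv_dot(feats, wts) == sum(f * w for f, w in zip(feats, wts))
    print(qconv_dot(feats, wts))
```

The inter-group reduction described above corresponds to calling `qconv_dot` once per 32-channel group and summing the results.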
6.3.2 QT-BN Module
As mentioned in Chapter 2, QT-BN is performed by comparing the QCONV results to
multiple thresholds. For a QNN layer with q-bit features, each output channel has at least
2^q − 1 thresholds. To calculate each quantized output feature, a QCONV result needs to be
compared to q thresholds. In CQNN, each QT-BN is implemented with q T-BN modules.
Each T-BN module compares the QCONV result to one threshold and the comparison out-
come determines one bit of the QT-BN result. T-BN modules work in a pipelined manner
and q modules determine the final value of the quantized feature collaboratively.
Figure 6·1: QCONV for a QNN layer with 2-bit features and 3-bit parame-
ters and the transformation from a 2-bit×3-bit multiplication into 6 bit-and
operations.
CQNN mainly focuses on QNNs with 1-bit to 6-bit features and parameters. Figure 6·2
illustrates the structure of a QT-BN module for a QNN layer with 3-bit features. Quantized
features have 8 possible values, i.e., 0 to 7. The value is determined by comparisons between
a QCONV result and 7 thresholds, i.e., T0 to T6 in Figure 6·2. The first T-BN determines
the most significant bit of the final output, while the last one decides the least significant
bit. Each T-BN has a threshold table for storing the thresholds required to generate the
quantized features. Each T-BN has 2 inputs, the address for threshold table access and the
QCONV output for comparison, and 1 output, 1 bit of the quantized output feature.
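One way to see why q comparisons suffice for 2^q − 1 thresholds is to treat the T-BN chain as a binary search over a sorted threshold table, resolving one output bit per stage from MSB to LSB. The sketch below assumes ascending thresholds, a "greater than or equal" comparison, and a table address derived from the bits already decided; these details and the function names are assumptions for illustration, not the documented T-BN addressing scheme.

```python
def qt_bn(x, thresholds, q):
    """Quantize x to a q-bit value with q threshold comparisons.

    thresholds: ascending list of 2**q - 1 values (T0 .. T_{2^q - 2}).
    Each loop iteration models one T-BN stage and fixes one output bit.
    """
    if len(thresholds) != 2 ** q - 1:
        raise ValueError("expected 2**q - 1 thresholds")
    value = 0
    for bit in reversed(range(q)):          # MSB first
        candidate = value | (1 << bit)      # tentatively set this bit
        address = candidate - 1             # assumed threshold-table address
        if x >= thresholds[address]:
            value = candidate               # comparison outcome = this bit
    return value


def qt_bn_reference(x, thresholds):
    """Reference: count how many thresholds x reaches (2**q - 1 comparisons)."""
    return sum(1 for t in thresholds if x >= t)


if __name__ == "__main__":
    thr = [10, 20, 30, 40, 50, 60, 70]      # 3-bit features: 7 thresholds
    for x in range(0, 90):
        assert qt_bn(x, thr, 3) == qt_bn_reference(x, thr)
    print(qt_bn(35, thr, 3))
```

For 3-bit features, the first stage compares against the middle threshold (T3), which matches the statement that the first T-BN decides the most significant bit.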
6.3.3 Q-POOL
Q-POOL (MAX-Pooling) modules are composed of multiple bit-based pooling compo-
nents (BPOOL) working in a pipelined manner. Each Q-POOL is connected to a QT-BN
and consumes the features generated by the QT-BN. For a QNN layer with q-bit features, a
Q-POOL module has q BPOOLs. The q BPOOLs and q T-BNs are connected one-to-one.
Figure 6·2: QT-BN architecture for a QNN layer with 3-bit features.
All BPOOLs and T-BNs work in a 2-level pipeline. Different BPOOLs compare different
bits of the quantized features and send comparison outcomes to their successors for further
comparison if necessary. Figure 6·3 illustrates the structure of a Q-POOL module for a
QNN layer with 3-bit features. Each BPOOL has an output called EN which is used to
activate the successive BPOOLs. The EN signal is 0 when a BPOOL finds the incoming
bit to be smaller than the one currently stored in its MAX Value buffer. In this case fur-
ther comparisons and updates are not needed and its successors will neither compare the
newly coming data nor update the max value of the bit they are working on. Each BPOOL
calculates one bit of the Pooling results.
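The bit-serial comparison can be modelled in a few lines. In this sketch each loop iteration plays the role of one BPOOL, processing bits from MSB to LSB. The text above specifies the EN = 0 case (incoming bit smaller than the stored bit); the handling of a larger incoming bit, where the current and all lower bits are overwritten, is an assumption made here to complete the model, and the helper names are hypothetical.

```python
def to_bits(value, q):
    """Unsigned q-bit value as a list of bits, MSB first."""
    return [(value >> b) & 1 for b in reversed(range(q))]


def from_bits(bits):
    """Inverse of to_bits."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value


def qpool_step(max_bits, new_bits):
    """Update the stored maximum with one incoming feature, bit by bit."""
    updated = list(max_bits)
    for k, (stored, incoming) in enumerate(zip(max_bits, new_bits)):
        if incoming < stored:
            # EN = 0: the incoming value is smaller, so the successors
            # neither compare nor update.
            break
        if incoming > stored:
            # Assumed behaviour: the incoming value is larger, so this bit
            # and all lower bits are replaced without further comparison.
            updated[k:] = new_bits[k:]
            break
        # Bits are equal: EN = 1 and the next BPOOL continues the comparison.
    return updated


def q_pool(window, q):
    """MAX-pooling over a window of unsigned q-bit features."""
    max_bits = to_bits(0, q)
    for value in window:
        max_bits = qpool_step(max_bits, to_bits(value, q))
    return from_bits(max_bits)


if __name__ == "__main__":
    print(q_pool([3, 5, 2, 5], q=3))
```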
6.4 Design of CQNN Framework
This section introduces the CQNN framework, which includes the hardware architecture,
compiler, cycle-accurate simulator, and RTL generator. We begin with an overview, then
present the details, and finally discuss how the compiler works with the architecture for NOC
configuration.
Figure 6·3: Q-POOL architecture for a QNN layer with 3-bit features.
6.4.1 Framework Overview
Figure 6·4 illustrates the framework of CQNN. The compiler is used to generate instruc-
tions to reconfigure CGRA NOC at run time. Instruction generation is based on the model
configurations per layer and the hardware constraints, e.g. dimensions of target CGRA
arrays. After instructions are generated for an entire QNN model, they are stored in the
Control Processor (CP). During processing, the CP fetches and decodes the instructions
and generates signals to configure the CGRA network. The RTL generator is used to gen-
erate the Verilog-based hardware. A cycle-accurate simulator provides fast and accurate
performance evaluation.
6.4.2 Architecture
Figure 6·5 illustrates the architecture of CQNN. CQNN has 2 main modules: CGRA array
(CGRA) for inference calculation and Control Processor (CP) for CGRA NOC run-time
configuration. We first introduce the CGRA array.
Figure 6·4: Framework of CQNN
CGRA Array
A CGRA Array consists of a reconfigurable NOC, Parameter Scheduler (PS), Feature
Scheduler (FS), and computation components including BCONVs, T-BNs, and BPOOLs.
During the processing of a certain layer, FS performs two tasks. First, it prefetches input
features from off-chip memory and forwards those features which are going to be consumed
in the coming iterations to BCONVs in the expected order. Second, it receives output fea-
tures calculated in the current iteration from BPOOLs, selects valid outputs, and writes
them back to off-chip memory. In parallel, the Switch Reconfiguration Controller (SRC) at
CP sends reconfiguration signals of the next layer to PS, FS, and NOC. These signals are
cached in these 3 modules with double buffering where they await the completion of the
layer being processed. After PS receives the reconfiguration signals, it begins prefetching
weights, biases, and thresholds of the next layer and sends them to the BCONVs and T-BN
engines.
At the completion of the layer under processing, the NOC is reconfigured based on the
signals received previously so that a new configuration that matches the requirements of
Figure 6·5: The overall architecture of CQNN including CP and CGRA
array. Black blocks are switches. Each Switch connected to a red block is
equipped with an accumulator and a left shift logic. The blocks covered by
the gray window can be integrated as an engine for 2-bit×5-bit QNN layers.
the coming layer is realized. In this design, in order to reduce the reconfiguration time
and routing complexity of the reconfiguration network, the CGRA NOC is reconfigured
by column from right-most to left-most. Assuming that the left-most column receives the
reconfiguration signals and passes them on at one column per cycle, and that there are c
columns in the CGRA, at cycle c all columns have received their reconfiguration signals and
are reconfigured simultaneously. In this design, the number of columns that can be
configured at each cycle can be customized. More columns per cycle means faster
configuration but a more expensive network. In this implementation, the SRC configures 1
column per cycle.
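The trade-off between columns configured per cycle and configuration time can be made concrete with a small calculation. The helper names are hypothetical, and the example assumes that the 64×48 array used later in the evaluation has 48 columns and runs at 300 MHz; the array orientation is an assumption.

```python
import math


def reconfig_cycles(num_columns, columns_per_cycle=1):
    """Cycles needed to deliver configuration signals to every column."""
    if num_columns <= 0 or columns_per_cycle <= 0:
        raise ValueError("arguments must be positive")
    return math.ceil(num_columns / columns_per_cycle)


def reconfig_time_ns(num_columns, columns_per_cycle=1, freq_mhz=300.0):
    """Configuration time in nanoseconds at the given clock frequency."""
    cycles = reconfig_cycles(num_columns, columns_per_cycle)
    return cycles * 1e3 / freq_mhz


if __name__ == "__main__":
    for k in (1, 2, 4):
        print(k, reconfig_cycles(48, k), reconfig_time_ns(48, k))
```

Because the signals for the next layer are double-buffered while the current layer is still being processed, this delivery time can overlap with computation rather than add to the layer latency.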
As mentioned in Section 6.3, in this CGRA architecture, all computation components
Figure 6·6: CGRA Array details and 3 types of integration for a QNN en-
gine for a layer with 32 input channels, 3-bit features, and 2-bit parameters.
(A) Default configuration implements a QNN engine by grouping binary
components in a 6×6 raw CGRA array. (B) & (C) Implements QNN en-
gines by grouping binary components in a 3×12 & 12×3 raw CGRA arrays.
All components are pipelined.
are designed to support binary functions, but, through run-time NOC reconfiguration, can
be efficiently and easily integrated into modules with various data-widths. Figure 6·6 il-
lustrates 3 types of configurations for a motivating QNN engine (a 3-bit×2-bit QNN layer
Figure 6·7: Engine mapping with 3 types of configuration
with 32 input channels).1 Each of these engines is created by integrating 36 binary compo-
nents which work collaboratively to calculate one complete output feature per cycle. These
3 types of configurations realize the engines with the same functionality but have differ-
ent shapes. These configurations can be combined to fit on a CGRA with a certain size
to maximize the number of engines that can be adopted. As shown in Figure 6·7, we use
the default configuration to map QNN engines onto a CGRA array until there is insuffi-
cient space remaining. Horizontal and vertical configurations are then used to fill up the
remaining rows and columns. The mapping scheme is generated offline by the compiler.
As shown in Figure 6·6, in a QNN engine, BCONVs are always fully utilized, but only
25% of the T-BNs and BPOOLs are activated. This overhead is unavoidable if any type of
QNN layer is to be supported dynamically at runtime. The overall overhead, however, is very
small, as the hardware resources of T-BNs and BPOOLs are only 9.1% and 2.9% of those of
BCONVs.
As for the hardware support for parameter and feature on-chip storage and data for-
warding, we use similar designs to the ones used in LP-BNN [Geng et al., 2019a]. Param-
eters and features at multiple channels are packed and stored in line buffers for easier data
1 A QNN engine is defined as a group of modules that can produce one complete output feature per cycle.
Figure 6·8: Instruction structure
movement and reuse.
Control Processor
As mentioned above, the CGRA configuration is determined offline by the compiler, and
configuration signals are generated by the CP. During compilation a series of instructions
is generated based on the model configurations and CGRA dimensions. Each instruction
describes the CGRA configuration for one QNN layer. Figure 6·8 shows the instruction
structure. The instructions are stored in Instruction Memory (IM). At the start of processing
of a certain layer, Instruction Fetch and Decode (IF & ID) access the instruction for the
next layer from IM and send the decoded information to SRC for configuration signal
generation. Further details are omitted due to limited space.
6.4.3 Compiler
The compiler calculates the maximum numbers of default, horizontal, and vertical QNN
engines that can be implemented on the target array; maps these QNN engines onto the
array; and generates instructions for all layers. A greedy algorithm is used to determine
the mapping strategy. For a particular layer, the compiler first determines the numbers of
rows and columns occupied by each default, horizontal, or vertical engine. The compiler
then maps as many default engines as possible from the upper left corner to the lower right
corner of the CGRA array. The remaining space is filled up with horizontal and vertical
engines. At this point the compiler has enough information to generate an instruction for a
layer.
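The greedy strategy can be illustrated with a simple counting model. This is a sketch under stated assumptions, not the compiler itself: engine shapes are given as (rows, columns), default engines are tiled from the upper-left corner, the leftover bottom strip is filled with horizontal engines, and the leftover right strip with vertical engines. The function name and the strip assignment are hypothetical.

```python
def greedy_engine_count(rows, cols, default, horizontal, vertical):
    """Count engines placed on a rows x cols array by a simple greedy tiling.

    default, horizontal, vertical: (rows, cols) occupied by one engine of
    each configuration.
    """
    d_rows, d_cols = default
    tiles_down, tiles_across = rows // d_rows, cols // d_cols
    n_default = tiles_down * tiles_across
    used_rows, used_cols = tiles_down * d_rows, tiles_across * d_cols

    # Bottom strip (full array width): filled with horizontal engines.
    h_rows, h_cols = horizontal
    n_horizontal = ((rows - used_rows) // h_rows) * (cols // h_cols)

    # Right strip (height of the default region): filled with vertical engines.
    v_rows, v_cols = vertical
    n_vertical = (used_rows // v_rows) * ((cols - used_cols) // v_cols)

    return {"default": n_default,
            "horizontal": n_horizontal,
            "vertical": n_vertical}


if __name__ == "__main__":
    # Engine shapes from Figure 6.6 (3-bit features, 2-bit parameters,
    # 32 input channels); a 64-row by 48-column array is assumed.
    print(greedy_engine_count(64, 48, (6, 6), (3, 12), (12, 3)))
```

Under these assumptions the sketch places 80 default and 4 horizontal engines on a 64-row by 48-column array; this is the output of the model above, not a figure reported in the evaluation.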
Table 6.1: Execution latency (µs) and BCONV utilization of QNN layers
with different numbers of input channels (NIC) and data-widths (DW) of
features and parameters. All layers have 2×2 MAX-Pooling and 128 output
channels. Image size is 128 × 128.
Design: CQNN with 64×48 CGRA array
Frequency: 300 MHz
Device: Xilinx VCU118 FPGA (67% LUTs and 40% Flip-Flops)

In this section, we evaluate the performance of CQNN instantiations. We implement a
CQNN with a 64×48 CGRA array (64×16 each of BCONVs, T-BNs and BPOOLs) on
a Xilinx VCU118 FPGA. We use Vivado Design Suite 2019.1 for design synthesis, im-
plementation, and bit-file generation. The QNN models are implemented in Pytorch and
trained with a high-end CPU Intel Xeon E5-2680v3 and an NVIDIA Tesla V100 GPU.
6.5.1 Performance with different data-widths
We first evaluate the CQNN performance working with QNN layers with different data-
widths and numbers of input channels. Table 6.1 shows the latency results and hardware
utilization of BCONV components. All layers have 128 output channels and 2×2 MAX-
Pooling. The size of the input feature maps is 128×128. The CQNN with a 64×48 CGRA
array consumes 358K (67%) LUTs and 427K (40%) Flip-Flops of the FPGA chip. Results
show that the overall utilization of CQNN is over 90% for most of the QNN layers. For a
given CQNN, the execution latency increases linearly with the QNN layer
Table 6.2: Cross-platform comparison and evaluation: inference latency in
ms, energy efficiency in img/kJ. CQNNs are compared to existing FPGA
QNN works (e.g. Stratix-V [Liang et al., 2018], VCU108 [Ghasemzadeh
et al., 2018], ZC706 [Geng et al., 2021], and KCU1500 [Geng et al., 2019a])
and a GPU TensorFlow-based implementation [Li et al., 2019a].
GPU FPGA
Device P100 V100 Stratix-V VCU108 ZC706
Latency (ms) 114.65 2183.62 131.09 1164.95 1.16 1.92 0.61 5.63
Energy (img/kJ) 29.1 1.52 42.1 4.93 3.31E4 2.72E4 1.99E5 2.23E4
FPGA FPGA: Our Design
Latency (ms) 0.27 4.19 0.10 0.13 2.63 0.0081 0.0096 0.19
Energy (img/kJ) 1.15E5 6.87E3 3.70E5 2.65E5 1.27E4 4.94E6 4.01E6 1.98E5
workload. The end-to-end latency reported in Table 6.1 includes instruction fetch, decode,
SRC signal generation, feature and parameter reads from and writes to off-chip memory, and
QNN layer processing. Offline compilation is not included.
6.5.2 Cross-platform Comparison
We compare the performance of CQNN to TensorFlow-based QNN implementations on
NVIDIA Tesla V100 and P100 GPUs. We use the Vgg-like network [Courbariaux et al.,
2015] of Cifar-10, AlexNet [Krizhevsky et al., 2012], and VGG-16 [Simonyan and Zis-
serman, 2015] of ImageNet as benchmarks (Table 6.3). We also compare our results to
existing FPGA-based QNN accelerators designed for specific NN models; CQNN shows, in
addition to better flexibility, higher performance as well (Table 6.2). We do not claim that
CQNNs outperform existing designs for QNNs, since the improvement in performance may
be due to a larger board and higher clock frequency. Still, the comparison confirms that the
CQNN design does not sacrifice performance to achieve high flexibility.
With CQNN, the inference of Vgg-like, AlexNet, and VGG-16 with 4-bit features and
3-bit parameters takes only 0.10ms, 0.13ms, and 2.63ms, respectively. The inference of
Table 6.3: Structures of the networks used to evaluate CQNN.
binary Vgg-like, AlexNet, and VGG-16 takes only 8.1 µs, 9.6 µs, and 190 µs, respectively.
Board-level power consumption is measured with a power meter.
6.6 Conclusion
We propose CQNN, a CGRA-based acceleration framework for QNNs. The architecture of
CQNN is composed of a programmable control processor, binary components for CONV,
BN, Pooling kernels, and reconfigurable NOCs. The control processor reconfigures the
NOCs and integrates the binary components at runtime to realize the optimal designs for
QNN layers being processed. CQNN has compilation, simulation, and RTL generation
support for fast implementation and evaluation. Experimental results show the proposed
CQNN framework can efficiently handle bit-level irregularity of mixed-precision DNNs




FPDeep: Scalable Acceleration of CNN Training on Deeply-Pipelined FPGA Clusters
This chapter introduces the FPDeep (FPGA-based Deep Neural Network Training) framework,
which supports scalable, high-accuracy, and high-performance CNN training. FPDeep
realizes CNN training as a fine-grained pipeline on a 1D array of FPGAs and uses
graph-based workload partitioning to map CNN training logic onto FPGA clusters with
balanced workloads. This chapter is based on work presented at the 26th
Annual International Symposium on Field-Programmable Custom Computing Machines
(FCCM) ©2018 IEEE [Geng et al., 2018], the 28th International Conference on Field
Programmable Logic and Applications (FPL) ©2018 IEEE [Geng et al., 2018], and in
Transactions on Computers (TC) ©2020 IEEE [Wang et al., 2020].
7.1 Introduction
Deep convolutional neural networks (CNNs) have revolutionized applications such as im-
age classification and object recognition [Li et al., 2019a, Sun et al., 2016, Zhao et al.,
2017a, Venkataramani et al., 2017, Geng et al., 2020, He et al., 2016]. But as there remains
an open-ended demand for more complex networks and larger datasets, new computing so-
lutions are critical. A challenging problem is that while large training sets are necessary for
good generalization, they are also more computationally expensive. Therefore, nearly all
of these neural networks are powered by the stochastic gradient descent algorithm (SGD).
Traditionally, distributed synchronous stochastic gradient descent (D-SGD) has enabled
large-scale CNN training by partitioning SGD mini-batches into smaller data batches that
can then be processed in parallel and so accelerate CNN training [Goyal et al., 2017]. A
drawback of this method is scalability: to enable continued high utilization as the number of
nodes increases, each node must be allocated an ever-larger workload, which means that the
mini-batch size must increase. Larger mini-batches, however, slow training convergence.
Thus, while larger clusters provide increased computation capacity, the training time is not
proportionally reduced [Goyal et al., 2017, Keskar et al., 2016]. In [Keskar et al., 2016]
the authors demonstrate that increasing batch size increases improper convergence to sharp
minimizers, which, in turn, gives rise to poor generalization and thus causes an increasing
gap between test and training accuracy. Table 7.1 shows the performance of small-batch
(SB) and large-batch (LB) variants of ADAM on six networks. Comparing LB and SB,
we observe that LB does not decrease the accuracy derived from the training set, but does
substantially reduce the testing accuracy.
Certain methods can somewhat reduce this loss of accuracy – e.g., using dynamic batch
sizes and fine-tuning the learning rate – but they do not solve the problem [Goyal et al.,
2017]. SB limits the parallelism that can be exploited by high-end computing clusters,
especially when data parallelism is used; SB is thus rarely used in large-scale training.
FPGA clusters are a competitive technology for CNN inference [Zhang et al., 2016a,
Sanaullah et al., 2018d, Lu et al., 2017, Chung et al., 2018, Fowers et al., 2018]. For CNN
training, however, their efficacy is still an open question; one that is addressed in this
work. Previous FPGA clusters for CNN training have generally worked in batch mode
(batch in the computational sense), which uses the distributed synchronous SGD algorithm
just described [Zhao et al., 2016, Guan et al., 2017, Cong et al., 2011, Hegde and Kapre,
2017, Moss et al., 2017, Lian, 2016]. In this approach, called Data Parallelism [Ben-Nun
and Hoefler, 2019], each FPGA executes all layers of the CNN. This is done in sequential
order, a layer at a time, with a new layer starting only after the previous layer has completed.
Table 7.1: Performance of small-batch (SB) and large-batch (LB). Note
that LB does not decrease training accuracy, but reduces the test accuracy
[Keskar et al., 2016]
Network    Training Accuracy    Test Accuracy

Data Parallelism has three significant disadvantages. First, optimal FPGA configurations
for different CNN layers vary greatly: either the FPGA is suboptimally configured, or the
FPGA needs to be reconfigured repeatedly at run-time. Second, the storage required for
weights and intermediate features is generally large enough that off-chip memory must be
used. And third, this entire approach suffers from the scalability problem of the distributed
synchronous SGD algorithm already described.
Another method, which we call Layer Parallelism, is to daisy-chain multiple FPGAs
and map the entire CNN onto a single pipeline. Zhang, et al. [Zhang et al., 2016a] used
Layer Parallelism to accelerate CNNs using FPGA clusters, but only for inference. Their
approach, however, still leaves two problems. First, the pipeline is not seamless; a particu-
lar layer might stall until the previous layer finishes. All features must, therefore, be cached
until the last feature of a layer is obtained. This requires large storage that necessitates the
use of off-chip memory. Second, the computational load varies greatly among layers. A
naive workload distribution can result in a large number of idle cycles due to inter-layer
dependencies. These two problems are present in both inference and training but have a
greater impact on the latter. In training, all features of the hidden layers must be cached
until their corresponding errors arrive through Back Propagation (BP), thus requiring much
larger memory. And due to BP, the number of operations per layer triples.
We propose FPDeep, a novel FPGA-cluster-based training framework for CNNs that
solves the problems just described. FPDeep does this by using a hybrid of layer and model
parallelism together with a number of new workload/weight balancing strategies. No re-
configuration is needed: each device computes only certain layers or a part of a single
layer; each device is optimized independently with respect to its own computation. The
cluster is now a single fine-grained pipeline so the batch size can be arbitrarily small. The
amount of data that must be saved is drastically reduced eliminating most off-chip memory
accesses. Internode communication is simple and pipeline utilization very high. To the best
of our knowledge, our work is the first on CNN training with FPGA-based clusters using
this method of parallelism and also the first with fine-grained workload/weight balancing.
The underlying theme of this work is to convert batch parallelism to pipeline paral-
lelism, which has obvious benefits. Parallelism is equal to the depth of the pipeline, in this
case, many thousands of stages across the cluster. Communication paths can be short so
cycle times are as small as the designer can make them. Communication among devices
is direct and contention-free with any latency having no effect on throughput. There is
also the aforementioned benefit of having all of the latency reduction applied to individ-
ual problem instances and so obviating the algorithmic challenges that come with larger
batches.
We find this approach to be effective with performance similar to that of GPU clusters of
similar size and technology, but with far better power efficiency. The limiting factor is inter-
FPGA bandwidth. But, somewhat surprisingly, we find that a 1D topology suffices and
that, even using only six transceivers per FPGA (Stratix-V era) [Wang et al., 2018b, Xiang
et al., 2018, Xiong et al., 2019, Geng et al., 2018b], FPDeep achieves linear speed-up to
83 FPGAs. Overall, with 250 Gb/s bidirectional bandwidth per FPGA, easily supported by
current generation FPGAs, FPDeep’s performance shows linearity up to 100 FPGAs. The
main contributions are as follows:
1) A demonstration that the well-known scalability wall of CNN training can be broken
down and that FPGA clusters are a competitive technology for CNN training;
2) A novel framework for mapping CNN training logic to distributed FPGA clusters
that achieves both high efficiency and scalability; that does not suffer from issues related to
mini-batch size; and that needs only a simple interconnection network as is available in any
multi-FPGA system with efficient inter-FPGA communication and reasonable bandwidth;
3) A fine-grained pipeline design that minimizes the time that features need to remain
available while waiting for back-propagation, thus reducing the storage demand to the point
where only on-chip memory is required for the convolution layers;
4) Fine-grained partitioning and mapping methodologies, which provide almost perfect
workload and weight balancing among FPGAs; this is done by increasing the flexibility
of workload and weight allocation, thus leading to improved utilization: multiple FPGAs
can cooperatively compute the same layer, while multiple layers can also be mapped to the
same device;
5) An RTL code generator that automatically creates RTL implementations based on
the mapping scheme generated by FPDeep.
The organization of this chapter is as follows. In Section 7.2, related work is discussed.
In Section 7.3, the methodology of FPDeep is presented and the workload/parameter parti-
tion methods are defined. In Section 7.4, the overall system architecture is given. In Section
7.5, the implementation of each FPGA node’s accelerator is introduced. In Section 7.6, the
experimental results are presented. Discussion and further work are in Section 7.7.
7.2 Related Work
In this chapter we use VGG-16, a widely used neural network in image classification, as an
example to demonstrate various FPDeep features.
Much work has addressed the mapping of inference/training of CNNs to clusters with
programmable accelerators, including [Blott et al., 2017,Venkataramani et al., 2017]. Also,
many frameworks and libraries have been deployed, e.g., MXNet [Chen et al., 2015],
Caffe [Jia et al., 2014], and Tensorflow [Abadi et al., 2016]. These systems hide the com-
plexity of workload decomposition and provide friendly programmer interfaces, including
Python, R, and Scala. In [Mirhoseini et al., 2017], Google proposed a method that uses
reinforcement learning to optimize device placement for the Tensorflow computational
graph. [Huang et al., 2019] introduced GPipe to solve the problem that the DNN model
capacity increases to the point that the model is too big to fit in the memory of a single ac-
celerator. GPipe uses a batch-splitting pipelining algorithm to map AmoebaNet onto eight
GPUs. Microsoft proposed PipeDream [Narayanan et al., 2019], which exploits intra-batch
parallelism to train CNNs with GPU clusters. PipeDream uses dynamic programming to
find the optimal workload partition. A detailed comparison is given in Sections 7.3.2 and
7.4.4.
For FPGA-based clouds, the prior work is more limited. Microsoft’s Catapult project
[Ovtcharov et al., 2015, Caulfield et al., 2016] includes a parameterized CNN accelerator
cluster that can deliver over 1 TFLOPs with very high energy efficiency. Zhang’s CDSC
FPGA-Enabled Cluster accelerates CNNs on top of Spark and Hadoop [Cong et al., 2011,
Zhang et al., 2016a]. In [Zhang et al., 2016a], researchers build a deeply pipelined FPGA
cluster with 6 Xilinx VC709 boards to accelerate CNNs. In [Zhao et al., 2016], an FPGA-
based framework of CNN training is proposed but focuses mainly on single-FPGA designs.
Most distributed CNN systems, including TensorFlow and CNTK, are based on the
distributed synchronous SGD algorithm (Centralized Parallel SGD algorithm - C-PSGD).
The Parameter Server Topology [Li et al., 2014] uses a central parameter node connected
with multiple worker nodes. There are multiple bottlenecks including communication load
on the central node [Lian et al., 2017] and idle time while waiting for straggling worker
nodes [Chen et al., 2016]. Also, for large-scale clusters, the growth in the SGD mini-batch
size limits scalability. Lian, et al. use a decentralized parallel SGD algorithm (D-PSGD) to
build a large-scale cluster [Lian et al., 2017]. Each node must maintain its own local copy
of the model and data duplication is inevitable.
Fig. 7·3 shows the design space for mapping CNNs onto distributed nodes. We use the
terminology introduced by [Ben-Nun and Hoefler, 2019]. Data parallelism (Fig.7·3(A))
is the most popular approach in CPU and GPU clouds [Chen et al., 2015, Abadi et al.,
2016]. It is also widely used in existing FPGA clouds, such as Catapult and CDSC [Cong
et al., 2011]. This method has drawbacks as mentioned in Section 7.1. In CNNs, the
configurations of each layer, such as kernel size, pooling size, and stride size, vary greatly,
requiring different hardware designs to obtain optimal performance. Thus, FPGAs need
to be reconfigured between layers, leading to significant overhead. Also, as each FPGA
executes all layers in sequential order, each layer starts only after the previous layer has
completed. Thus, all intermediate features and weights need to be stored to, and loaded
from, the host upon completion of a layer, leading to heavy off-chip memory traffic.
Layer Parallelism (Fig. 7·3(B)) maps layers of the CNN onto individual nodes and
pipelines the computation. It has been employed by both GPU [Huo et al., 2018b,Huo et al.,
2018a,Wu et al., 2016,Jia et al., 2018] and FPGA frameworks [Zhang et al., 2016a]. In [Wu
et al., 2016], each LSTM layer is assigned to a different GPU. Since each layer is mapped
to a certain GPU, workloads are not balanced among devices. For multi-FPGA systems,
Zhang, et al. [Zhang et al., 2016a] address only inference; also, the parallelism is
coarse-grained, the workload is unbalanced, and there is heavy off-chip memory communication.
So while Layer Parallelism mitigates some of the problems with batch size and frequent
reconfiguration, it suffers from other problems: load balancing and stalls as some nodes
wait for others to finish.
In Model Parallelism (Fig. 7·3(C)), weights for each layer are distributed across nodes.
Therefore, all intermediate results from all devices must be added up and then broadcast
to every device leading to heavy communication. This method has been used for AlexNet
[Krizhevsky et al., 2012].
7.3 FPDeep Framework
7.3.1 Overview
An operator graph G (Fig. 7·1) is used to describe the operations in DNN training. Each
node o_i ∈ G is an operator (e.g., matrix multiplication or activation function), and each edge
(o_i, o_j) ∈ G is a tensor (an n-dimensional array) that is an output of o_i and an input of o_j.
Each node o_m has a weight Op(o_m). Hardware constraint parameters (Tab. 7.2) are used to
describe all available hardware devices.
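The operator-graph definition can be made concrete with a small data structure. The following Python sketch is only an illustration under assumptions: the class name `OperatorGraph`, its methods, the operator names, the operation counts, and the tensor shapes are all hypothetical and do not come from FPDeep itself.

```python
from dataclasses import dataclass, field


@dataclass
class OperatorGraph:
    """Directed graph whose nodes are operators and whose edges are tensors."""
    op_count: dict = field(default_factory=dict)  # operator name -> Op(o)
    edges: dict = field(default_factory=dict)     # (producer, consumer) -> tensor shape

    def add_operator(self, name, ops):
        # Register operator `name` with its weight Op(name).
        self.op_count[name] = ops

    def add_tensor(self, src, dst, shape):
        # A tensor is an output of `src` and an input of `dst`.
        if src not in self.op_count or dst not in self.op_count:
            raise KeyError("both endpoints must be existing operators")
        self.edges[(src, dst)] = shape

    def total_ops(self):
        # Sum of the weights of all operators in the graph.
        return sum(self.op_count.values())

    def consumers(self, name):
        # Operators that read a tensor produced by `name`.
        return [dst for (src, dst) in self.edges if src == name]


# Hypothetical three-operator chain (values are illustrative only).
g = OperatorGraph()
g.add_operator("conv1", 1000)
g.add_operator("relu1", 10)
g.add_operator("conv2", 2000)
g.add_tensor("conv1", "relu1", (64, 32, 32))
g.add_tensor("relu1", "conv2", (64, 32, 32))
```

Keeping Op(o) as a per-node attribute makes it straightforward to sum the workload of any sub-graph, which is the quantity the partitioning steps reason about.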








Figure 7·1: An operator graph is used to represent DNN training. Nodes
are operators and edges are tensors.
FPDeep thus has two sets of input parameters: the Operator Graph and the Hardware
Constraint Parameters. The whole framework contains two parts: mapping and
implementation (Fig. 7·2(A)). The Mapping Framework partitions the operator graph into several
fine-grained segments and maps them onto FPGA clusters so that every FPGA gets a
balanced workload and parameters. In the Implementation Framework, the RTL generator
creates RTL implementations for each FPGA based on the parameterized mapping, and
a cycle-accurate simulator gives measures of throughput, bandwidth demand, and percent
idle stages.
Table 7.2: Hardware Constraint Parameters
Notation      Description
Device Num    Number of FPGA devices in the cluster
LUTmax        Number of look-up tables per FPGA device
FFmax         Number of flip-flops per FPGA device
BRAMmax       Number of Block RAMs per FPGA device
DSPmax        Number of DSP slices per FPGA device
Transmax      Number of available transceivers per FPGA device
7.3.2 Operator Graph Partitioning Methodology
As mentioned in Section 7.2, DNN training includes FP, EB, and PG phases. The inter-
phase data dependencies lead to a complex operator graph G. General graph partitioning
methods, such as Google’s Reinforcement Learning (RL) method [Mirhoseini et al., 2017],
are useful approaches in the partitioning tasks of DNN training, but not efficient enough.
Finding the global optimal solution of graph partitioning is NP-hard, making the time
to find the best partition comparable to the DNN training time [Narayanan et al., 2019].
FPDeep takes advantage of the fact that DNN training logic can be modeled as a com-
putational pipeline consisting of groups of consecutive layers; this significantly simplifies
the optimization algorithm and makes it possible to return the exact solution in polynomial
time.
As shown in Fig. 7·2(B), FPDeep graph partitioning works in two phases: a coarse-grained
phase and a fine-grained phase.
[Figure 7·2 labels: Operator Graph; Hardware Constraint Parameter; Coarse-grained Phase; Fine-grained Phase; BP]
Figure 7·2: (A) Overview of the FPDeep Framework. The operator graph
and hardware constraints are input parameters. (B) FPDeep contains two
phases: mapping and implementation. (C) The proposed DNN operation
graph partition methodology with ResNets and Inception.
1. Coarse-grained phase: The whole graph G is abstracted and simplified as a one-way
graph G. Each node in G represents the workload of forward and backward propagation of a
certain layer. The coarse-grained graph G is partitioned into multiple (number of FPGAs)
sub-graphs with similar workload sizes. This simplifies the partitioning process of G,
but results in a coarse-grained partitioning solution with high variance in the workload size
distribution of the sub-graphs.
2. Fine-grained phase: Each sub-graph G_i of G is expanded into the details of forward
and backward propagation. As shown in Fig. 7·2, G_i is represented with a finer tiling
unit, with nodes representing convolution operations at different channels. In this phase,
FPDeep reallocates FPGA resources in a finer-grained manner to reduce the
variance of the workload distribution left by phase 1. In Section 7.6, we showcase the proposed
graph partitioning method with practical DNNs, e.g., AlexNet and VGG-16/19.
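The two phases can be illustrated with a simplified Python sketch. It is not FPDeep's partitioner: the coarse step is a textbook dynamic program that cuts a chain of layers into contiguous groups while minimizing the largest group, and the fine step splits layers at fractional granularity, standing in for the channel-level tiling described above. The function names and the per-layer operation counts are hypothetical.

```python
def coarse_partition(ops, n):
    """Cut `ops` into n contiguous groups, minimizing the largest group sum."""
    num = len(ops)
    if not 1 <= n <= num:
        raise ValueError("need 1 <= n <= number of layers")
    prefix = [0]
    for w in ops:
        prefix.append(prefix[-1] + w)
    inf = float("inf")
    # best[k][j]: smallest bottleneck when the first j layers form k groups.
    best = [[inf] * (num + 1) for _ in range(n + 1)]
    cut = [[0] * (num + 1) for _ in range(n + 1)]
    best[0][0] = 0
    for k in range(1, n + 1):
        for j in range(k, num + 1):
            for i in range(k - 1, j):
                cand = max(best[k - 1][i], prefix[j] - prefix[i])
                if cand < best[k][j]:
                    best[k][j] = cand
                    cut[k][j] = i
    # Walk the recorded cuts backwards to recover the group boundaries.
    bounds, j = [], num
    for k in range(n, 0, -1):
        i = cut[k][j]
        bounds.append((i, j))
        j = i
    bounds.reverse()
    return bounds, best[n][num]


def fine_partition(ops, n):
    """Give every device total/n operations by splitting layers fractionally."""
    target = sum(ops) / n
    devices = [[] for _ in range(n)]
    d, room = 0, target
    for layer, w in enumerate(ops):
        remaining = w
        while remaining > 1e-9:
            # The last device absorbs whatever is left (guards rounding error).
            take = remaining if d == n - 1 else min(remaining, room)
            devices[d].append((layer, take / w))
            remaining -= take
            room -= take
            if room <= 1e-9 and d < n - 1:
                d, room = d + 1, target
    return devices


layer_ops = [4, 1, 1, 2, 8]  # hypothetical per-layer operation counts
bounds, bottleneck = coarse_partition(layer_ops, 3)
fine = fine_partition(layer_ops, 4)
fine_loads = [sum(frac * layer_ops[layer] for layer, frac in dev) for dev in fine]
```

In this toy case the coarse step cannot do better than a bottleneck of 8 operations because the last layer is indivisible at layer granularity, whereas the fine step splits that layer across two devices and gives each of the four devices the same load.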
The proposed CNN training graph partitioning method is useful not only for Feed-
Forward DNNs (FFDNNs), but also DNNs with more complex topologies, e.g. Residual
Neural Networks, Inception, as they can also be modeled as pipelined groups of consecu-
tive layers with some extra abstractions (Fig. 7·2(C)). For example, the parallel convolution
and pooling kernels in Inception can be treated as additional output channels of a CONV
layer in FFDNNs followed by distributed and pipelined concatenation kernels for data re-
duction. For Residual Neural Networks, a shortcut can be treated as extra channels of the
convolution kernels being bypassed by the shortcut. Support for DNNs with more general
topologies will be included in the next-generation FPDeep.
7.3.3 Design Choices in Operator Graph Partitioning
[Figure 7·3 panels: A. Data Parallelism (Sample Dimension); B. Layer Parallelism (Layer Dimension); D. Hybrid Parallelism (Model & Layer Dimension)]
Figure 7·3: Illustration of operator graph partition design choices: (A) Data
parallelism, (B) Layer parallelism, (C) Model parallelism, and (D) Hybrid
parallelism (Layer + Model)
As shown in Fig. 7·3, we can use a cube to represent a node in the operator graph. For
each operator, there are three parallelizable dimensions: Sample, Model, and Layer. All
available partition choices are shown in Fig. 7·3 and Tab. 7.3. Four metrics are used to
evaluate these choices:
1. FLOP Utilization. Maximum FLOPs can be achieved when every DSP
slice processes a valid operation every clock cycle. Real performance is less than
ideal because of workload imbalance or synchronization overhead. FLOP Utilization
(RealFlops/MaxFlops) captures this behavior.
2. Storage Requirement. During DNN training, model parameters and temporal ac-
tivations must be stored. The total Storage Requirement determines whether all necessary
data can be stored in on-chip memory.
Table 7.3: Operator Partition Choice. Different operator graph partition
design choices make possible different parallelizability methods.
Method                     Parallelizability
                           Sample   Layer   Model
Data Parallelism (DP)      Y        N       N
Layer Parallelism (LP)     N        Y       N
Model Parallelism (MP)     N        N       Y
Hybrid Parallelism (HP)    N        Y       Y
3. Communication Footprint. FPGAs need to synchronize data among co-workers.
The Communication Footprint is the total volume of data communicated during one
mini-batch SGD iteration.
4. Communication Bandwidth is the communication footprint divided by the time of
one mini-batch SGD iteration. This metric is used to characterize burstiness.
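The ratio-style metrics in this list are simple enough to write down directly. The sketch below is a minimal Python rendering under assumptions: the function names, the choice of bytes and seconds as units, and all numeric values are hypothetical and are not measurements from this work.

```python
def flop_utilization(real_flops, max_flops):
    """FLOP Utilization = RealFlops / MaxFlops."""
    if max_flops <= 0:
        raise ValueError("max_flops must be positive")
    return real_flops / max_flops


def communication_bandwidth(footprint_bytes, iteration_seconds):
    """Communication footprint divided by the time of one SGD iteration."""
    if iteration_seconds <= 0:
        raise ValueError("iteration_seconds must be positive")
    return footprint_bytes / iteration_seconds


def fits_on_chip(param_bytes, activation_bytes, on_chip_bytes):
    """Storage Requirement check: do parameters plus activations fit on chip?"""
    return param_bytes + activation_bytes <= on_chip_bytes


# Hypothetical numbers, for illustration only.
util = flop_utilization(6.0e11, 8.0e11)      # fraction of peak FLOPs achieved
bw = communication_bandwidth(2.0e9, 0.5)     # bytes per second
ok = fits_on_chip(3.0e6, 4.0e6, 6.0e6)       # needs off-chip memory if False
```

Two configurations with the same footprint can differ in bandwidth only through the iteration time, which is why the fourth metric is reported separately from the third.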
Fig. 7·4 shows results for these four metrics for different parallelization methods and
scales of FPGA clusters. For VGG-16, we set the batch size to 128 so that the DP method
works for clusters with fewer than 128 devices. There are 16 layers (13 convolution and 3
fully connected); thus the LP method only works for a cluster with fewer than 16 devices.
Also, the minimum channel count is 64 (Layer CONV-1,2) so the MP method works for
clusters with fewer than 64 devices.
A. Analysis of FLOP Utilization. LP is the best choice because all devices work in a
pipelined manner. However, because of variations in the operation counts of the DNN layers,
LP still suffers from workload imbalance. DP is the second choice. In each mini-batch SGD
iteration, DP must synchronize the DNN model globally, which causes serious
communication overhead for large-scale FPGA clusters. MP is the worst choice since it needs an
additional layer for synchronization among different channels.
B. Analysis of Storage Requirement. LP is the best choice because of pipelining,
which means the cluster does not need to store temporal activations off chip and the DNN
model’s parameters are distributed. Clearly, when the cluster is small, storing all neces-
sary data in the FPGAs’ on-chip memory is a challenge. But when the cluster is large
enough, the size of on-chip memory is not a bottleneck. For DP and MP, each FPGA must
keep its own copy of the DNN model’s parameters. Also, all temporal activations must be
maintained in local memory.
C. Analysis for Communication Footprint. DP is the best choice because all tempo-
ral activations are stored locally and only DNN model parameters need device-to-device
communication. LP is the second choice because devices work in a pipeline manner and
each device needs to synchronize activations with adjacent devices. MP is worse because
it needs to both synchronize parameters globally and synchronize activations among chan-
nels.
D. Analysis for Communication Average Bandwidth. DP’s bandwidth is the lowest
due to the centralized burst communication pattern. It only synchronizes the model’s
parameters after all workers have finished their jobs. When the workers are busy, the
device-to-device links are idle. In LP, because all devices work in a pipelined manner, they need
to synchronize activations with the adjacent nodes. The communication is stable and the
bandwidth of LP is larger than for DP.
FPDeep Summary: Rather than DP, LP, or MP, FPDeep uses a hybrid parallel method
(Fig. 7·3). It works in a deeply pipelined manner with workload balanced among devices;
this improves FLOP Utilization. The balanced allocation policy also reduces the Stor-
age Requirement of the device-to-device activation buffer. As there is no “free lunch,” all
temporal activations must be transferred among devices. Thus the communication band-
width is the system’s bottleneck for large scale clusters. Tab. 7.4 compares the different
partition methods. Performance details are presented in Section 7.6.
7.3.4 Mathematical Model of FPDeep
As shown in Fig. 7·2, the mapping phase of FPDeep has two parts: operator graph parti-
tioning and FPGA resource allocation. We present a mathematical model for this process.
Figure 7·4: Comparison of different operator graph partition methods ac-
counting for four different metrics: FLOP utilization, storage requirement,
communication footprint, and average communication bandwidth
We assume N FPGAs and an operator graph G.
Operator graph partitioning
In this step, the operator graph G is partitioned into a set of sub-graphs G = {G_1, G_2, ..., G_N}.
Function Op returns the operation count of a sub-graph. For example, Op(G_i) is the
operation count of operator graph G_i and Op_min(G) is the minimum operation count of the
sub-graph set G. Because the FPGA cluster is pipelined, the variance of the sub-graph







Table 7.4: Qualitative comparison, from 1 (worst) to 4 (best), of different
Data Parallel     2  2  2  4  4
Layer Parallel    1  3  3  2  2
Model Parallel    3  1  1  1  1
FPDeep            4  4  4  3  3
FPGA resource allocation
In this step, the FPGAs’ hardware resources are allocated according to the sub-graph set
G . The resource allocation step is an optimization problem. The pipeline is constructed
from Convolution Engines (CEs), which are used to handle compute-intensive convolution
operations, and buffers (Buf), which are used to store CNN model parameters and temporal
activations. Convolution engines are composed of 2-D systolic arrays that consume input
features from shift-registers. Their design is similar to those in [Zhao et al., 2016,Wei et al.,
2017]. The FPGAs’ resources can be expressed as a tuple: (LUT,FF,BRAM,DSP).
As mentioned above, for large FPGA clusters, the goal is to maximize the cluster’s
throughput (T), while for small FPGA clusters, the goal is to minimize the storage require-
ment. In this context, size is relative and it depends on the ratio of the size of the neural
network to the FPGA resources. The constraints lie in the hardware resources at each
device and are denoted as (LUTmax,FFmax,BRAMmax,DSPmax).
For large clusters the number of CEs in device i is denoted as CE_i. The theoretical
maximum performance of these CEs is Perf(CE_i). These convolution engines need buffers
Buf_i, which is a function of CE_i (Eq. 7.2). The overall throughput of the cluster is T and
depends on the node with the lowest performance (Eq. 7.3).
Buf_i = f_1(CE_i)    (7.2)
In FPGAs these Convolution Engines or Buffers can be built with hard DSP
slices/Block RAMs or distributed Lookup Tables/Flip-Flops. We build some CEs (αCE_i)
with hard DSP slices and other CEs ((1−α)CE_i) with LUTs/FFs. Similarly, some buffers
(βBuf_i) are built with hard BRAMs and others ((1−β)Buf_i) with LUTs/FFs. Equations
7.4, 7.5, 7.6, and 7.7 define the hardware resource constraints. Functions f_2, f_3, f_4, f_5, f_6, f_7
return the consumption of the corresponding hardware resource. The target function for a
large cluster is the maximum throughput T:
T = \min\left( \frac{Op(G_i)}{Perf(CE_i)} \right)    (7.3)
subject to:
LUT_i = f_2(\alpha CE_i, \beta Buf_i, (1-\alpha) CE_i, (1-\beta) Buf_i) \le LUT_{max}    (7.4)
FF_i = f_3(\alpha CE_i, \beta Buf_i, (1-\alpha) CE_i, (1-\beta) Buf_i) \le FF_{max}    (7.5)
BRAM_i = f_4(\beta Buf_i) \le BRAM_{max}    (7.6)
DSP_i = f_5(\alpha CE_i) \le DSP_{max}    (7.7)
For small clusters the target function minimizes the storage requirement S; the
constraints are the same as the large cluster case. To fit all DNN training logic into a small
cluster, we propose a method called parameter balancing.
S = \max\left( f_6(\beta Buf_i) + f_7(\beta Buf_i, (1-\beta) Buf_i) \right)    (7.8)
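The bottleneck idea behind the large-cluster objective can be illustrated numerically. The sketch below computes the ratio Op(G_i)/Perf(CE_i) used in Eq. 7.3 for each device. It then assumes, as an interpretation that the text does not state, that Perf is measured in operations per second, so that the ratio is the time a device spends per sample and the slowest device sets the pipeline rate. Function names and numbers are hypothetical.

```python
def stage_ratios(op_counts, perfs):
    """Return Op(G_i) / Perf(CE_i) for every device i."""
    if len(op_counts) != len(perfs):
        raise ValueError("one operation count and one Perf value per device")
    return [op / perf for op, perf in zip(op_counts, perfs)]


def bottleneck_device(op_counts, perfs):
    """Index of the device with the largest Op/Perf ratio."""
    ratios = stage_ratios(op_counts, perfs)
    return max(range(len(ratios)), key=lambda i: ratios[i])


def pipeline_rate(op_counts, perfs):
    """Samples per second of the pipeline (assumption: Perf is in ops/s)."""
    return 1.0 / max(stage_ratios(op_counts, perfs))


# Hypothetical three-device cluster with identical convolution engines.
ops = [2.0e9, 4.0e9, 3.0e9]        # Op(G_i): operations per sample on device i
perf = [1.0e12, 1.0e12, 1.0e12]    # Perf(CE_i): operations per second
slowest = bottleneck_device(ops, perf)
rate = pipeline_rate(ops, perf)
```

Under this reading, moving work away from the device with the largest ratio is the only change that raises the pipeline rate, which is why balanced sub-graphs matter more than the peak performance of any single device.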
Fig. 7·5(A) shows the number of model parameters and activations in VGG-16. Ob-
serve that from the first to the last layer the number of activations is decreasing while the
number of parameters is increasing. The decrease in activations is because the dimensions
of the feature maps are reduced by the pooling layers. The increase in the parameters is
because the number of input and output channels increases in the later layers. In clusters
with small numbers of FPGAs (to accelerate VGG-16) the memory demand of parameters
for the later layers increases to the point where the on-chip memories in each FPGA are
not big enough to cache the allocated parameters.
[Figure 7·5 panels: (A) axes: Layer, Million; (B) Before Weight Balance; (C) After Weight Balance; (D) Weight Allocation (FPGA 1, FPGA 2, FPGA 3)]
Figure 7·5: Parameters and activations for VGG-16
To enable the mapping of big networks to small clusters of FPGAs, parameter
balancing can be used. Figs. 7·5(B-D) show the method. Simply, parameters from the later
layers are stored in FPGAs where there is room, even if those FPGAs are some distance
away from where those parameters will eventually be used. For example, the parameters of
layer 8 are stored in FPGAs 3 and 1. During computation, the parameters stored in FPGA
1 are transferred to FPGA 3 through the communication network together with activations.
Note that the transport of parameters does not tighten the constraint on inter-FPGA
communication. This is because the smaller number of activations in the later layers cancels
out the added traffic for the parameters. Our experiments demonstrate the benefit of this
approach: only on-chip memory is needed for the CONV layers.
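Parameter balancing can be sketched as a greedy spill of overflowing parameters into free on-chip memory elsewhere in the pipeline. The Python sketch below is an illustration under assumptions: it spills only to upstream devices, nearest first, so that spilled parameters travel forward with the activations. That policy, the function name, and the numbers (arbitrary units) are hypothetical, not FPDeep's actual allocation rule.

```python
def balance_parameters(demand, capacity):
    """Return placement[i] = {device j: amount of device i's parameters kept on j}.

    demand[i] is the parameter storage needed by the layers computed on
    device i; capacity is the on-chip parameter memory of each device.
    """
    local = [min(d, capacity) for d in demand]
    free = [capacity - kept for kept in local]
    placement = [{i: kept} for i, kept in enumerate(local)]
    for i, d in enumerate(demand):
        overflow = d - local[i]
        # Spill to upstream devices, nearest first (assumed policy).
        for j in range(i - 1, -1, -1):
            if overflow <= 0:
                break
            moved = min(overflow, free[j])
            if moved > 0:
                placement[i][j] = moved
                free[j] -= moved
                overflow -= moved
        if overflow > 0:
            raise ValueError("parameters do not fit in upstream on-chip memory")
    return placement


# Hypothetical 3-FPGA cluster: later layers hold most of the parameters.
placement = balance_parameters(demand=[2, 5, 14], capacity=8)
```

In this toy case the third device keeps 8 units locally and borrows 3 units from each of the two upstream devices, which mirrors the example in the text of one layer's parameters being spread over more than one FPGA.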
7.4 System Design
7.4.1 Input/output channel partition implementation
FPDeep uses hybrid parallelism to partition the operator graph. We begin by noting that
partitioning in the layer dimension is straightforward (Fig. 7·3(D)). The model dimension
is more involved and is done via input/output channel partitioning. As shown in Fig. 7·6,
each device executes the operations of a fraction of the input/output channels. Input feature
maps, along with model parameters, are partitioned in the ic/oc dimension and allocated
among FPGA devices. Each device generates the partial results and their sum is the final
output activation.
There are two ways to partition the graph, by input (ICP) or by output (OCP) channels.
These methods are shown in Figs. 7·6(A) and (B), respectively.
1) ICP: Layer 1 is partitioned and mapped to 4 devices. IC input activation channels
and corresponding weights are partitioned into 4 segments each containing IC/4 channels.
Each FPGA receives one of the 4 segments and calculates partial results of activations
for all output channels. Each complete output activation is calculated by summing up the
related partial results from the 4 FPGAs.
2) OCP: OC output activation channels are partitioned into 4 segments each containing
OC/4 channels. Each FPGA is responsible for calculating a certain segment of output acti-
vations. The 4 segments’ results are then gathered. In CNN training, all activations need to
remain available while waiting for back-propagation. Therefore, all IC input feature maps
are cached in every FPGA. This duplication leads to additional on-chip memory overhead.
This drawback of OCP does not exist in ICP, so FPDeep prefers ICP; OCP is only used
when the number of input channels is too small to provide sufficient parallelism. For
example, the first layer of AlexNet has only 3 channels of input features but 96 channels of
output features, so OCP is used.
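The difference between ICP and OCP can be checked on a toy layer. The Python sketch below uses a 1×1 convolution on plain lists for brevity. This is a simplification: the same decomposition holds for larger kernels because the sum over input channels is linear. Function names, weights, and inputs are hypothetical, and channel counts are assumed to be divisible by the number of devices.

```python
def conv1x1(weights, inputs):
    """weights[oc][ic], inputs[ic][p] -> outputs[oc][p], p indexing pixels."""
    n_pix = len(inputs[0])
    return [[sum(row[ic] * inputs[ic][p] for ic in range(len(inputs)))
             for p in range(n_pix)] for row in weights]


def icp(weights, inputs, n_dev):
    """Input-channel partitioning: each device takes IC/n_dev input channels,
    produces partial sums for ALL output channels, and partials are added."""
    step = len(inputs) // n_dev
    total = None
    for d in range(n_dev):
        lo, hi = d * step, (d + 1) * step
        partial = conv1x1([row[lo:hi] for row in weights], inputs[lo:hi])
        if total is None:
            total = partial
        else:
            total = [[a + b for a, b in zip(ra, rb)]
                     for ra, rb in zip(total, partial)]
    return total


def ocp(weights, inputs, n_dev):
    """Output-channel partitioning: each device needs ALL input channels and
    produces OC/n_dev complete output channels; results are gathered."""
    step = len(weights) // n_dev
    out = []
    for d in range(n_dev):
        out.extend(conv1x1(weights[d * step:(d + 1) * step], inputs))
    return out


# Hypothetical layer: IC = 4, OC = 4, 3 pixels, mapped to 2 devices.
weights = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
inputs = [[1, 0, 2], [0, 1, 1], [3, 1, 0], [1, 1, 1]]
reference = conv1x1(weights, inputs)
icp_out = icp(weights, inputs, 2)
ocp_out = ocp(weights, inputs, 2)
```

Both partitionings reproduce the unpartitioned result. The difference is in what each device must hold: `icp` passes only a slice of `inputs` to each device, while `ocp` passes the full `inputs` to every device, which is the duplication discussed above.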
[Figure 7·6 panels: (A) ICP case implementation; (B) OCP case implementation; label: Broadcast]
Figure 7·6: Partitioning image input/output channels ic/oc
7.4.2 Dataflow Analysis and Interconnection Topology
Fig. 7·7(A) shows an N layer CNN mapped to an FPGA cluster with M devices. Each
CNN layer contains Oi operations (i ∈ [1,N]). The computation capacity of each device is
C operations per second. To balance compute workloads among devices, the workloads are
mapped to FPGAs in proportion to the device's compute capacity. Each device needs to
execute $W = \frac{1}{M}\sum_{i=1}^{N} O_i$ operations.
Figure 7·7: Data flow analysis of CNN training. FPDeep pipelines the
reduction operations and maps them to multiple FPGAs.
Fig. 7·7(B) zooms in on the CNN training procedure that turns the dataflow into an
operator graph as the summation of all activation channels is cut into several pieces. Each
piece only needs to add the local convolution result Ri to the previous node’s intermediate
result Ri−1. The workloads, i.e., the arithmetic operations, of a whole network are parti-
tioned and mapped onto M FPGA nodes in proportion to their computation capacities. Fig.
7·7(C) shows the data streams in the FPGA cluster. Note that a 1-D interconnect topology
is sufficient.
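The balanced mapping onto a 1-D chain can be sketched as follows. This is a simplified illustration rather than FPDeep's mapping tool: it assumes M identical devices, so every device receives the same share of the total operation count, and it walks the layers in order so that each device only ever holds consecutive work. The function name and the example operation counts are hypothetical.

```python
from fractions import Fraction


def partition(ops_per_layer, num_devices):
    """Split per-layer operation counts evenly over a 1-D chain of devices.

    Returns (share, plan) where share is the work per device and
    plan[d] maps layer index -> operations of that layer placed on device d.
    """
    share = Fraction(sum(ops_per_layer), num_devices)
    plan = [dict() for _ in range(num_devices)]
    dev, room = 0, share
    for layer, ops in enumerate(ops_per_layer):
        left = Fraction(ops)
        while left > 0:
            take = min(left, room)          # fill the current device first
            plan[dev][layer] = plan[dev].get(layer, 0) + take
            left -= take
            room -= take
            if room == 0 and dev < num_devices - 1:
                dev += 1                    # move on to the next device in the chain
                room = share
    return share, plan


# Hypothetical 3-layer network mapped to 4 devices.
share, plan = partition([30, 50, 20], 4)
```

With these example numbers every device receives 25 operations; device 1 finishes layer 0 and starts layer 1, so a layer may span several devices and a device may serve several layers.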
Fig. 7·8 illustrates the topology design choice by mapping VGG-16's CONV-3 to
CONV-5 layers onto a cluster with eight VC709 FPGA boards (see Section 7.6) according
to FPDeep's operator graph partition method and FPGA resource allocation policy. The
red dotted box marks the communication bottleneck. Let us assume 10 board-to-board
interconnection ports. First assume a 2D topology. We see that Dev-1 is the bottleneck:
because the degree of Dev-1 is 5, each communication link of Dev-1 has only two ports.
With a 1D topology, however, Dev-4 is the bottleneck. Some of the partial output
activations need to be duplicated (dotted arrows in and out of Dev-4). But the degree of
Dev-4 is only 2, so each communication link has 5 interconnection ports, making the
communication more efficient.
(A) 2D topology implementation
(B) 1D topology implementation
Legend: channel_num x act_width x act_width; Input Activation; Partial Output Activation
Figure 7·8: 1D-2D topology design choice: while 2D seems the obvious
choice, clearly 1D has better performance
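The port arithmetic behind this comparison is simple enough to write down. The sketch below assumes, as in the text, 10 board-to-board ports per device that are shared evenly among the links of the bottleneck device; the helper name is hypothetical.

```python
def ports_per_link(total_ports, degree):
    """Ports available on each link when a device splits its ports evenly."""
    return total_ports // degree


TOTAL_PORTS = 10
links = {
    "2D, bottleneck Dev-1 (degree 5)": ports_per_link(TOTAL_PORTS, 5),
    "1D, bottleneck Dev-4 (degree 2)": ports_per_link(TOTAL_PORTS, 2),
}
for name, ports in links.items():
    print(name, "->", ports, "ports per link")
```

Each link of the 1D bottleneck device therefore has 2.5 times as many ports as a link of the 2D bottleneck device, although the 1D case also has to carry some duplicated partial activations.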
Fig. 7·9 shows quantitatively how the choice of topology affects performance. For
clusters larger than 5 nodes, the 1D topology is better. As the number of nodes increases,
this advantage becomes even more apparent. For clusters with 4 nodes, the 2D topology is
better because the degree of the bottleneck device is only 2.
A further advantage of 1D topology versus 2D is its simplicity. With only single links,
different dataflow types are multiplexed and easily scheduled. Also, for 2D the reduction
operation of each DNN layer is centralized, which incurs significant synchronization over-
head and requires more data movement.
Another consideration is that, practically, FPGA accelerator boards almost always have
less communication capability than the FPGAs themselves, both in bandwidth and in number
of ports. This makes the choice of 1D even more crucial. An interesting exception is
accelerator boards with multiple tightly coupled FPGAs. For single boards with, say, four
FPGAs, we have already noted that 2D is preferred. For clusters with multiple multi-FPGA
boards, because internode connectivity is more limited than intranode, the preferred
internode topology is again 1D. Within the node, however, the additional in situ connections
remain useful, leading to a hierarchical topology.
7.4.3 Deep Fine-Grained Pipeline
To illustrate data dependencies during training, we use as an example two CONV layers
with 3×3 kernels. The operations of these two layers' forward/backward propagation
are shown in Fig. 7·10(A). In forward propagation, a 7×7 feature map is fed into Layer 1
and a 5×5 feature map is generated. At Layer 2 the 5×5 feature map is convolved with the
parameters to produce a 3×3 feature map. In backward propagation, the 3×3 error
map is padded to 7×7 before it is fed to Layer 2. Next, the error map and corresponding
parameters are convolved and another (5×5) error map is produced; this is used for Layer
2's weight/bias gradient calculation. At Layer 1, the 5×5 error map is padded to 9×9 and
then convolved to 7×7.
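The map sizes in this example follow from two small formulas, sketched below under the assumption of stride-1 convolutions without padding in the forward pass and zero padding of K−1 pixels on every side in the backward pass. The function names are hypothetical.

```python
def conv_output_size(n, k):
    """Width of the map produced by a stride-1, unpadded k x k convolution."""
    return n - k + 1


def padded_size(n, k):
    """Width of an error map after padding k-1 pixels on every side."""
    return n + 2 * (k - 1)


K = 3
# Forward propagation: 7x7 -> 5x5 -> 3x3
act1 = conv_output_size(7, K)
act2 = conv_output_size(act1, K)
# Backward propagation: 3x3 padded to 7x7 -> 5x5, then padded to 9x9 -> 7x7
err_pad2 = padded_size(act2, K)
err1 = conv_output_size(err_pad2, K)
err_pad1 = padded_size(err1, K)
err0 = conv_output_size(err_pad1, K)
print(act1, act2, err_pad2, err1, err_pad1, err0)
```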
(C) 1D topology implementation
Figure 7·9: 1D-2D topology performance comparison
Fig. 7·10(A) depicts the data dependencies of forward and backward propagation
during CNN training. In the forward propagation phase, the image is propagated through
all layers. To determine the data dependency, we start from the four activations in the
output feature map of the last layer, which are marked as black, red, blue, and yellow,
respectively, and trace backward to find the region of the input feature maps on which each
depends. In the backward propagation phase, the calculated errors are propagated backward
through the network. To calculate gradients at a particular layer, the errors that are
back-propagated from the next layer and the activations of its feature maps are necessary.
Hence, the feature maps, which are generated in the forward propagation phase, need to
remain available awaiting backward propagation. As shown in Fig. 7·10, activations and errors
among CONV layers show only fine-grained dependencies. That is, to begin computing
the value of a pixel in a layer, only a fraction of the pixels from the previous layer are
needed. Therefore the computation of a layer can start well before the previous
layer is completely done. This provides the opportunity to process all CONV layers in
parallel in a fine-grained pipeline.
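The fine-grained dependency can be made concrete by tracing one output pixel back to the input. The sketch below assumes stride-1, unpadded K×K convolutions and, purely for illustration, that input pixels arrive in raster (row-major) order; both helper names are hypothetical.

```python
def input_region(row, col, k, layers):
    """Inclusive (row_lo, row_hi, col_lo, col_hi) of the input pixels that one
    output pixel depends on after `layers` stride-1, unpadded k x k convolutions."""
    span = layers * (k - 1)
    return row, row + span, col, col + span


def pixels_needed(row, col, k, layers, input_width):
    """Number of raster-ordered input pixels that must have arrived before the
    output pixel at (row, col) can be computed."""
    _, row_hi, _, col_hi = input_region(row, col, k, layers)
    return row_hi * input_width + col_hi + 1


# Two 3x3 CONV layers on a 7x7 input, as in the example above.
first = pixels_needed(0, 0, 3, 2, 7)
last = pixels_needed(2, 2, 3, 2, 7)
print(input_region(0, 0, 3, 2), first, last)
```

Under these assumptions the first output pixel of Layer 2 depends on a 5×5 input region and can be produced once 33 of the 49 input pixels have arrived, which is why downstream layers can start long before upstream layers finish.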
Fig. 7·10(B) shows the traditional method of accelerating CNN training. First, the
N channels of the feature maps are fed into the convolution layer L1. Next, results from all
M channels begin to be processed while the convolution kernel slides across the N-channel
feature maps. Much storage capacity is needed to maintain all intermediate feature maps.
Clearly, this method is not efficient. The fine-grained alternative is shown in Fig. 7·10(C):
the calculation of an activation/error starts as soon as the activations/errors it depends on
are propagated from the previous/next layer. The basic processing unit of FPDeep is an
activation/error of a feature/error map; this is in contrast to the traditional method's basic
unit of the entire feature/error map. The result is both a large increase in parallelism through
the added pipeline stages and a reduction in storage so that only on-chip memory is needed.
7.4.4 Parameter Alignment
In contrast to traditional DP, where centralized gradient aggregation and weight update
need to be performed sequentially, FPDeep conducts all of these processes in parallel.
In order to achieve full hardware efficiency, we use a distributed and slightly-unaligned
weight update scheme: gradient calculation, aggregation, and weight update are always
performed locally. With respect to the parameter alignment, after the last training sample
in a certain mini-batch (round M) is forward propagated through the cluster, its backward
propagation follows immediately. At the same time, the forward propagation of training
samples in the next mini-batch (round M + 1) follows using the old parameters of round
M until the last training sample (from round M) is backward propagated. The deep fine-
grained pipeline used in FPDeep guarantees fast feature and error propagation and reduces
the time that old parameters need to wait for the weight update, i.e., it eases the parameter
alignment issue. Based on our experiments, with a mini-batch size of 1K and a cluster
with 100 FPGAs, only the first 5 training samples of each epoch suffer from the resulting
slight non-alignment. Moreover, this slight parameter non-alignment does not affect the
convergence rate (as discussed in Section 7.6.4 and shown in Fig.7·15(B)(D)(F)).
Existing work considers parameter alignment and high throughput to require a trade-off.
Google's GPipe [Huang et al., 2019] and Microsoft's PipeDream [Narayanan et al., 2019]
use a similar pipeline scheme to build a distributed DNN training system but use different
alignment methods. GPipe divides the input mini-batch into several smaller micro-batches,
enabling different GPUs/TPUs to work on different micro-batches simultaneously. GPipe
needs to flush the pipeline and synchronize the gradients among all accelerators after the
computation of the whole mini-batch finishes. GPipe's solution introduces many bubbles in
the pipeline. Moreover, GPipe focuses on fitting oversized DNNs onto multiple accelerators,
not on solving the large-batch training problem. To the best of our knowledge, the approach
used in GPipe makes the large-batch problem even more severe: more accelerators require
more micro-batches and, in order to fill up each device, the size of micro-batches must be
relatively large. As with GPipe, Microsoft's PipeDream also uses coarse-grained workload
partitioning and pipelining. However, in contrast to GPipe, PipeDream suffers from the
parameter alignment issue. The authors propose a technique called weight stashing to save
multiple versions of the parameters and so align parameters on a slightly longer time scale.
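The bubble overhead of a flush-based pipeline can be estimated with the idealised expression usually quoted for GPipe-style schedules, (stages − 1) / (micro-batches + stages − 1). The sketch below evaluates it; it assumes equal stage times and is not taken from the experiments in this chapter. The function name and the example stage and micro-batch counts are hypothetical.

```python
from fractions import Fraction


def bubble_fraction(stages, micro_batches):
    """Idealised share of pipeline time lost to fill and drain in a
    synchronous pipeline that is flushed after every mini-batch."""
    return Fraction(stages - 1, micro_batches + stages - 1)


for micro_batches in (4, 32):
    print(micro_batches, bubble_fraction(4, micro_batches))
```

With 4 stages the idle share only drops from 3/7 to 3/35 when the number of micro-batches grows from 4 to 32, which illustrates why such schemes push toward larger mini-batches as accelerators are added.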
The optimization target of this work is throughput, i.e. epoch/h. Note that in FPDeep,
higher throughput is equivalent to reduced training execution time as FPDeep, because of
the small mini-batch size, does not require more epochs to converge.
7.5 Hardware Accelerator Architecture
7.5.1 Overall Architecture
The overall architecture of the multi-FPGA accelerator and the detailed architecture of each
FPGA are shown in Fig. 7·11. FPDeep maps a 3-layer CNN to p FPGAs. All p FPGAs are
connected in a 1-D topology. In Fig. 7·11, Ft denotes the input features of layer t and Et
denotes the errors backward propagated from layer t. There are six key datapaths. Datapaths
1, 2 and 5 are for FP, while 3, 4 and 6 are for BP.
1. Output activations from layer (t−1) are allocated to FPGAs of layer t according to the
ICP results. Each FPGA caches a segment of the features allocated to it and propagates the
rest to the next node.
2. Using the segment of features cached at Datapath 1, each FPGA calculates partial results;
node m propagates them to node m+1 through Datapath 2. After adding up partial features
produced by nodes m and m+1, the updated partial features are propagated to the next node.
3. In each cycle, errors from layer (t+1) are back-propagated to the FPGAs of layer t
through Datapath 3.
4. Using errors from Datapath 3, each FPGA calculates the errors of the features allocated
to it at Datapath 1 and propagates them to the preceding node. Node m propagates the
errors calculated by itself first and then the errors transferred from node m+1.
5. Parameters are transferred from the node where they are cached for parameter load
balancing to the node where they are used to compute the output features.
6. The gradients of parameters are transferred from the node where they are produced to
the node where they are cached for parameter load balancing.
Figure 7·10: FPDeep's fine-grained pipeline design showing (A) data
dependencies of CNN training; (B) traditional data parallelism's coarse-grained
pipeline; (C) FPDeep's fine-grained pipeline.
Figure 7·11: Overall architecture of FPDeep accelerator and block design
of each FPGA illustrating (A) the overall architecture of FPDeep; FPGAs
can work cooperatively on the same layer; also, multiple layers can be
mapped on the same FPGA; (B) architecture of FPGA m, which is allocated
to both layer 1 and layer 2; (C) architecture of FPGA n+1, which is
fully allocated to layer 3.
The proposed architecture is generally useful for SGD-based training of any feed-forward
CNN and can be extended to support CNNs with more complex topologies
such as ResNet and Inception. As described in Section 7.3.2, as long as a DNN can be
described as a one-way graph with nodes representing pipelined groups of consecutive layers,
FPDeep can efficiently partition and map its training logic to an FPGA cluster. New modules
are needed as follows: aggregation for the filter concatenation in Inception; duplication
of FP, PG, and EB for the parallel CONV and Pooling kernels in Inception; and gather
and bypass for the various types of shortcuts in ResNet. Integrating these into FPDeep is
straightforward and will be part of the next-generation system.
7.5.2 Single-FPGA Architecture
As shown in Fig.7·11, each FPGA includes FP, PG, and EB modules, as well as a memory
subsystem to cache parameters, gradients, and activations. Each accelerator has 6 inter-
connection modules to communicate with its neighbors (this number is selected because
it is available on many boards used for FPGA clusters and is sufficient for good scaling).
An FPGA can be allocated to multiple layers. Implementations with FPGAs working on
multiple layers and on a single layer are illustrated in Fig. 7·11(B) and (C), respectively.
Interconnection
There are two pairs of interconnection modules. a) The upper pair, used by forward
datapaths 1, 2 and 5, 1) receives input features and partial features propagated by the
preceding node; 2) bypasses the input features which are not mapped to it to the succeeding
node; 3) adds partial results produced by FP to the received partial features and propagates
updated partial features to the succeeding node; 4) forwards the parameters and gradients
from the node which caches them to the node which produces them. b) The bottom pair,
used by backward datapaths 3, 4 and 6, 1) receives errors from the next layer bypassed by
the succeeding node and passes them on to the preceding node; 2) receives errors of this
layer calculated by the succeeding node; 3) after errors are calculated by the EB module,
propagates them to the preceding node; 4) forwards the parameters and gradients from the
node which calculates them to the node which caches them.
Memory Subsystem
The memory subsystem includes BRAM-based modules and stores activations, parameters,
and gradients.
1. Activation RAM (Act-RAM) caches input activations mapped to the target FPGA until
they are consumed in back-propagation and provides input activations as operands to the FP
and PG modules for output activation and parameter gradient calculation. After input
activations are consumed in FP, they are kept in Act-RAM to be reused during BP to
calculate parameter gradients. Act-RAM is implemented as a FIFO-based memory. In BP,
when errors are calculated and propagated backward from the adjacent FPGAs to a certain
device, the features stored earlier in Act-RAM are first consumed by the PG module for
parameter gradient calculations.
2. Local Parameter RAM (LPRAM) caches parameters used as operands to produce the
output activation at the local FPGA. For each FPGA, there are SF × K × K × OC
parameters stored in LPRAM, where SF is the number of activations in the activation segment
allocated to an FPGA. To provide enough concurrency of parameter access, LPRAM is
designed as an (SF × K × K)-bank line buffer. Each bank caches OC parameters.
3. Balanced Parameter RAM (BPRAM) caches the parameters mapped to the local device
for parameter load balancing. These parameters are used as operands in other FPGA
devices where on-chip memory is insufficient to cache all the required parameters. Both
LPRAM and BPRAM are updated by PG. Similar to LPRAM, BPRAM is implemented as a
multi-bank line buffer. The number of banks and their depths are decided by the parameter
balancing scheme.
4. Local Gradient Buffer (LGB) caches the gradients of the parameters stored in LPRAM.
The gradients are cached and averaged at each iteration. Once a full mini-batch of
input images has been trained, the averaged gradients are forwarded to
LPRAM and update the parameters stored there.
5. Balanced Gradient Buffer (BGB) caches the gradients of the parameters stored in
BPRAM. These gradients are generated by and transferred from the device where the
corresponding parameters are consumed.
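The LPRAM organization described in item 2 can be sized with a few multiplications. The sketch below is illustrative only: the example values of SF, K, and OC are hypothetical, and the 4 bytes per parameter assume single-precision storage.

```python
def lpram_layout(sf, k, oc, bytes_per_param=4):
    """Bank count, bank depth, and capacity of an LPRAM holding SF*K*K*OC parameters."""
    banks = sf * k * k          # one bank per (input channel, kernel position)
    depth = oc                  # each bank holds one parameter per output channel
    parameters = banks * depth
    return {
        "banks": banks,
        "depth": depth,
        "parameters": parameters,
        "bytes": parameters * bytes_per_param,
    }


# Hypothetical segment: 8 input channels, 3x3 kernels, 128 output channels.
layout = lpram_layout(sf=8, k=3, oc=128)
print(layout)
```

Reading one entry from every bank in the same cycle yields the SF × K × K parameters that one output channel needs, which is the access concurrency the multi-bank design is meant to provide.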
7.5.3 Forward Propagation (FP)
The Line Buffer (LB) reads input features from the Act-RAM and feeds them to the
Convolution Engines (CEs). The CEs perform convolutions with parameters from the LPRAM
and input activations from the LB. In the Special Function Unit (SFU), the output features
are activated, normalized, and sampled based on network specifications. Afterwards,
features are transferred to the Partial Activation Buffer (PAB) and added to the partial
features produced and propagated by the preceding node. Finally, the updated partial
activations are propagated to the next device through the interconnection module.
We use ICP as an example to show how the FP module calculates partial activation
results. Assuming a certain FPGA node has been allocated S_IC channels, at each cycle,
K×K×S_IC activations are read from the line buffers and broadcast to the CEs. Each CE
consists of S_IC convolution tiles. Each convolution tile has K×K multiply-accumulate
units and executes a K×K convolution operation per cycle. With all convolution tiles
working on different input channels, at each cycle, each CE can finish calculating the
partial results of one output channel. In the FP module, there are multiple CEs calculating
the activations of different channels in parallel. The number of CEs, P, is determined by the
number of DSPs allocated to FP operations during the offline ICP mapping. When partial
activations are calculated, they are forwarded to partial activation buffers where they are
used to update the partial results produced by the previous nodes.
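A behavioral model of one FP cycle may help to fix the roles of tiles and CEs. The sketch below is a software illustration, not the RTL: a tile is a K×K multiply-accumulate, a CE sums S_IC tiles to produce one partial output activation, and P CEs work on P output channels from the same broadcast windows. All names and the small K = 2, S_IC = 2, P = 2 test data are hypothetical.

```python
def tile(window, kernel):
    """One convolution tile: K x K multiply-accumulate."""
    return sum(a * b for wr, kr in zip(window, kernel) for a, b in zip(wr, kr))


def ce_cycle(windows, filt):
    """One CE: S_IC tiles, one per input channel, summed into one partial activation."""
    return sum(tile(w, k) for w, k in zip(windows, filt))


def fp_cycle(windows, filters):
    """P CEs share the broadcast windows; each produces one output channel."""
    return [ce_cycle(windows, filt) for filt in filters]


def fp_macs_per_cycle(p, s_ic, k):
    """Multiply-accumulate units kept busy by P CEs of S_IC tiles each."""
    return p * s_ic * k * k


# K = 2, S_IC = 2 input channels, P = 2 CEs.
windows = [[[1, 2], [3, 4]], [[0, 1], [1, 0]]]
filters = [
    [[[1, 0], [0, 1]], [[1, 1], [1, 1]]],   # filter of output channel 0
    [[[0, 1], [1, 0]], [[2, 0], [0, 2]]],   # filter of output channel 1
]
partials = fp_cycle(windows, filters)
print(partials, fp_macs_per_cycle(p=2, s_ic=2, k=2))
```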
7.5.4 Error Back-Propagation (EB)
The EB module consumes errors from the next layer and produces errors for the target
layer.
It takes two steps to calculate the error of each input activation: (1) errors of all output
feature maps are convolved, respectively, with their parameter filters and (2) their
convolution outputs are summed. In FPDeep, for an FPGA allocated S_IC input channels,
the EB module calculates S_IC errors in parallel. In EB, CEs are used to perform the
convolutions of errors of output channels and their parameters. In contrast to the FP
module, where the number of convolution tiles in each CE is S_IC, each EB module
has S_IC CEs. Each CE has P tiles and each tile can perform a K×K convolution
operation. The number of convolution tiles in each CE is pre-determined during ICP
mapping. With P tiles per CE, at each cycle, the errors from P of the OC output
channels are broadcast to, and consumed by, the S_IC CEs. The outputs are partial results of
errors at S_IC input channels. After OC/P cycles, all output channels are evaluated and the
complete results of errors are forwarded to the Error Buffer.
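The cycle-by-cycle accumulation in EB can be modeled as below. This is a behavioral sketch under stated assumptions: each of the S_IC CEs consumes the error windows of P output channels per cycle, the kernels are assumed to be already arranged for back-propagation, and the names and the small test tensors are hypothetical.

```python
def dot(a, b):
    """K x K multiply-accumulate performed by one tile."""
    return sum(p * q for ra, rb in zip(a, b) for p, q in zip(ra, rb))


def eb_errors(err_windows, kernels, p):
    """Accumulate input-channel errors over ceil(OC / P) cycles.

    err_windows[oc] is the K x K error window of output channel oc;
    kernels[oc][ic] is the K x K kernel linking input channel ic to oc.
    Returns (errors per input channel, number of cycles).
    """
    oc_n, s_ic = len(kernels), len(kernels[0])
    acc = [0] * s_ic
    cycles = 0
    for start in range(0, oc_n, p):              # one cycle per group of P channels
        group = range(start, min(start + p, oc_n))
        for ic in range(s_ic):                   # the S_IC CEs run in parallel in hardware
            acc[ic] += sum(dot(err_windows[oc], kernels[oc][ic]) for oc in group)
        cycles += 1
    return acc, cycles


OC, S_IC, P = 4, 2, 2
err_windows = [[[oc + 1, 0], [0, 1]] for oc in range(OC)]
kernels = [[[[ic + 1, 0], [0, oc]] for ic in range(S_IC)] for oc in range(OC)]
errors, cycles = eb_errors(err_windows, kernels, P)
print(errors, cycles)
```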
7.5.5 Parameter Gradient Calculation (PG)
PG consumes errors of the next layer propagated from the succeeding neighbor and cal-
culates gradients of the parameters. Errors of output activations are used as convolution
filters on the input activations which are cached in the Act-RAM during forward propaga-
tion. Gradients are cached in the Gradient Buffer and used to update parameters in LPRAM
when a mini batch of samples is trained.
In contrast to the convolutions in FP and EB, where the filter size is normally smaller
than 7, the filter sizes in PG (R and C) can be in the hundreds. This requires expensive
convolution tiles; the resources required can even exceed those of the FPGA. Even if this
does not happen, the PG may still occupy most of the computing resources and result in a
serious workload imbalance. In PG, K×K convolutions with W×W filters need to be performed.
In FPDeep, we cut the K×K large convolutions into W×W small (K×K) convolutions. The overall
operation count stays the same, but in this case the PG module always fits the DSP resource
constraint and the CE array design of PG is similar to the ones in FP and EB.
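One way to read this decomposition is sketched below for a single input/output channel pair. `grad_large` evaluates each of the K×K gradient entries as one W×W multiply-accumulate, while `grad_small` visits the W×W error positions and performs a K×K multiply-accumulate at each; both use K·K·W·W multiplications and give the same result. The function names and test data are assumptions for illustration, not the FPDeep implementation.

```python
def grad_large(act, err, k):
    """K x K gradient entries, each computed as one W x W multiply-accumulate."""
    w = len(err)
    return [[sum(err[r][c] * act[r + i][c + j]
                 for r in range(w) for c in range(w))
             for j in range(k)]
            for i in range(k)]


def grad_small(act, err, k):
    """Same gradient, accumulated from W x W small K x K multiply-accumulates."""
    w = len(err)
    grad = [[0] * k for _ in range(k)]
    for r in range(w):
        for c in range(w):                 # one small K x K operation per error pixel
            for i in range(k):
                for j in range(k):
                    grad[i][j] += err[r][c] * act[r + i][c + j]
    return grad


K, W = 3, 6
act = [[(3 * r + 5 * c) % 7 - 3 for c in range(W + K - 1)] for r in range(W + K - 1)]
err = [[(2 * r - c) % 5 - 2 for c in range(W)] for r in range(W)]
```

The small form only ever needs K×K multipliers at a time, which is why the same CE array style as in FP and EB can be reused regardless of the feature-map width.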
7.6 Experiments and Evaluation
In this section, we describe experiments performed to evaluate the efficiency of FPDeep.
First, we evaluate the correctness and performance of the design on a small FPGA cluster
with eight Xilinx VC709 boards. Based on these results, we validate a cycle-accurate
software simulator. Because the small FPGA cluster is insufficient for complex neural
networks, we use the software simulator to evaluate the performance of FPDeep on large-
scale clusters.
7.6.1 Small Scale Cluster Experiments
The small scale cluster experiments use a cluster of eight Xilinx VC709 evaluation boards.
As shown in Fig. 7·12(A), each VC709 motherboard contains one XC7VX690T FPGA, an
FMC-HPC connector for the daughterboard extension, and an SMA FMC, which contains
32 SMA connectors. Some of the SMA connectors are used for forward propagation,
others for backward propagation. The eight FPGA boards are connected in a 1-D daisy
chain. This cluster is used to validate the parameterized hardware accelerators (Section
7.5), to perform the topology experiments (Section 7.4), and to validate the correctness of
the software simulator.
7.6.2 Large Scale Cluster Experiments
The FPDeep software simulator is based on MPI (OpenMPI 2.1.1; Fig. 7·12(B)); n
CPUs work in a pipelined manner. For each CPU, we have two MPI process groups,
one each for FP and BP. Additional MPI processes handle data exchange with adjacent
CPUs, broadcast the previous CPU's intermediate activations, and reduce the current CPU's
results. The simulator is parameterized to support various FPGA platforms. For example,
the MPI processes which handle communication have configurable parameters that enable
accurate simulation of the data exchange among FPGA boards with different interconnect
bandwidths and latencies.
(A) Small scale FPGA cluster
(B) MPI-based software simulator
Figure 7·12: Hardware evaluation and MPI-based software simulator
The software simulator is currently used to evaluate large-scale VGG-19, VGG-16,
and AlexNet training. For larger and more complex DNNs such as ResNet, GoogLeNet,
and Recurrent Neural Networks, we will use emerging large-scale FPGA clusters. For
example, the Open Cloud Testbed (associated with the Massachusetts Open Cloud), which
was kicked off at the end of 2019, will in the first stage be equipped with at least 64 Xilinx
Alveo U280 boards and be publicly available.
Figure 7·13: Experimental results and utilization report when mapping
AlexNet and VGG-16/19 to a cluster with 15 FPGAs
7.6.3 Utilization and Performance
FPGA Resource Utilization
For illustration, we map AlexNet and VGG-16/19 to a cluster with 15 FPGAs. Figs.
7·13(A-I) show resource utilization of each FPGA and resource allocations among the net-
work layers. As shown in the DSP utilization reports (Figs. 7·13(C)(F)(I)), the mapping
is well-balanced. The utilization of DSP slices is roughly 80% and the throughput of each
FPGA is around 1 TOPS. On-chip BRAM is only used in the FPGAs that work solely on
the CONV layers (FPGAs 1-14) and utilization of BRAMs is under 80%. The highest
bandwidth requirement among these 15 FPGAs for these three networks is 18.6 Gb/s.
For AlexNet (only 8 layers), the 15-node cluster does not require parameter balancing
to achieve the best performance so CLB and BRAM utilization have been left unbalanced.
For VGG-16/19, however, parameter balancing is required.
Table 7.5: Cluster-level experimental results. All CPU, GPU, and FPGA
implementations use single precision floating point. [Zhang et al., 2016a],
[LeaderGPU, 2018a], and [LeaderGPU, 2018b] do not give experiment re-
sults of training time per epoch
CPU GPU GPU FPGA FPDeep
0.39 4.22 7.87 6.86 6.55 8.28 37.09 37.88 38.13
Performance and Power Efficiency
Table 7.5 compares performance and power efficiency among the Titan X GPU [Zhang
et al., 2016a], the Tesla K80 GPU [LeaderGPU, 2018a], a previous FPGA implementation
[Zhang et al., 2016a], and this work.
[Zhang et al., 2016a] uses a workstation with an 8-core 3.8GHz AMD A10-5800K
CPU and an Nvidia Titan X GPU. We use a server with Nvidia Tesla K80 GPUs as the
golden model and baseline design. OpenBLAS and cuDNN libraries are used in software
implementations. [Zhang et al., 2016a]’s CPU & GPU and our GPU implementation are
all based on data parallelism, while [Zhang et al., 2016a]’s FPGA design is based on layer
parallelism; the latter results in inter-board workload imbalance. The power consumption
of all baseline and FPDeep systems is measured at board level with a power meter.
FPDeep provides performance 5× higher than previous FPGA work and comparable to
the Titan X GPU. We evaluate energy efficiency with respect to GOPs/J. FPDeep provides
8.8× better energy efficiency than the Titan X and 5.6× better than the previous FPGA
work. Compared with the K80, FPDeep provides 5.7× better energy efficiency.
Load Balance and Optimization
(B) Percentage of idle stages for AlexNet
(C) Roofline Model for VGG-16
(D) Percentage of idle stages for VGG-16
(E) Roofline Model for VGG-19
(F) Percentage of idle stages for VGG-19
Figure 7·14: Roofline models, percent idle stages, and epochs per hour of
AlexNet, VGGNet-16, and VGGNet-19
AlexNet, VGG-16, and VGG-19 are mapped onto clusters of 5 to 85 FPGAs with the
cycle-accurate simulator. To demonstrate workload balance among FPGAs in clusters of
different sizes, we present the proportions of idle stages. Figs. 7·14(B)(D)(F) show that
this is always under 5%. When the number of FPGAs is more than 30, this number is stable,
fluctuating between 0.5% and 1%. Generally, as the number of FPGAs increases, the
proportion of idle stages decreases. The reason is that during ICP and OCP, the number of
DSPs allocated to each layer is rounded to a multiple of K×K. With more FPGAs and
DSP resources, this effect is reduced.
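The effect of rounding a layer's DSP allocation to a multiple of K×K can be illustrated with a toy calculation. The sketch assumes that the allocation is rounded down and that the DSPs left over stay idle; the DSP budgets are hypothetical and are not measurements from the experiments.

```python
from fractions import Fraction


def idle_fraction(dsp_budget, k):
    """Share of a layer's DSP budget left unused after rounding down to a
    multiple of K x K (one convolution tile)."""
    tile = k * k
    used = (dsp_budget // tile) * tile
    return Fraction(dsp_budget - used, dsp_budget)


for budget in (40, 100, 1000):
    print(budget, idle_fraction(budget, 3))
```

For 3×3 kernels the unused share in this toy model falls from 1/10 at 40 DSPs to 1/1000 at 1000 DSPs, in line with the trend that larger clusters, which give each layer more DSPs, show fewer idle stages.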
(A) Epochs/h for AlexNet
(B) Convergence rate for AlexNet
(C) Epochs/h for VGG-16
(D) Convergence rate for VGG-16
(E) Epochs/h for VGG-19
Legend: Centralized Training; FPDeep Training
Figure 7·15: FPDeep's performance scalability and convergence rate.
Computation and communication are critical constraints on system throughput. The
roofline plots of AlexNet, VGG-16, and VGG-19 are shown in Figs. 7·14(A)(C)(E). Note
that the throughput scales linearly up to the constraint imposed by inter-FPGA
communication. For example, with 150 Gbps as the inter-board communication constraint,
FPDeep shows linearity up to 83, 56, and 70 FPGAs for AlexNet, VGG-16, and VGG-19,
respectively. As each transceiver (of that generation) can reach a maximum rate of 28
Gb/s, using 6 transceivers per FPGA (up to 168 Gb/s in aggregate) achieves this number
[Geng et al., 2019a, Sheng et al., 2018b, Sheng et al., 2018a, George et al., 2016,
Yang et al., 2019a, Yang et al., 2019b].
Since high-end FPGAs frequently have more than 50 transceivers, scaling to much
larger clusters is possible. The reason that the bandwidth required by VGG-16 is larger than
that of VGG-19 is straightforward: VGG-19 has more layers and thus more workload. During
partitioning, with the same overall hardware resources, each layer of VGG-19 is allocated
fewer resources. Thus, fewer batch features in each layer can be computed and transferred
in parallel, which results in a smaller bandwidth requirement.
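The transceiver arithmetic and the shape of the roofline can be sketched as follows. The 28 Gb/s rate, the 6 transceivers, and the 150 Gb/s constraint come from the text; the roofline function is a generic min-of-two-bounds model, and the 83 TOPS communication ceiling used in the example is a hypothetical value chosen only so that the knee falls at 83 devices of roughly 1 TOPS each.

```python
def aggregate_bandwidth_gbps(transceivers, rate_gbps):
    """Peak inter-board bandwidth of one FPGA."""
    return transceivers * rate_gbps


def roofline_tops(n_fpgas, per_fpga_tops, comm_ceiling_tops):
    """Throughput grows linearly with cluster size until communication caps it."""
    return min(n_fpgas * per_fpga_tops, comm_ceiling_tops)


available = aggregate_bandwidth_gbps(transceivers=6, rate_gbps=28)
print(available, available >= 150)

# Hypothetical ceiling of 83 TOPS with about 1 TOPS per FPGA.
for n in (50, 83, 100):
    print(n, roofline_tops(n, 1, 83))
```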
7.6.4 DNN Model Convergence
Figs. 7·15(A)(C)(E) show the number of epochs that can be trained per hour. FPDeep
provides a linear speedup of training per epoch. As hybrid model/layer parallelism does
not constrain the choice of mini-batch size, the optimal learning rate and mini-batch size
can always be applied in SGD, leading to the minimum number of epochs needed to
train to a given accuracy. Hence, the linear speedup of training per epoch results in a
linear speedup of CNN training.
Figs. 7·15(B)(D)(F) show the convergence rates of FPDeep and the traditional cen-
tralized DP training using the same small mini-batch size in SGD. The results show
that FPDeep has similar convergence rates compared with the traditional centralized DP
method, demonstrating that the slightly-unaligned weight update of FPDeep does not in-
troduce additional training epochs. For the centralized case, we use a Sugon W740-G20
GPU server, which contains two Tesla K80 GPUs. The experiment is based on the Darknet
framework; the CUDA library is cuDNN 5.0. For the FPDeep case, we use a Sugon CX50-
G20 CPU cluster with an Intel Xeon E5-2680 v3 CPU. The FPDeep software simulator is
compiled with gcc 7.1 and OpenMPI 2.1.1. The training dataset is CIFAR-10.
7.7 Discussion and Future Work
We propose a framework, FPDeep, which handles model-level parallelism efficiently in
DNN training, maps the training logic of DNNs to multi-FPGA clusters with high efficiency,
and automatically generates RTL implementations for target networks and clusters.
With FPDeep, clusters of FPGAs work in a deeply-pipelined manner using a 1-D topol-
ogy; this enables the accelerators to map directly onto existing platforms, including Cata-
pult, Catapult2, and almost any tightly-coupled FPGA cloud or cluster. FPDeep uses two
mechanisms to facilitate high-performance and energy-efficiency: 1) various fine-grained
partition and mapping strategies to balance workloads among FPGAs and 2) training of
CNNs is executed in a fine-grained inter- and intra-layer pipelined manner, which reduces
the time that features need for backward propagation and leads to a reduction in storage de-
mand to the point where only on-chip memory is required for CONV layers. Experiments
show that FPDeep has good scalability to a large number of FPGAs. The bottleneck is
inter-FPGA communication bandwidth. However, we find that with 250 Gb/s bidirectional
bandwidth per FPGA, which is easily supported by current generation FPGAs, FPDeep’s
performance shows linearity up to 100 FPGAs. For example, using AlexNet and the
VGGNets as benchmarks, with 6 transceivers per FPGA (e.g., using a 2014-era Altera
Stratix V), FPDeep shows linearity up to 83 FPGAs. We evaluate energy efficiency with respect
to GOPs/J and find that FPDeep provides 5.7× to 8.8× higher energy efficiency than GPU
servers. Furthermore, results also show that the system utilization of FPDeep is always
over 95% for FPGA clusters with 1 to 100 nodes. This demonstrates that FPDeep handles
model irregularity efficiently in DNN training. Although FPDeep is designed for large-
scale training, it is also generally useful for cluster-level inference.
We briefly discuss future work. One area is supporting more complex NN models.
Here, two additions are needed. First, while the current graph partitioning method supports
ResNet and Inception (as described in Section 7.3.2), RNN and other new models require
more complex graph structures. Second, support needs to be added for additional mod-
ules as described in Section 7.5.1. A second area is investigating benefits of hierarchical
communication networks as arise when the nodes are multi-FPGA boards. Finally, another
interesting question is use of off-chip memory. Currently, we only use off-chip memory
when we are processing the fully connected layer. In the case of small clusters and large
networks, where off-chip memory would be an option, we instead use the weight balancing
scheme described in Section 7.3.4. In the future, as HBM becomes widespread, for very
large networks it could make sense to use off-chip memory as an intermediate buffer to
store activations and parameters.
Chapter 8
CSB-RNN: A Faster-than-Realtime RNN
Acceleration Framework
In this chapter, we introduce CSB-RNN (Compressed Structured Block-based RNN), an
RNN acceleration framework. CSB-RNN is based on software-hardware co-design. On the
software side, RNN models are trained to follow relatively regular data patterns. This
regularization significantly lowers the hardware design complexity and incurs only
negligible accuracy loss. On the hardware side, a programmable architecture is proposed
to efficiently perform inference on the regularized RNN models. Unlike Chapters
3-7, which aim at accelerating irregular models with hardware-only solutions, this chapter
demonstrates that model regularization is also a valuable approach to NN acceleration in
some specific scenarios and applications. This chapter is based on the work presented at the
34th International Conference on Supercomputing (ICS) ©2020 ACM [Shi et al., 2020]1.
8.1 Introduction
RNNs have been widely adopted in real-world applications for their high accuracy in
temporal sequence analysis. However, the increasingly large model size and tremendous
computational workload of RNNs hamper their deployment on embedded (edge) devices,
which strictly require realtime processing with limited hardware resources. To address this
issue, structured weight pruning techniques [Mao et al., 2017, Narang et al., 2017, Wen
et al., 2018] have been proposed, which shrink the model size, reduce storage demand, and
provide higher potential hardware performance by structurally eliminating the close-to-zero
weights and the corresponding arithmetic operations in inference. However, to keep the
accuracy loss acceptable, the pruning rates attainable with existing structured pruning
schemes are far lower than the redundant operation rates (i.e., the ideal pruning rates), so
a large number of unnecessary operations are still performed during inference, leading
to performance overheads.
1Runbin Shi and Tong Geng contributed equally to this work.
This work addresses the abovementioned problem. We first propose a novel fine-grained
structured pruning technique (CSB pruning) that provides a compression rate and test
accuracy comparable to the ideal ones while keeping the model hardware friendly.
During the training phase, each weight matrix is divided into fine-grained blocks, and
structured pruning is conducted on every block independently. The pruned blocks are
encoded in a novel compressed structured block (CSB) sparse format for inference
acceleration, which significantly reduces the weight storage demand while retaining the
fine-grained content of the original model.
To realize realtime inference with parallel hardware, multiple challenges remain in
designing an architecture that can exploit the benefits of CSB pruning in a seamless
manner. In particular, the parallel architecture should handle massive fine-grained blocks
with imbalanced workloads (sparsity) but maintain a high hardware efficiency (utilization).
Meanwhile, the architecture should be programmable for various RNN cell types (e.g.,
LSTM [Hochreiter and Schmidhuber, 1997], GRU [Cho et al., 2014]). To address these
issues, we propose an architecture-compilation co-design to realize the best flexibility and
acceleration performance. A programmable RNN dataflow architecture is designed that
supports existing RNN cell types. In particular, the CSB-Engine in our architecture is
designed with a novel workload sharing technique. With one-shot compilation, the
workload is automatically balanced among the processing elements (PEs) in the CSB-Engine,
which improves the hardware efficiency to a near-theoretical value.
The major contributions are summarized as follows:
• We present CSB-RNN, an optimized full-stack RNN acceleration framework, which
facilitates running various types of RNNs with faster-than-realtime latency. CSB-
RNN includes three innovations: (1) an adaptive and fine-grained structured com-
pression technique, CSB pruning; (2) a programmable RNN dataflow architecture
equipped with CSB-Engine; (3) a compiler design with optimizations to achieve al-
most perfect workload balance.
• The proposed CSB pruning technique provides ultra-high (3.5×-25×) pruning rates
without any loss in accuracy. Furthermore, CSB pruning does not incur high-degree
computational irregularities, making highly efficient hardware acceleration possible.
• An architecture-compilation co-design is proposed to fully exploit the benefits
of CSB pruning and provide close-to-theoretical peak performance with automatic
workload balancing.
• With experiments on 10 RNN models from various application domains, CSB pruning
demonstrates a 3.5×-25× lossless pruning rate, which is 1.6× to 3.9× higher than
existing designs. With the proposed architecture-compilation co-design applied,
CSB-RNN delivers faster-than-realtime inference with a latency of 0.79µs-6.58µs
in an FPGA implementation. The proposed framework achieves 1.12×-12.6×
lower latency (with even fewer computation resources) and a 3.53×-58.9× improvement
in power efficiency over the state-of-the-art.
8.2 Background
8.2.1 Temporal Sequence Processing with RNN
Recurrent neural networks (RNNs) deliver high accuracy in temporal sequence
processing. A typical schematic of RNN computation is depicted in Fig. 8·1. Successive
frames (e.g., word, phoneme) from the temporal sequence (e.g., sentence, voice) are
embedded as input neuron-vectors ($x_t$) and then sent to RNN cells for inference computation,
where t represents the time point. The output neuron-vector ($h_t$) contains the inference results
[Figure 8·1 content. Input frames: e.g., words, audio, video. Output frames: e.g.,
translation, prediction. Successive cells are connected by a context link.]
Cell type LSTM:
$f_t = \sigma(W_{fx} x_t + W_{fh} h_{t-1} + b_f)$
$i_t = \sigma(W_{ix} x_t + W_{ih} h_{t-1} + b_i)$
$c_t = f_t \odot c_{t-1} + i_t \odot \theta(W_{cx} x_t + W_{ch} h_{t-1} + b_c)$
$o_t = \sigma(W_{ox} x_t + W_{oh} h_{t-1} + b_o)$
$h_t = o_t \odot \theta(c_t)$
Cell type GRU:
$z_t = \sigma(W_{iz} x_t + W_{hz} h_{t-1} + b_z)$
$r_t = \sigma(W_{ir} x_t + W_{hr} h_{t-1} + b_r)$
$\tilde{h}_t = \theta(W_{ig} x_t + W_{hg}(r_t \odot h_{t-1}))$
Figure 8·1: Computation flow of RNN inference. Note that there are mul-
tiple RNN cell types. The main workload is matrix-vector multiplication
(MVM).
Multiple RNN cell types exist; they differ in computational dataflow but share almost
the same arithmetic primitives. Fig. 8·1 lists the arithmetic of two widely-used
RNN cells, GRU [Cho et al., 2014] and LSTM [Hochreiter and Schmidhuber, 1997]. The
dominant workload is matrix-vector multiplication (MVM) between the weight matrices
and input/hidden neurons; the remaining workload consists of element-wise operations,
including the Sigmoid (σ) and Tanh (θ) activation functions, element-wise multiplication (⊙),
and addition. In particular, the RNN cell computation at time t uses the intermediate vector
$c_{t-1}$ and output vector $h_{t-1}$ from the previous time step. The data dependency results in a context
[Figure 8·2 panels: (a) non-structured pruning: high pruning rate, hardware unfriendly;
(b) row/column-wise (structured) pruning: low pruning rate, hardware friendly.]
Figure 8·2: CSB pruning takes advantage of both non-structured (random)
pruning (a) and coarse-grained structured (row/column) pruning (b).
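As an illustration of the workload split described in Section 8.2.1, the LSTM arithmetic of Fig. 8·1 can be sketched in plain Python. This is a minimal sketch and not the accelerator's implementation: the helper names (`mvm`, `lstm_step`, `zero_params`) and the dictionary layout of the parameters are assumptions made for this example. It shows that each gate is two MVMs followed by cheap element-wise operations.

```python
import math

def mvm(W, v):
    """Matrix-vector multiplication, the dominant RNN workload."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def tanh(v):
    return [math.tanh(a) for a in v]

def add(*vs):
    """Element-wise addition of any number of equal-length vectors."""
    return [sum(t) for t in zip(*vs)]

def mul(a, b):
    """Element-wise multiplication."""
    return [x * y for x, y in zip(a, b)]

def lstm_step(p, x_t, h_prev, c_prev):
    """One LSTM time step. p maps hypothetical keys such as 'Wfx', 'Wfh', 'bf'
    to weight matrices and bias vectors (naming chosen for this sketch)."""
    def gate(g, act):
        pre = add(mvm(p["W" + g + "x"], x_t), mvm(p["W" + g + "h"], h_prev), p["b" + g])
        return act(pre)
    f_t = gate("f", sigmoid)
    i_t = gate("i", sigmoid)
    o_t = gate("o", sigmoid)
    c_t = add(mul(f_t, c_prev), mul(i_t, gate("c", tanh)))
    h_t = mul(o_t, tanh(c_t))
    return h_t, c_t

def zero_params(hidden, inputs):
    """All-zero parameters, convenient for checking the dataflow by hand."""
    p = {}
    for g in "fico":
        p["W" + g + "x"] = [[0.0] * inputs for _ in range(hidden)]
        p["W" + g + "h"] = [[0.0] * hidden for _ in range(hidden)]
        p["b" + g] = [0.0] * hidden
    return p
```

With all-zero parameters every gate evaluates to 0.5, so `c_t` is half of `c_prev`, which makes the dependency on the previous time step easy to verify by hand.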
8.2.2 RNN Weight Pruning Techniques
Non-structured Pruning vs. Structured Pruning
Pruning has been proposed for deep learning models to remove redundant
(close-to-zero) weights and thus reduce the computation workload. Early non-structured
pruning [Han et al., 2015] achieves a high pruning rate; however, the random sparse model
(Fig. 8·2 (a)) brings a high degree of irregularity to the inference computation, which is
unfriendly to both modern parallel devices and hardware architecture design. Some
existing works [Han et al., 2017, Cao et al., 2019] address this issue by pruning the model
with region-balanced sparsity (between non-structured and structured sparsity), which
reduces the attainable pruning rate. As shown in Fig. 8·2 (b), structured pruning schemes [Wen
et al., 2016, Gao et al., 2018a] were proposed to be hardware friendly: an entire
row/column is removed as a whole in pruning. Although the pruned model maintains
regularity and can even be compacted to a dense matrix, the pruning rate with this scheme
is relatively low due to the coarse pruning granularity. Combining the advantages of both the
non-structured and coarse-grained structured pruning methods, the CSB pruning in this work is
a fine-grained structured method that not only achieves a high pruning rate but also makes
hardware acceleration possible.
Model Training with ADMM-based Pruning Technique
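The training process is to find a proper set of weight values that reach the minimal
classification loss. For an N-layer RNN, the training objective is
$\min_{\{W_i\},\{b_i\}} f(\{W_i\}_{i=1}^{N}, \{b_i\}_{i=1}^{N}) \quad \text{s.t.} \quad W_i \in S_i, \; i = 1, \dots, N$ (8.1)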
where function f represents inference loss on the given dataset, Si is the feasible set
of Wi, which is subject to the user constraints. In the regular RNN training, Si is R (i.e.,
no constraint), and thus the optimal weights (Wi) and bias (bi) for each layer can be ob-
tained by classical stochastic gradient descent (SGD) method [Bottou, 2010]. However,
once the weight pruning is conducted along with the training process, the constraint of
weight-sparsity represented by Si becomes combinatorial and no longer convex, which
prevents the Eqn. 8.1 from being solved by classical SGD. The advanced Alternating Di-
rection Method of Multipliers (ADMM) method [Boyd et al., 2011] is leveraged in our
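CSB pruning scheme. The ADMM separates the weight pruning (during training) problem
into two subproblems, which are solved iteratively until convergence. First, the problem is
reformulated as
$\min_{\{W_i\},\{b_i\}} f(\{W_i\}_{i=1}^{N}, \{b_i\}_{i=1}^{N}) + \sum_{i=1}^{N} g_i(Z_i) \quad \text{s.t.} \quad W_i = Z_i, \; i = 1, \dots, N$ (8.2)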
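where $Z_i$ is an auxiliary variable for subproblem decomposition, and the indicator
function $g_i(\cdot)$ is defined as
$g_i(W_i) = \begin{cases} 0 & \text{if } W_i \in S_i, \\ +\infty & \text{otherwise.} \end{cases}$ (8.3)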
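Then Eqn. 8.2 can be decomposed into two subproblems, listed in Eqn. 8.4 and
Eqn. 8.5:
$\min_{\{W_i\},\{b_i\}} f(\{W_i\}_{i=1}^{N}, \{b_i\}_{i=1}^{N}) + \sum_{i=1}^{N} \frac{\rho_i}{2} \| W_i - Z_i^{t} + U_i^{t} \|_F^2$ (8.4)
$\min_{\{Z_i\}} \sum_{i=1}^{N} g_i(Z_i) + \sum_{i=1}^{N} \frac{\rho_i}{2} \| W_i^{t+1} - Z_i + U_i^{t} \|_F^2$ (8.5)
where t denotes the iteration index in the ADMM process, $\rho_i$ is the penalty parameter,
and $U_i$ is the dual variable that is updated in each iteration through
$U_i^{t} = U_i^{t-1} + W_i^{t} - Z_i^{t}$. Following the ADMM process, the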
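two subproblems are iteratively solved till convergence. The first subproblem (Eqn. 8.4) is
solved by classical SGD, and the solution of the second subproblem (Eqn. 8.5) is
$Z_i^{t+1} = \mathrm{proj}_{S_i}(W_i^{t+1} + U_i^{t})$ (8.6)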
where proj is the Euclidean projection onto constraint set Si, which guarantees the
weight matrices exhibit the specific sparse pattern defined in the constraint Si for each
layer. In this work, we propose a new type of structured sparse matrix with the novel CSB
sparse format, which is the target pattern (Si) in our RNN weight pruning method. The
[Figure 8·3 panels: (a) Weight Matrix Partition; (b) Sparsify the Block in Row/Column-wise;
(c) CSB Format. Example arrays shown in (c): Val = {1, 2, 3, 4, 1, 2, 3, 4, 5, 6, 7, 8, 9, …};
RowIdx = {2, 4, 1, 2, 4, …}; ColIdx = {1, 4, 0, 2, 4, …}; n = {2, 3, 3, 2, 2, 2} for 6 blocks.]
Figure 8·3: A novel structured sparse matrix (CSB) with its dedicated stor-
age format, which benefits both the pruning flexibility and hardware paral-
lelism.
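The alternating structure of ADMM-based pruning in Section 8.2.2 (weight update, projection onto the sparse pattern, dual update) can be sketched on a toy problem. The sketch below rests on several assumptions that are not from the original: the loss is a simple quadratic 0.5·||W − target||² so that the first subproblem has a closed form (the chapter uses SGD on the RNN loss), the constraint set is "at most k non-zeros" (the chapter uses the CSB pattern), and `rho` is an illustrative penalty parameter.

```python
def project_top_k(v, k):
    """Euclidean projection onto vectors with at most k non-zeros:
    keep the k largest-magnitude entries and zero the rest."""
    keep = set(sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)[:k])
    return [v[i] if i in keep else 0.0 for i in range(len(v))]

def admm_prune(target, k, rho=1.0, iters=20):
    """Toy ADMM pruning of a single weight vector (assumed quadratic loss)."""
    W = list(target)            # weights being trained
    Z = list(target)            # auxiliary variable that carries the sparse pattern
    U = [0.0] * len(target)     # dual variable (accumulated difference)
    for _ in range(iters):
        # 1st subproblem: minimize 0.5*(w - t)^2 + rho/2*(w - z + u)^2 per element.
        W = [(t + rho * (z - u)) / (1.0 + rho) for t, z, u in zip(target, Z, U)]
        # 2nd subproblem: project W + U onto the constraint set.
        Z = project_top_k([w + u for w, u in zip(W, U)], k)
        # Dual update.
        U = [u + w - z for u, w, z in zip(U, W, Z)]
    return W, Z

W, Z = admm_prune([3.0, -0.1, 2.0, 0.2], k=2)
```

In this toy case the trained weights `W` and the projected weights `Z` agree after a few iterations, which is the convergence condition W = Z of the reformulated problem.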
8.3 CSB Pruning Technique
8.3.1 A Novel Structured Sparse Weight Format
Definition of CSB
We propose the compressed structured block (CSB), a novel structured sparse matrix for
model pruning that benefits both pruning flexibility and hardware parallelism in
inference acceleration. Fig. 8·3 illustrates the CSB matrix and its dedicated storage
format. As shown in Fig. 8·3(a), we consider the CSB-structured matrix (with a size of W ×H)
to be composed of multiple blocks of size M×N. Each block is sparsified row/column-wise,
as shown in Fig. 8·3(b): certain rows/columns are set to zero as a whole. As a result,
the non-zero elements are located only at the cross-points of the un-sparsified
rows/columns. A significant benefit of this structured sparsity is that the non-zero elements in
each block compose a dense kernel matrix, which provides a higher potential for parallel
hardware acceleration than random sparsity. Corresponding to this particular sparsity, a new
sparse matrix format is developed for efficient storage and computation. As shown in Fig. 8·3(c),
the CSB format contains five arrays in three groups: (i) arrays n{} and m{} are the row and
column counts of the kernel matrix in each block; (ii) arrays RowIdx{} and ColIdx{} store
the indices of the un-sparsified (non-zero) rows and columns, respectively; note that the index
count for each block equals the corresponding value in n{} or m{}; (iii) the non-zero
values in successive blocks (row-major order) are concatenated and stored contiguously in the
array Val{}. Because the inference computation accesses the sparse blocks sequentially,
the offset for arbitrary access is omitted in the CSB format.
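The five-array layout described above can be made concrete with a short Python sketch. It is an illustration under assumptions, not the framework's encoder: the function names `csb_encode` and `csb_mvm`, the dictionary container, and the use of block-local 0-based indices are choices made for this example. The sketch also shows why no per-block offsets are needed, since the MVM walks the blocks sequentially and advances three running pointers.

```python
def csb_encode(W, M, N):
    """Encode a block-structured sparse matrix W (list of rows) with M x N blocks."""
    n, m, row_idx, col_idx, val = [], [], [], [], []
    for i0 in range(0, len(W), M):            # blocks in row-major order
        for j0 in range(0, len(W[0]), N):
            block = [row[j0:j0 + N] for row in W[i0:i0 + M]]
            rows = [r for r, row in enumerate(block) if any(row)]
            cols = [c for c in range(len(block[0])) if any(row[c] for row in block)]
            n.append(len(rows))               # kernel-matrix row count
            m.append(len(cols))               # kernel-matrix column count
            row_idx.extend(rows)              # block-local indices of kept rows
            col_idx.extend(cols)              # block-local indices of kept columns
            val.extend(block[r][c] for r in rows for c in cols)
    return {"n": n, "m": m, "RowIdx": row_idx, "ColIdx": col_idx, "Val": val}

def csb_mvm(csb, x, rows, cols, M, N):
    """Compute y = W x directly from the CSB arrays by visiting blocks in order."""
    y = [0.0] * rows
    b = r_ptr = c_ptr = v_ptr = 0             # running pointers, no stored offsets
    for i0 in range(0, rows, M):
        for j0 in range(0, cols, N):
            nr, nc = csb["n"][b], csb["m"][b]
            r_ids = csb["RowIdx"][r_ptr:r_ptr + nr]
            c_ids = csb["ColIdx"][c_ptr:c_ptr + nc]
            for a, r in enumerate(r_ids):
                for k, c in enumerate(c_ids):
                    y[i0 + r] += csb["Val"][v_ptr + a * nc + k] * x[j0 + c]
            b += 1
            r_ptr += nr
            c_ptr += nc
            v_ptr += nr * nc
    return y

W = [[1, 0, 0, 0],
     [2, 0, 0, 3],
     [0, 0, 4, 5],
     [0, 0, 6, 7]]
csb = csb_encode(W, 2, 2)
y = csb_mvm(csb, [1, 2, 3, 4], 4, 4, 2, 2)
```

For this 4×4 example with 2×2 blocks, the third block is entirely pruned and contributes a zero to both `n` and `m`, so it costs no storage in `Val` and no multiplications.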
Advantages and Challenges of Pruning with CSB
We adopt the CSB structured sparsity in pruning the RNN models, which combines the
advantages of both the non-structured pruning and coarse-grained structured pruning
in Fig. 8·2. On one hand, CSB provides adequate pruning flexibility, because each block is
pruned independently and the pruning rate varies among blocks, which helps to preserve the
weights with important information. Physically, each element in the weight matrix
represents the synapse (connection) between an input neuron (matrix column) and an output
neuron (matrix row). The pruning process zeroes the synapses between two neurons without a
strong connection. CSB pruning automatically groups the strongly-connected neurons
into blocks with high density while leaving the weakly-connected ones in the low-density
blocks. Further, the pruning granularity is adjustable by changing the block size, so
that different weight matrices in an RNN model can be pruned with various granularities. The
above flexibilities enable a high pruning rate while maintaining the model accuracy. On the
other hand, the un-pruned weight values in each block compose a dense kernel matrix that
makes the inference computation friendly to parallel hardware. Nevertheless, the blocks
may have different-sized kernel matrices, which results in a workload imbalance issue when
mapping the computation of blocks to parallel hardware. This chapter addresses this
issue with an architecture-compilation co-design in Section 8.4 and Section 8.5.
8.3.2 CSB Pruning Flow with ADMM
With the ADMM-based pruning technique in Section 8.2.2, the weight matrices can be
pruned to an arbitrary sparse pattern by defining the constraint S and applying the
pattern-specific projection in Eqn. 8.6. To obtain an RNN model with the CSB pattern, we develop
the CSB pruning algorithm following the ADMM principle. Further, the maximum pruning
rate under the lossless constraint is automatically achieved via progressive pruning. The
entire CSB pruning flow is presented in Algorithm 1 with annotations.
Initially, the baseline model (with dense weight matrix W) is obtained via classical SGD
training and input to the flow. Note that the bias vector (b) is omitted as the CSB pruning
flow does not touch it. The lossless accuracy (accu) is given as the constraint of the
progressive pruning. Two input parameters, the initial pruning rate (initPR) and the initial
step of pruning rate (initPRStep), are set for tuning the pruning rate in the progressive flow.
We approach the maximum lossless pruning rate by progressively increasing the pruning
rate. Therefore, we set initPR to a small value (e.g., 4×) as the starting point, which surely
meets the lossless constraint. The variables PruneRate and StepPruneRate are
initialized to initPR and initPRStep, respectively, at the beginning. In each progressive iteration,
the flow performs re-training and pruning on the model for multiple epochs (e.g., 100
in Algorithm 1) to obtain the CSB-formatted weight matrix (Z) in the ADMM-pruning
fashion. In each epoch, two subproblems are alternately solved following the principle
of the ADMM-pruning technique in Section 8.2.2. The function SGDTrain updates the
weights with classical SGD (1st subproblem, Eqn. 8.4), and the subsequent process prunes
the weight matrix and projects it to the CSB-constrained set (2nd subproblem, Eqn. 8.5). The
process in Algorithm 1 details the projection corresponding to the general representation in
Algorithm 1: Auto Lossless CSB Pruning with ADMM
input : unpruned RNN model W; lossless accuracy accu;
        block size in CSB M×N; weight matrix size W ×H;
        initial pruning rate initPR;
        initial step of pruning rate initPRStep
output: maximally compressed model with CSB pruning Z
// Initialization.
U = 0; Z = W; W∗ = W; Flag = False
PruneRate = initPR; StepPruneRate = initPRStep
// Progressive iteration.
repeat
    foreach t ∈ [0, 100) do // Re-train and Pruning Epoch.
        // Solve Eqn. 8.4 in ADMM (1st subproblem)
        W∗ = SGDTrain(W∗, U, Z)
        // Solve Eqn. 8.5 in ADMM (2nd subproblem)
        // Project weight matrix to CSB pattern S
        Zi,j = Partition(W∗ + U), i ∈ [0, W/M), j ∈ [0, H/N)
        foreach j ∈ [0, H/N) do
            Z:,j = RowPrune(Z:,j, 1 − √(1 − PruneRate))
        foreach i ∈ [0, W/M) do
            Zi,: = ColumnPrune(Zi,:, 1 − √(1 − PruneRate))
        U = U + W∗ − Z // Update U
    // Set progressive pruning rate.
    if Eval(Z) ≥ accu then
        PruneRate = PruneRate + StepPruneRate
    else
        StepPruneRate = StepPruneRate / 2
        PruneRate = PruneRate − StepPruneRate
until StepPruneRate ≤ 1/4 initPRStep & Eval(Z) ≥ accu;
Eqn. 8.6. First, the weight from SGDTrain is partitioned into multiple blocks Zi,j following
the CSB method in Section 8.3.1. Then the RowPrune process is applied to each
block-column independently. Specifically, for each block-column, the ℓ2-norm (accumulating the
squares of all elements) of each row (with the size of M) is obtained; then, row-wise
pruning is conducted according to the ℓ2-norm values. Subsequently, ColumnPrune is
applied to each block-row in the same manner as RowPrune. Note that the pruning rate
in both RowPrune and ColumnPrune is 1 − √(1 − PruneRate): a fraction √(1 − PruneRate)
of the rows and of the columns is kept, so the kept weight fraction is their product,
1 − PruneRate, which yields the target PruneRate after the combined processes. Once the
CSB-formatted weight matrix Z is obtained, it is sent to SGDTrain of the next epoch, along
with the un-pruned weight matrix W∗ and the accumulated difference matrix U. After multiple
epochs, the weight Z will eventually meet the CSB pattern constraints and achieve good accuracy.
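The projection step just described (RowPrune on every block-column, then ColumnPrune on every block-row) can be sketched as follows. This is a minimal sketch under assumptions: `csb_project` is a hypothetical name, `prune_rate` is taken as the fraction of weights to remove (so 0.75 corresponds to a 4× rate), the number of pruned segments is obtained by rounding, and ties in the norm ordering are broken by index.

```python
import math

def csb_project(W, M, N, prune_rate):
    """Project W (list of rows) onto a CSB pattern with M x N blocks."""
    rows, cols = len(W), len(W[0])
    frac = 1.0 - math.sqrt(1.0 - prune_rate)      # per-pass pruning fraction
    Z = [[float(v) for v in row] for row in W]
    # RowPrune: within each block-column, zero the row segments with smallest l2-norm.
    for j0 in range(0, cols, N):
        span = range(j0, min(j0 + N, cols))
        order = sorted(range(rows), key=lambda r: sum(Z[r][c] ** 2 for c in span))
        for r in order[:round(frac * rows)]:
            for c in span:
                Z[r][c] = 0.0
    # ColumnPrune: within each block-row, zero the column segments with smallest l2-norm.
    for i0 in range(0, rows, M):
        span = range(i0, min(i0 + M, rows))
        order = sorted(range(cols), key=lambda c: sum(Z[r][c] ** 2 for r in span))
        for c in order[:round(frac * cols)]:
            for r in span:
                Z[r][c] = 0.0
    return Z

W = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
Z = csb_project(W, 2, 2, 0.75)
kept = sum(1 for row in Z for v in row if v != 0.0)
```

In this example half of the row segments and half of the column segments are removed, so one quarter of the weights survive, matching the target rate of 0.75.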
After each progressive iteration, the CSB pruned model is evaluated (Eval(Z)) and
compared to the lossless accu. PruneRate is increased by StepPruneRate in the next
iteration if accu is achieved. Once Eval(Z) < accu, the model is over-pruned, and the
optimum pruning rate lies between the PruneRate values of the two neighboring iterations.
Therefore, we reduce StepPruneRate by half and reduce PruneRate by this new step
to further approach the optimum point. The progressive CSB pruning flow terminates when
the pruning rate reaches a target precision. For instance, as in the last line of Algorithm 1, the
flow terminates when the pruning rate precision (StepPruneRate) ≤ 1/4 initPRStep.
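The progressive search over the pruning rate can be sketched independently of the training itself. In the sketch below, `is_lossless` is a hypothetical callback that stands in for the 100 re-train/prune epochs followed by the check Eval(Z) ≥ accu; the function name, the `max_iters` guard, and the example threshold are assumptions of this illustration, and the control flow follows the prose description above rather than any detail of the original algorithm listing.

```python
def progressive_prune_rate(is_lossless, init_pr, init_step, max_iters=100):
    """Search for the largest lossless pruning rate by step halving."""
    rate, step = init_pr, init_step
    for _ in range(max_iters):
        ok = is_lossless(rate)
        # Stop once the step is fine enough and the current model is lossless.
        if ok and step <= init_step / 4:
            return rate
        if ok:
            rate += step          # accuracy met: prune more aggressively
        else:
            step /= 2             # over-pruned: halve the step and back off
            rate -= step
    raise RuntimeError("no lossless pruning rate found within max_iters")

# Hypothetical model that stays lossless up to a pruning fraction of 0.86.
best = progressive_prune_rate(lambda r: r <= 0.86, init_pr=0.5, init_step=0.2)
```

With these illustrative inputs the search visits 0.5, 0.7, 0.9, 0.8, 0.9 and settles near 0.85, within one final step of the assumed limit.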
8.4 Unified Architecture for CSB-RNN
8.4.1 Overview of Acceleration Framework
An overview of the CSB-RNN acceleration framework is illustrated in Fig. 8·4. Although
the CSB pruning (STEP1) shrinks the model size and therefore reduces the computation
in inference, parallel hardware acceleration is still in demand to achieve realtime perfor-
mance. The challenges in accelerating CSB-RNN are two-fold. First, the architecture
should be adaptive to various RNN cell types, i.e., LSTM, GRU, etc. Second, the kernel
matrix in fine-grained blocks may not provide enough inner-block parallelism for large-
[Figure 8·4 annotations: high PE efficiency; super real-time.]
Figure 8·4: Overview of CSB RNN acceleration framework, including (i)
CSB pruning algorithm, (ii) unified RNN dataflow architecture, (iii) work-
load compilation with CSB pruned model.
leveraged. However, the pruned blocks may have different sparsities, leading to a
workload imbalance issue for inter-block parallelism, which usually causes low utilization of
processing elements (PEs). To address these challenges, CSB-RNN adopts an
architecture-compilation co-design. In the architecture aspect (STEP2), we propose a unified RNN
dataflow architecture that is programmable for different RNN cell types (Section 8.4.2);
in particular, a novel CSB-Engine is designed with support for workload sharing and
is integrated into the CSB-RNN architecture to address the workload imbalance issue (Section
8.4.3). In the compilation aspect (STEP3), we define control instructions for the hardware
and propose compilation algorithms that configure the computation of the particular RNN
type and schedule balanced workloads (Section 8.5).
8.4.2 Programmable RNN Dataflow Architecture
To generalize the architecture for different RNN cell types, we investigated the existing
RNN cells and extracted the arithmetic primitives, which compose the RNN computation




















Figure 8·5: RNN dataflow architecture. Operation units serve the RNN
arithmetic primitives; The programmable datapaths construct the proper
dataflow for target RNN cell via instructions.
each operation unit serves the corresponding arithmetic primitive. In particular, the
CSB-Engine computes the main workload, MVM, with the weight matrices after CSB pruning
(CSB-MVM). The units × and + perform element-wise multiplication and addition. δ and θ
compute the activation functions Sigmoid and Tanh, respectively. The datapaths (arrows in
Fig. 8·5) interconnect the operation units and on-chip buffers; they transmit the
intermediate results and compose the dataflow graph for RNN cell computation. Importantly, the
RNN dataflow architecture provides programmable datapaths (red arrows in Fig. 8·5). Thus,
the proper operation units can be interconnected by programming control instructions for a
particular RNN cell type.
8.4.3 CSB-Engine
The CSB pruning scheme greatly shrinks the weight matrix size and therefore reduces
the main workload in inference. Although the fine-grained structure of CSB contributes
to the regularity and makes efficient hardware acceleration possible, it is still challenging
to design a parallel architecture that can fully exploit the benefits of CSB pruning. The
challenges in an efficient CSB-Engine design are two-fold. First, both the inner-block and
inter-block parallelism should be fully exploited, as the regular inner-block computation
provides very limited concurrency with a small block size. Second, an inter-block workload
imbalance issue exists because the sparsity varies among blocks. The following subsections
address these two challenges.
Hierarchical Design for Inner- and Inter-Block Parallelism
As illustrated in Fig. 8·6, the CSB-Engine design is a two-level hierarchy, with a processing
element (PE) level and a PEGroup level. The hardware instances in each level are organized
in a 2D fashion: the architecture is composed of K×L PEGroups, and each PEGroup
contains P×Q PEs. The parallel PEs inside one PEGroup process inner-block multiplication
concurrently, while the PEGroups compute different blocks in parallel (inter-block
parallelism).
Inside each PEGroup, because the size of the CSB kernel matrix (m× n) might be larger
than that of the PE array (P×Q), multi-pass processing is required to handle the entire block.
Thus, each PEGroup contains a NeuronAccumBuffer, which stores the partial results and
adds to them the accumulated output of the horizontal PEs in each pass. The input neurons
required by the current block are preloaded to the BlockNeuronBuffer and broadcast to the PE
array. Each PE column shares the same input neuron, as the unpruned weights are
vertically aligned in the structured block with CSB pruning. Importantly, the WeightBuffer
provides the CSB-formatted weight (Fig. 8·3), including the weight values (kernel matrix)
for the PEs, the column index for the BlockNeuronBuffer to read the proper input neuron, the
row index for accumulating the multiplication results to the proper address in the
NeuronAccumBuffer, and the kernel matrix size (m× n) for the PEGroup control logic,
which determines the proper pass count along both axes.
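The multi-pass processing inside one PEGroup can be sketched as a tiled loop over the dense kernel matrix. This is an illustrative model only: the name `pegroup_block_mvm` is hypothetical, PE rows are assumed to map to kernel rows and PE columns to kernel columns, and the list `accum` merely stands in for the NeuronAccumBuffer.

```python
def pegroup_block_mvm(kernel, x_sel, P, Q):
    """Multiply a dense kernel matrix by the selected input neurons using a
    P x Q PE array, counting the passes needed to cover the whole kernel."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    accum = [0.0] * k_rows                    # models NeuronAccumBuffer
    passes = 0
    for r0 in range(0, k_rows, P):            # vertical passes
        for c0 in range(0, k_cols, Q):        # horizontal passes
            passes += 1
            for r in range(r0, min(r0 + P, k_rows)):
                # PEs in one row multiply in parallel; their products are summed
                # and added to the partial result from earlier passes.
                accum[r] += sum(kernel[r][c] * x_sel[c]
                                for c in range(c0, min(c0 + Q, k_cols)))
    return accum, passes

result, passes = pegroup_block_mvm([[1, 2, 3], [4, 5, 6], [7, 8, 9]], [1, 1, 1], 2, 2)
```

A 3×3 kernel on a 2×2 PE array needs two passes along each axis, so four passes in total, and the last pass in each direction leaves some PEs idle.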
In the higher-level of the design hierarchy, the PEGroups process blocks in the row-
major order. The PEGroups in one column concurrently compute the vertical blocks.




















































Figure 8·6: Two-level hierarchical organization of CSB-Engine for the
main workload (CSB-MVM) computation.
while multi-ports are provided on BlockNeuronBuffer for concurrent access. Similarly,
the blocks along the horizontal axis are mapped to PEGroups in the same row, with multi-pass
processing. After the computation of each block-row, the results in the NeuronAccumBuffers are
accumulated horizontally and output to ReorderLogic to obtain the output neuron vector.
Inter-PEGroup Workload Sharing
Workload Imbalance Challenge: The blocks in a CSB pruned model may have different-
sized kernel matrices, and the resultant inter-block workload imbalance makes it challenging
to exploit the inter-block parallelism in hardware. As Fig. 8·7(b) demonstrates, with
the straightforward design, the workload imbalance issue results in low utilization of
PEGroups. The presented MVM workloads are allocated to 2× 2 PEGroups, each of which
contains 4 PEs. During the execution, PEGroup1-3 enter the idle state before the PEGroup4

































Figure 8·7: Inter-block workload imbalance issue occurs when mapping the
CSB pruned matrix (a) to the vanilla (basic) CSB-Engine (b), which results
in a low hardware utilization. We propose the workload sharing technique
that significantly increases the utilization and reduces the time consumption,
as demonstrated in (c).
ware. In fact, imbalanced sparsity naturally exists in RNN models. However,
existing works [Han et al., 2017, Cao et al., 2019] relieve the hardware pain by forcing the
model to be pruned with a region-balanced sparsity. As a result, neglecting the natural
sparsity imbalance significantly harms the pruning ratio and model accuracy. By contrast,
we handle this issue by improving the architecture with the workload sharing technique.
Inter-PEGroup Workload Sharing: The concept of workload sharing is illustrated in
Fig. 8·7(c). Each PEGroup processes not only its originally allocated block but also a
partition of the block from a neighboring PEGroup that is assigned a heavier workload.
In the hardware aspect, as shown in Fig. 8·7(c), dedicated workload sharing paths (red arrows)
are set for the inter-PEGroup data transmission, and the interconnection adopts a torus
topology in both dimensions. With the hardware support of workload sharing, PEGroup4
migrates its extra workload to PEGroup2 and PEGroup3, and PEGroup2 migrates a
partition of the Block2 workload to PEGroup1. This significantly balances the workload and
improves the utilization. Considerations in the workload sharing design are two-fold: (i)
the input neurons should be sharable between the PEGroups; (ii) the output neuron
accumulation should be performed across PEGroups. We discuss these issues and our strategies
in two cases, in which the workload is shared between horizontally or vertically
neighboring PEGroups, respectively. For the horizontal sharing case, an extra data port is
set on the BlockNeuronBuffer to solve issue (i), which enables the PEGroup to
access input neurons from the horizontally neighboring PEGroup. Issue (ii) is naturally
solved by the hierarchical CSB-Engine design, as the PEGroup can store the partial results
of the shared workload partition in its local NeuronAccumBuffer, which will be
accumulated horizontally after processing the entire block-row. For the vertical sharing case,
the PEGroup-column shares the same BlockNeuronBuffer; thus issue (i) is naturally
solved by hardware. For issue (ii), the PEGroup should accumulate the vertically
shared workload to its original PEGroup, as the vertical PEGroups compute different
block-rows that cannot be accumulated in a mixed manner. However, concurrent accumulation to
one address in the NeuronAccumBuffer leads to a read-after-write (RAW) data hazard. To
address this issue, an accumulation path is set between vertical PEGroups and connected to
the adder, which accepts parallel results from neighboring PEGroups, sums them, and stores
the result to the NeuronAccumBuffer in one shot. With the hardware support for workload
sharing, we propose the compilation scheme in the next section, which schedules the partition
and sharing by analyzing the CSB pruned matrix and generates the instructions to control the
hardware-sharing behavior.
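The effect of workload imbalance on PEGroup utilization can be quantified with a small calculation. The sketch below is an assumption-laden illustration: the per-block workloads are hypothetical multiply-accumulate counts (not the values in Fig. 8·7), every PE is assumed to finish one multiply-accumulate per cycle, and `shared_bound` gives only an idealized bound in which work can move freely, whereas the actual design shares work between neighboring PEGroups only.

```python
import math

def utilization(workloads, pes_per_group):
    """Utilization when each PEGroup processes only its own block and all
    groups wait for the slowest one."""
    cycles = [math.ceil(w / pes_per_group) for w in workloads]
    makespan = max(cycles)
    busy = sum(workloads)
    return busy / (len(workloads) * pes_per_group * makespan), makespan

def shared_bound(workloads, pes_per_group):
    """Idealized bound with perfectly divisible, freely movable work."""
    total_pes = len(workloads) * pes_per_group
    makespan = math.ceil(sum(workloads) / total_pes)
    return sum(workloads) / (total_pes * makespan), makespan

blocks = [4, 8, 8, 28]                  # hypothetical workloads of four blocks
util_plain, t_plain = utilization(blocks, 4)
util_shared, t_shared = shared_bound(blocks, 4)
```

For these illustrative numbers the unshared mapping takes 7 cycles at roughly 43% utilization, while the idealized shared mapping takes 3 cycles, which indicates the headroom that workload sharing targets.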
8.5 Compilation for CSB Pruned Model
The proposed RNN dataflow architecture is controlled by pre-compiled instructions.
The instruction set includes macro-instructions and micro-instructions: the former
configure the operation units (in Fig. 8·5) for the proper RNN dataflow (cell type),
and the latter instruct the CSB-Engine on the inter-PEGroup workload sharing behavior
described in Section 8.4.3. Correspondingly, the compilation is composed of two
phases, RNN dataflow compilation (Section 8.5.1) and workload sharing scheduling
(Section 8.5.2).
8.5.1 RNN Cell to Dataflow Architecture
Macro-Instruction Set
We define the macro-instruction set for our RNN dataflow architecture (Section 8.4.2). As
shown in Fig. 8·8, the macro-instruction is composed of multiple sections, each of which
provides control signals for the corresponding RNN primitive hardware. All sections are
concatenated to form a very long instruction word (VLIW) item. Note that each section
contains a Count operand to indicate the size of the workload for the corresponding hardware
primitive. Thus, one VLIW instruction is not regarded as accomplished until all hardware
primitives finish their workloads. The operands in each instruction section are classified
into two types: the Count type controls the hardware iteration count, and the other operands
indicate the proper data source or destination for each primitive. For the first type, the
value of Count in the element-wise operation units (all units except the CSB-Engine) is
measured in data elements, as these units perform element-wise operations. In contrast, the
CountH/V in the CSB-Engine section represents the horizontal/vertical block iteration counts
over the entire CSB-Engine in processing a particular weight matrix. For the second operand
type, Addr(Memory) and Addr(Buffer) give the access addresses of the external memory
(normally DRAM) and the built-in buffers in the architecture, respectively. Importantly, the
programmable datapaths in the architecture (Fig. 8·5) are indexed, and the DataFlowIdx
operand is set to indicate the proper data source or destination for the hardware primitive.
With the above settings, RNN models with various cell types can be translated to several
VLIW instructions that are repetitively executed during RNN inference.
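As a minimal sketch of the instruction layout described above, a VLIW item can be modeled as a set of per-primitive sections, each carrying its own Count. The Python class names, the primitive names (csb_engine, ew_add), the operand strings, and the progress dictionary are illustrative assumptions; they do not reflect the actual bit-level encoding of the macro-instruction.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Section:
    """Control fields for one hardware primitive (hypothetical layout)."""
    count: int                  # Count: elements, or block iterations for the CSB-Engine
    src: Optional[str] = None   # e.g. an Addr(Buffer) or DataFlowIdx label
    dst: Optional[str] = None


@dataclass
class MacroInstruction:
    """One VLIW item: the concatenation of all primitive sections."""
    sections: Dict[str, Section] = field(default_factory=dict)

    def is_done(self, progress: Dict[str, int]) -> bool:
        # The item is accomplished only when every primitive reached its Count.
        return all(progress.get(name, 0) >= sec.count
                   for name, sec in self.sections.items())


# Hypothetical two-section item: one CSB-Engine section, one element-wise adder.
inst = MacroInstruction({
    "csb_engine": Section(count=4, src="Addr(Memory)", dst="DataFlowIdx 0"),
    "ew_add": Section(count=256, src="DataFlowIdx 0", dst="Addr(BufferC)"),
})
```

The completion check mirrors the rule that a VLIW instruction finishes only after its slowest primitive does, which is why the compiler tries to keep the sections of one item comparably loaded.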
Macro-Instruction Compilation
The objective of compilation is to minimize the VLIW instruction count, which maximizes
the utilization of the operation units. We invoke the list scheduling method [Lam, 1988] that

Figure 8·8: Macro-instruction set (VLIW-like) for RNN dataflow architecture.
Operand fields of the instruction sections (the Primitive column did not survive extraction):
SrcOp1 | SrcOp2 | SrcOp3 | Dst
Addr(BufferB) | Addr(BufferBias) | Count | -
- | - | Count | Addr(BufferE)
- | - | Count | DataFlowIdx, Addr(BufferC)
DataFlowIdx, Addr(BufferC) | - | Count | -
DataFlowIdx, Addr(Buffer) | StreamIdx | Count | DataFlowIdx, Addr(BufferD)
Addr(BufferC) | Addr(BufferE) | Count | -

Figure 8·9: Micro-instruction indicates the kernel matrix workload and the
scheduling of partition for workload balancing.
translated to the directed acyclic graph (DAG), in which the nodes represent the arithmetic
primitives and the edges are data dependencies. In the list scheduling, we adopt the as soon
as possible (ASAP) strategy, in which the operation nodes are mapped to the corresponding
hardware resources once the resource is available and the dependency edge is ready. With
the proper operation units and interconnection in the RNN dataflow architecture, the
macro-instruction compilation can quickly reach an optimum point, at which the processing
throughput is bounded by the main workload (CSB-MVM) on the CSB-Engine.
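The ASAP list scheduling described above can be sketched as follows. This is a simplified illustration, assuming that each operation unit accepts one DAG node per VLIW item and that a node becomes ready only after all of its predecessors were issued in earlier items; the node and unit names are hypothetical, and the real compiler also has to fill in the operands of each section.

```python
def asap_schedule(nodes, deps):
    """ASAP list scheduling of a DAG onto named operation units.

    nodes: {node_name: unit_name}
    deps:  {node_name: set of predecessor node names}
    Returns a list of VLIW items, each a {unit_name: node_name} mapping.
    """
    done, items = set(), []
    remaining = dict(nodes)
    while remaining:
        item = {}
        for name in sorted(remaining):
            unit = remaining[name]
            # Issue as soon as the unit is free and all inputs are ready.
            if unit not in item and deps.get(name, set()) <= done:
                item[unit] = name
        if not item:
            raise ValueError("dependency cycle: no node can be scheduled")
        for name in item.values():
            del remaining[name]
        done.update(item.values())
        items.append(item)
    return items


# Hypothetical fragment of a cell: two MVMs feed an adder, then an activation.
nodes = {"mvm_x": "csb", "mvm_h": "csb", "add": "ew_add", "sigmoid": "act"}
deps = {"add": {"mvm_x", "mvm_h"}, "sigmoid": {"add"}}
schedule = asap_schedule(nodes, deps)
```

In this toy case the two MVM nodes compete for the single CSB-Engine, so the schedule length is set by that unit, which matches the observation that throughput is bounded by the CSB-MVM workload.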
8.5.2 Workload Scheduling on CSB-Engine
Micro-Instruction Set
The micro-instructions are generated for each PEGroup individually and control the
CSB-MVM operations on the CSB-Engine. Specifically, the micro-instruction contains
the CSB-compression information (i.e., kernel matrix size, row and column indices in
Fig. 8·3(c)) for the block workload allocated to a certain PEGroup. In particular, since the
kernel matrix workload is partitioned into three submatrices and shared with neighboring
PEGroups (as in Fig. 8·9(a)), the micro-instructions for one block iteration include three
items: (i) the local workload that is originally allocated, excluding the portion shared with
other PEGroups; (ii) the workload shared from the horizontally neighboring PEGroup; (iii) the
workload shared from the vertically neighboring PEGroup. The micro-instruction contains
4 operands, as shown in Fig. 8·9(b). The operand Sharing gives a flag (local/horizontal/vertical)
to indicate the data source, where local means that the input and output neurons are in the
local PEGroup; horizontal (sharing) indicates that the input neurons should be read from the
BlockNeuronBuffer of the left PEGroup; vertical (sharing) means that the output should be
accumulated to the NeuronAccumBuffer of the upper PEGroup. The operand TripCount gives
the size of the workload. Note that, for each block, the kernel matrix is divided into three
regular partitions as in Fig. 8·9(a), for local (no-sharing), vertical-sharing, and
horizontal-sharing, respectively. The sizes of the partitioned matrices are denoted as
m′×n′, ∆mv×∆nv, and ∆mh×∆nh, which are turned into the TripCount values in the three
micro-instruction items. The operands RowIdx and ColIdx provide the non-zero row and column
indices of each submatrix. Note that each micro-instruction item may contain multiple RowIdx
and ColIdx entries, corresponding to the TripCount value. Further, these two operands are
stored in individual instruction memories that are read and reused periodically in the
submatrix computation.
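The four operands can be pictured with the small sketch below, which builds the three micro-instruction items of one block iteration. The class and function names are illustrative, and treating TripCount as the number of weights in the submatrix (rows times columns) is an assumption made here for concreteness; the text only states that the partition sizes are turned into TripCount values.

```python
from dataclasses import dataclass
from typing import List, Tuple

SHARING_FLAGS = ("local", "horizontal", "vertical")


@dataclass
class MicroInstruction:
    sharing: str         # Sharing flag: which buffers the item reads and writes
    trip_count: int      # TripCount: workload size of the submatrix
    row_idx: List[int]   # RowIdx entries
    col_idx: List[int]   # ColIdx entries


def items_for_block(parts: List[Tuple[List[int], List[int]]]) -> List[MicroInstruction]:
    """Build the three items of one block iteration.

    `parts` holds (row_idx, col_idx) for the local, horizontally shared and
    vertically shared submatrices, in that order.
    """
    if len(parts) != len(SHARING_FLAGS):
        raise ValueError("expected exactly three submatrices")
    return [
        # Assumption: TripCount is the number of weights, rows x columns.
        MicroInstruction(flag, len(rows) * len(cols), list(rows), list(cols))
        for flag, (rows, cols) in zip(SHARING_FLAGS, parts)
    ]


# Hypothetical block: a 2x3 local part, a 1x2 horizontally shared part,
# and nothing shared vertically.
items = items_for_block([([0, 3], [1, 2, 5]), ([4], [0, 1]), ([], [])])
```

An empty submatrix simply yields an item with a TripCount of zero, which corresponds to a PEGroup that receives no shared workload in that direction.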
Micro-Instruction Compilation
The compilation of micro-instructions is essentially a search for the workload partition
scheme that achieves the optimal balance, which facilitates a higher hardware utilization
(efficiency). Specifically, the compiler analyzes the weight matrices and selects the proper
partition variables (as in Fig. 8·9(a)) for each kernel matrix. Every K×L blocks (one block
iteration), which are executed on the PEGroups concurrently, are analyzed individually. Within
one block iteration, each PEGroup is supposed to take an equal workload after balancing.
We regard the search for the optimal partition variables as a satisfiability modulo theories
(SMT) problem [Winter and Smith, 1992], which searches for feasible solutions in the
constrained region. The existing SMT solver [De Moura and Bjørner, 2008] takes the
constraints expressed in logic programming and gives the satisfiability (existence of a
solution) and the feasible solutions. In the compilation for each block iteration, we declare
the integer variables m′(k, l), n′(k, l), ∆mh(k, l), ∆nh(k, l), ∆mv(k, l), ∆nv(k, l), where
k ∈ [1,K] and l ∈ [1,L]. The constraints are represented with constraint logic programming
(CLP), in which each clause gives a specific search limitation. The CLP in compilation is
listed in Eqn. 8.7, where ∧ represents logic AND and ∨ represents OR. CLP1,2 constrain the
feasible search region, as the size of the partitioned workload should be ≤ the kernel matrix
size (m×n). CLP3,4 guarantee regular partitions as in Fig. 8·9(a). CLP5 determines the values
of m′ and n′. To improve the PEGroup utilization, we set the CLP6 constraint that the size of
the partitioned workload should be an integer multiple of the PEGroup size. Thus, the PEs are
fully utilized on the shared workload partition. This also helps to shrink the search space
and speed up the compilation. In the idealized situation, each PEGroup is scheduled with a
workload that equals the average value over all PEGroups in the current block iteration.
Otherwise, the PEGroup with the maximum workload determines the run time (clock cycles) for
this iteration. CLP7 gives the constraint on the maximum workload: for all PEGroups, the
deviation of the scheduled workload from the average value (avg) should be ≤ margin, which is
given before the search. The last CLP combines all the above constraints into a conjunctive
form, which is subsequently sent to the SMT solver for a feasible solution.
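\begin{align}
\mathrm{CLP}_1 &: \{0 \le \Delta m_h(k,l) \le m(k,l)\} \wedge \{0 \le \Delta n_h(k,l) \le n(k,l)\} \notag\\
\mathrm{CLP}_2 &: \{0 \le \Delta m_v(k,l) \le \lfloor m(k,l)/2 \rfloor\} \wedge \{0 \le \Delta n_v(k,l) \le n(k,l)\} \notag\\
\mathrm{CLP}_3 &: \{\Delta m_h(k,l) = m(k,l)\} \wedge \{\Delta n_v(k,l) + \Delta n_h(k,l) = n(k,l)\} \notag\\
\mathrm{CLP}_4 &: \{\Delta n_v(k,l) = n(k,l)\} \wedge \{\Delta m_h(k,l) + \Delta m_v(k,l) = m(k,l)\} \notag\\
\mathrm{CLP}_5 &: \{m'(k,l) = m(k,l) - \Delta m_v(k,l)\} \wedge \{n'(k,l) = n(k,l) - \Delta n_h(k,l)\} \notag\\
\mathrm{CLP}_6 &: \{\Delta m_h(k,l) \,\%\, P = m'_v(k,l) \,\%\, P = 0\} \wedge \{\Delta n_h(k,l) \,\%\, Q = n'_v(k,l) \,\%\, Q = 0\} \notag\\
\mathrm{CLP}_7 &: \big|\, \big(m'(k,l) \times n'(k,l) + \Delta m_h(k,l-1) \times \Delta n_h(k,l-1) \notag\\
&\qquad + \Delta m_v(k-1,l) \times \Delta n_v(k-1,l)\big) - avg \,\big| \le margin \notag\\
\mathrm{CLP} &: \mathrm{CLP}_1 \wedge \mathrm{CLP}_2 \wedge (\mathrm{CLP}_3 \vee \mathrm{CLP}_4) \wedge \mathrm{CLP}_5 \wedge \mathrm{CLP}_6 \wedge \mathrm{CLP}_7 \tag{8.7}
\end{align}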
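To make the per-block part of Eqn. 8.7 concrete, the sketch below enumerates the partitions of a single m×n kernel matrix that satisfy CLP1 to CLP6 by brute force. It is only an illustration: exhaustive enumeration stands in for the SMT solver, CLP7 is left out because it couples neighboring PEGroups, and CLP6 is read here, following the prose, as requiring the shared partition sizes (∆mh, ∆mv, ∆nh, ∆nv) to be multiples of P and Q, which is an assumption about the intended meaning.

```python
from itertools import product


def feasible_partitions(m, n, P, Q):
    """Enumerate (m', n', dmh, dnh, dmv, dnv) satisfying CLP1-CLP6 for one block."""
    solutions = []
    # The ranges below enforce CLP1 and CLP2 (note dmv <= floor(m / 2)).
    for dmh, dnh, dmv, dnv in product(range(m + 1), range(n + 1),
                                      range(m // 2 + 1), range(n + 1)):
        clp3 = dmh == m and dnv + dnh == n
        clp4 = dnv == n and dmh + dmv == m
        # Assumed reading of CLP6: shared partitions align with the P x Q PEGroup.
        clp6 = (dmh % P == 0 and dmv % P == 0 and
                dnh % Q == 0 and dnv % Q == 0)
        if (clp3 or clp4) and clp6:
            m_loc, n_loc = m - dmv, n - dnh   # CLP5
            solutions.append((m_loc, n_loc, dmh, dnh, dmv, dnv))
    return solutions


sols = feasible_partitions(8, 8, 4, 4)
```

For every tuple returned, the three terms m′×n′, ∆mh×∆nh, and ∆mv×∆nv, which are the per-partition products that CLP7 sums for each PEGroup, add up to m×n, so either branch of CLP3 ∨ CLP4 accounts for the whole kernel workload.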
Algorithm 2: Micro-Instruction Compilation
input : CSB pruned weight matrix Wcsb;
        block size in CSB M×N; weight matrix size W×H;
        size of each PEGroup P×Q; PEGroup count K×L
output: Micro-instruction list MicroInst
// Temporal block iterations in vertical.
for i ← 1 to ⌈H/N/K⌉ do
    // Temporal block iterations in horizontal.
    for j ← 1 to ⌈W/M/L⌉ do
        margin = 0
        // ∀k ∈ [1,K], ∀l ∈ [1,L].
        [m(k, l), n(k, l), avg] = Analyze(Wcsb, i, j)
        // Search with multiple rounds.
        repeat
            CLP = BuildCLP(m(k, l), n(k, l), avg, margin)
            // Give solution if satisfied.

Based on the above formulation, we propose the compilation scheme in Algorithm 2
that seeks out the optimal scheduling solution. For a given CSB formatted weight matrix
Wcsb, the compiler partitions it into ⌈W/M/L⌉×⌈H/N/K⌉ temporal block iterations and
schedules each iteration individually. Before the multi-round search, the compiler first
analyzes the weight partition for the current block iteration, which gives the kernel matrix
size (m,n) for each block and the average workload (avg). The margin is initialized to 0,
which targets an idealized average workload on each PEGroup. In each search round,
BuildCLP constructs the constraint representation, which is input to SMTSolver. In case
the constraints cannot be satisfied (Satisfiability is False) over the feasible region, the
margin value is increased by P×Q in the next round. Once the SMT problem is satisfied, the
search stops and the partition variables (m′, n′, ∆mv, ∆nv, ∆mh, ∆nh) for each PEGroup are
assembled and appended to the micro-instruction list, which conducts the CSB-Engine
computation in a workload-balanced fashion.
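The multi-round search of Algorithm 2 can be outlined as below. This is a sketch under stated assumptions: the callable solve stands in for BuildCLP followed by SMTSolver and is expected to return None when the constraints are unsatisfiable, the round limit is a safety guard that the text does not mention, and the toy solver at the end exists only to exercise the loop.

```python
def search_partition(solve, step, max_rounds=1000):
    """Relax the margin round by round until a feasible partition is found.

    solve(margin) -> solution, or None if the constraints are unsatisfiable.
    step is the margin increment per failed round (P * Q in the text).
    """
    margin = 0  # round 0 targets the idealized average workload
    for _ in range(max_rounds):
        solution = solve(margin)
        if solution is not None:
            return margin, solution
        margin += step
    raise RuntimeError("no feasible partition within the round limit")


# Toy stand-in for the solver: feasible only once the margin reaches 40.
def toy_solve(margin):
    return "partition" if margin >= 40 else None


P, Q = 4, 4
result = search_partition(toy_solve, step=P * Q)
```

Because the margin grows in steps of P×Q, the first satisfiable round gives the tightest balance that can be expressed at PEGroup granularity, which is why the search can stop at the first success.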
8.6 Evaluation
In this section, we first briefly describe the implementation of the CSB-RNN framework
(Section 8.6.1), and then give in-depth evaluations, from the performance of the CSB pruning
algorithm (Section 8.6.2) to the improvement with the architecture-compilation co-design
(Section 8.6.3). Meanwhile, 10 mainstream RNN models from multiple domains are used as the
evaluation benchmarks and presented in Table 8.1, in which we also list the non-structured
pruning rates as the theoretical optimum.
8.6.1 Implementation and Experiments Setup
The CSB pruning flow was implemented with PyTorch [Paszke et al., 2019], a framework
for deep learning model development. The benchmark models were first trained with
SGD, and the resulting accuracy is regarded as the lossless target value in the subsequent
CSB pruning. These baseline models were fed into the CSB pruning flow and compressed under
the lossless constraint. Regarding the architecture-compilation co-design, the proposed
RNN dataflow architecture was realized in Verilog RTL and implemented on an FPGA
vendor evaluation board (Xilinx-ZCU102), on which the FPGA contains enough resources
for our architecture at different design scales. The compiler was implemented in C++
with the strategies in Section 8.5 and Z3 [De Moura and Bjørner, 2008] as the SMT solver.
With the CSB pruned model, the compiler dumps the macro-instructions (Section 8.5.1)
to build the proper RNN dataflow and the micro-instructions (Section 8.5.2) for workload
balancing. These instructions are loaded into the RNN dataflow architecture before the
sequences are processed continuously. A cycle-level RTL simulator was built to profile the
detailed hardware efficiency.
8.6.2 Evaluation of CSB pruning Rate
The CSB pruning is first evaluated in terms of pruning rate, which is a significant metric
for scoring model compression methods. As the parameterizable block size determines
the structural granularity in pruning, we present the attainable maximum pruning rate with
various block sizes. Further, a comparison with the prior art RNN compression schemes is
also given in this subsection.
Selection of Optimum Structural Granularity
CSB pruning provides flexibility that improves the pruning rate while also keeping
hardware-friendly regularity. Importantly, a trade-off exists between these two targets, which
motivates the following investigation. Reducing the block size facilitates more fine-grained
pruning and thus a higher pruning rate. However, more individual blocks require extra
storage for the row and column indices in the CSB-formatted weight matrix (Fig. 8·3).
Therefore, we present both the attainable pruning rate and the index overhead with different
block sizes for each benchmark model. The block is set to square with sizes of 16, 32, 64,
and 128, considering the weight matrix dimensions in different models. Note that for a matrix with
Table 8.1: Benchmark Models in CSB-RNN Evaluation. MT: Machine Translation; SR: Speech
Recognition; SPP: Stock Price Prediction; SC: Sentiment Classification; QA: Question
Answering; App: Application; #L: Layer; LI: Layer Index; IN: Input Neuron; HN: Hidden
Neuron; EM: Evaluation Metric; PPL: Perplexity; PER: Phoneme Error Rate; Acc: Accuracy;
NPD: Normalized Price Dist. Datasets used are PTB [Marcus et al., 1993], TIMIT [Garofolo
et al., 1993], TDIGIT [Leonard et al., 1993], S&P500, IMDB [Maas et al., 2011], MR [Pang
and Lee, 2005], and BABI [Weston et al., 2015]. RNN cell types used are LSTM [Hochreiter
and Schmidhuber, 1997], LSTMP [Sak et al., 2014], GRU [Cho et al., 2014], and Li-GRU
[Ravanelli et al., 2018].
Column groups: Original Model (#Weight+Bias, Result); Non-Structured Pruning (PruneRate,
#Weight, Result). Empty cells were not recovered from the extraction.
No. | App | Dataset | #L | Cell | LI | IN | HN | EM | #Weight+Bias | Result | PruneRate | #Weight | Result
 | | | | | 2 | 256 | 256 | | 524K+1K | | 13.2× | 39.7K |
 | | | | | 4 | 1500 | 1500 | | 18M+6K | | 16.3× | 1.1M |
 | | | | | 6 | 512 | 1024 | | 4.72M+4K | | 14.5× | 325.4K |
 | | | | | 8 | 1024 | 1024 | | 6.3M+3K | | 21.7× | 289.9K |
 | | | | | 10 | 512 | 512 | | 1M | | 7.1× | 147.7K |
6 | SR4 | TDIGIT | 1 | GRU | 11 | 39 | 256 | Acc | 226.6K+0.8K | 99.98% | 25.7× | 8.8K | 99.21%
 | | | | | 13 | 128 | 128 | | 131K+0.5K | | 4.1× | 32K |
 | | | | | 15 | 512 | 512 | | 2.1M+2K | | 10.4× | 201.6K |
 | | | | | 16 | 512 | 512 | | 2.1M+2K | | 10.4× | 201.6K |
9 | SC2 | MR | 1 | LSTM | 17 | 50 | 256 | Acc | 313.3K+1K | 78.23% | 7.2× | 43.5K | 76.31%
 | | | | | 19 | 256 | 256 | | 524.3K+1K | | 7.9× | 66.4K |
 | | | | | 20 | 256 | 256 | | 524.3K+1K | | 7.9× | 66.4K |
Result values whose column could not be recovered: 85.65% (model containing LI 15 and 16);
64.51% (model containing LI 19 and 20).
very small size (e.g., 256×39 in SR4), the short dimension (39) is partitioned into Q blocks
uniformly after padding a zero-column. Multiple layers in one model adopt the same pruning
rate. The attainable pruning rate for each case is presented in Fig. 8·10(a). Further,
the index overheads are divided by the corresponding weight counts for normalization, and
the values of the normalized index overhead (NIO) are presented in Fig. 8·10(b). Notably,
the results with non-structured pruning are given for comparison (leftmost bar for each
application), and its index overhead is obtained by compressing the non-structured weight
Figure 8·10: (a) shows the pruning rate comparison between non-structured
pruning (optimum) and CSB pruning in different block sizes. (b) shows
the normalized index overhead (NIO). Comparing (a) and (b), we gain the
insight that CSB pruning dramatically reduces the NIO while maintaining a
high pruning rate.
matrices with the compressed sparse row (CSR) format.
As a result, the CSB pruning rate ranges from 3.5× to 25×, which reduces the original
model size by as much as an order of magnitude. With the growth of the block size, the
pruning rate decreases, as the coarse-granularity block reduces the pruning flexibility. We
note that, in all benchmarks, the CSB pruning is capable of reaching its maximum pruning rate
with a block size of 16 or 32, which is close to that of non-structured pruning. In terms of
NIO, the index overhead of non-structured pruning exceeds 100%, as at least one index is required
Table 8.2 (caption, column headers, and the label of the first group were not recovered from the extraction)
 | column pruning [Wang et al., 2019b] | 8× | 16-bit | PPL | 112.73 | 1×
 | CSB pruning | 12.5× | 16-bit | PPL | 112.02 | 1.6×
MT2 | row-column [Wen et al., 2018] | 3× | floating | PPL | 82.59 | 1×
MT2 | bank balanced [Cao et al., 2019] | 5× | 16-bit | PPL | 82.59 | 1.65×
MT2 | CSB pruning | 12× | 16-bit | PPL | 82.33 | 3.9×
SR1 | block circulant [Wang et al., 2018a] | 8× | 16-bit | PER | 24.57% | 1×
SR1 | row balanced [Han et al., 2017] | 8.9× | 16-bit | PER | 20.70% | 1.1×
SR1 | bank balanced [Cao et al., 2019] | 10× | 16-bit | PER | 23.50% | 1.3×
SR1 | CSB pruning | 13× | 16-bit | PER | 20.10% | 1.6×
SR2 | block circulant [Li et al., 2019b] | 8× | 16-bit | PER | 20.20% | 1×
SR2 | CSB pruning | 20× | 16-bit | PER | 20.01% | 2.5×
SR4 | column pruning [Gao et al., 2018a] | 14.3× | 16-bit | Acc | 98.43% | 1×
SR4 | CSB pruning | 23× | 16-bit | Acc | 99.01% | 1.6×
for a non-zero element. Nevertheless, for CSB pruning, the NIO is below 50% in most
cases due to the index reusability in the structured blocks. The NIO decays significantly
as the block size is enlarged. With a block size of 32, the NIO declines to ≈ 20%,
which is 1/5 of that in non-structured pruning. Interestingly, we gain the insight that with
block sizes of 32 and 16, most models achieve close pruning rates, for instance, 13×
and 12× in MT1, and 20× for both in SR2. Therefore, the larger block size (32) is preferable
for its low index overhead.
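The gap between the two index overheads can be illustrated with a short calculation. The sketch counts indices rather than bits and assumes that each CSB block stores one index per surviving row and one per surviving column of its kernel matrix; both choices are simplifications made here and may differ from the exact accounting behind Fig. 8·10(b). The CSR count uses the standard layout of one column index per non-zero plus rows + 1 row pointers.

```python
def nio_csr(nnz, rows):
    """Normalized index overhead of CSR: indices per stored weight."""
    # One column index per non-zero, plus (rows + 1) row pointers.
    return (nnz + rows + 1) / nnz


def nio_csb(kernel_shapes):
    """Normalized index overhead of CSB (assumed per-block index layout).

    kernel_shapes: list of (m, n), the surviving rows and columns per block.
    """
    index_count = sum(m + n for m, n in kernel_shapes)
    weight_count = sum(m * n for m, n in kernel_shapes)
    return index_count / weight_count


csr_overhead = nio_csr(nnz=100, rows=10)    # always above 1.0 (100%)
csb_overhead = nio_csb([(8, 8)] * 4)        # 16 indices per 64 weights
```

Under these assumptions a block with an m×n kernel matrix costs 1/m + 1/n indices per weight, so the overhead falls as the kernel matrices grow, in line with the trend that larger blocks give a lower NIO.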
Comparison with Prior Art Compression Schemes
The CSB pruning rate is further compared to the prior art RNN compression techniques
in Table 8.2. The listed competing techniques were proposed to enable faster,
hardware-friendly RNN inference with the compressed model. Note that these competitors
quantized the weights to 16-bit fixed-point numbers; thus, we apply the same quantization to
the CSB pruned model and report the corresponding results for a fair comparison. In Table 8.2,
the row-column [Wen et al., 2018] technique prunes each weight matrix as an entire block.
Compared to it, our fine-grained CSB pruning improves the compression rate by 3.9×.
The row balanced [Han et al., 2017] and bank balanced [Cao et al., 2019] techniques
forcibly train the model to a balanced sparse pattern; in contrast, CSB pruning retains
the natural unbalanced sparsity in the RNN model and achieves a higher (1.6×) pruning rate.
Overall, the CSB pruning improves the pruning rate to 1.6×-3.9× that of the existing schemes,
while maintaining an even better model accuracy.
8.6.3 Evaluation of RNN dataflow Architecture with CSB Pruned Model
Hardware-resource Consumption
The hardware-resource consumption (cost) of the RNN dataflow architecture is given in
Fig. 8·11, with various CSB-Engine configs (P, Q, K, L, and the maximum supported block size).
Notably, the CSB-Engine with different workload sharing configs, including no-sharing,
vertical-sharing, horizontal-sharing, and 2D-sharing, is synthesized individually to evaluate
the hardware overhead of the workload sharing technique. The consumption of hardware
logic and memory reported by the FPGA vendor tool is presented in Fig. 8·11. The configurable
logic block (CLB, left axis) is the FPGA building block for logic and is used as the logic
resource metric; the memory resource is given in megabits (Mb, right axis). Note that
most of the memory resources on our FPGA device are configured as the weight buffer, although
they may not be fully used by small RNN models. The multiplier in each PE (16-bit
fixed-point) is mapped to a digital signal processor (DSP) on the FPGA, and the DSP count in
the design is ≈ P×Q×K×L, which is omitted here. As a result, the hardware support for workload
sharing costs an acceptable overhead, which is 11.6%, 3.8%, and 15.6% for the three sharing
cases (vertical/horizontal/2D-sharing), respectively.
Performance
Due to the workload imbalance issue, the processing performance of RNN dataflow archi-
tecture, CSB-Engine in specific, is not deterministic. Hardware efficiency, the ratio of effec-





































Figure 8·11: Hardware resource consumption with multi CSB-Engine configs.
technique. We obtained the CSB-Engine efficiency by measuring the PE pipeline utilization
on the 10 benchmarks listed in Table 8.1 with different design choices of workload sharing.
Moreover, CSB pruned models with different block sizes are used to evaluate the impact of
block size on efficiency. The efficiency is measured layer-by-layer on hardware with 4×4
PEGroups, each of which contains 4×4 PEs. The results are presented in Fig. 8·12. Overall,
for the CSB-Engine without workload sharing, the efficiency is 42% on average, which
results from the imbalanced workload (sparsity) of blocks. The single-dimensional sharing
(vertical or horizontal) improves the efficiency to an average of 72%. After the 2D-sharing
is adopted, the efficiency is further improved to 94% on average, i.e., only 6% of the
CSB-Engine execution time is idle. This 6% pipeline gap is inevitable, as extremely
imbalanced sparsity exists in a few weight matrices. For instance, we found diagonally dense
matrices, in which the blocks on the matrix diagonal contain significantly more workload than
the other blocks. In this case, the workload sharing paths in the current design are not
sufficient, while adding more sharing paths brings extra hardware costs.
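The link between workload imbalance and pipeline utilization can be sketched as follows. The model assumes, as stated in Section 8.5.2, that the PEGroup with the maximum workload determines the run time of a block iteration, and it ignores every other source of stalls; the function names and the sample workloads are illustrative and are not measurements.

```python
def iteration_efficiency(workloads):
    """Utilization of one block iteration, given per-PEGroup workloads."""
    busiest = max(workloads)
    if busiest == 0:
        return 1.0  # no work scheduled, so no PE cycles are wasted
    # All PEGroups wait for the busiest one to finish.
    return sum(workloads) / (busiest * len(workloads))


def engine_efficiency(iterations):
    """Utilization over a sequence of block iterations."""
    useful = sum(sum(w) for w in iterations)
    occupied = sum(max(w) * len(w) for w in iterations)
    return useful / occupied if occupied else 1.0


balanced = iteration_efficiency([4, 4, 4, 4])     # 1.0
imbalanced = iteration_efficiency([8, 2, 2, 4])   # 0.5
overall = engine_efficiency([[4, 4], [8, 0]])     # 16 / 24
```

Workload sharing raises this ratio by moving work from the busiest PEGroup to its neighbors, which lowers the maximum while leaving the sum unchanged.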
Comparing the efficiency within the same layer but with different pruning block sizes, it is
apparent that the smaller the block size applied, the lower the hardware efficiency the
CSB-Engine can achieve, particularly in the no-sharing CSB-Engine cases. This is because a
small block includes less workload (with the same pruning rate) but requires more temporal
block iterations, which makes the PEs idle more easily. As mentioned in Section 8.6.2, using
smaller block sizes in compression yields higher model pruning rates, but these benefits are
significantly eroded by the performance degradation with small compression blocks in the
no-sharing cases. Nevertheless, we gain the insight that our architecture-compilation
co-design for the 2D-sharing cases significantly subdues the degradation. For instance, in
Layer-2 (L2) of the MT1 case, the no-sharing degradation from block-64 to block-32 is 12%,
while it is reduced to 3% by the 2D-sharing. On average, the degradation is reduced from 15%
to 4%. In summary, with the proposed workload sharing technique, a smaller block size in
CSB pruning no longer brings significant degradation of hardware efficiency (only
4% on average), so that the benefits of higher pruning rates can be more fully
exploited.
Comparison with Related Work
The overall performance of CSB-RNN, i.e., the CSB pruned model inference speed on the
proposed RNN dataflow architecture, is listed in Table 8.3 and compared with the prior
art designs. We collected statistics including the PE count (#PE), the operating frequency,
the latency in processing one input frame, and the power of the design. As Table 8.3 shows,
with the same benchmark applications, CSB-RNN reduces the latency by 11%-92%, which speeds
up the processing by 1.12× to 12.57× correspondingly; moreover, CSB-RNN uses only
19%-34% of the PE count (hardware resources) of the competitors to attain this performance.
The latency ranges from 0.79µs to 6.58µs with different model sizes. For generic
high-precision speech recognition, at most ≈ 2000 frames should be processed per second, which
requires a latency ≤ 500µs to meet the realtime requirement. As the achieved latency
with the benchmark models is much lower than this requirement, CSB-RNN provides
faster-than-realtime performance and enables the device to process more complex RNN
models in the future. Besides the latency, we compare the power efficiency (k-frames
Figure 8·12: The efficiency (utilization) of the proposed architecture with
different sharing strategies (legend: No-Sharing; Vertical-Sharing Only;
Horizontal-Sharing Only; 2D-Sharing, with both vertical & horizontal). The novel
workload sharing technique significantly improves the average efficiency from 42%
(no-sharing) to 94% (2D-sharing). This improvement fully exploits the benefits of
fine-grained CSB pruning.
per Watt) among these competitive designs. The results show that CSB-RNN achieves
significant improvements of 3.53× to 58.89× in power efficiency when processing the
same model, which makes CSB-RNN well suited for embedded scenarios. Further,
while the existing works were designed for a particular RNN cell type, CSB-RNN can be
reprogrammed to adapt to different cells.
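The power-efficiency figures in Table 8.3 appear to follow from the latency and power columns as frames per second divided by power. The sketch below reproduces that arithmetic under the assumption that one frame is processed per latency period; the function names are illustrative. It matches most rows of the table to two decimals, and small differences can arise where the listed latency is rounded.

```python
def k_frames_per_watt(latency_us, power_w):
    """Power efficiency in k-frames/W from per-frame latency and power."""
    frames_per_second = 1e6 / latency_us   # assumes one frame per latency period
    return frames_per_second / 1e3 / power_w


def meets_realtime(latency_us, frames_per_second=2000):
    """Check the latency budget for a given frame rate (500 us at 2000 frames/s)."""
    return latency_us <= 1e6 / frames_per_second


bbs_mt1 = k_frames_per_watt(1.30, 19)    # BBS on MT1, about 40.49
csb_sr1 = k_frames_per_watt(6.58, 8.9)   # CSB-RNN on SR1, about 17.08
improvement = csb_sr1 / k_frames_per_watt(82.70, 41)  # versus ESE on SR1
```

The same helper shows why every benchmark latency in the table is comfortably faster than realtime: the largest value, 6.58µs, is far below the 500µs budget.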
8.7 Conclusion
This chapter presents CSB-RNN, an optimized full-stack RNN acceleration framework.
The fine-grained structured CSB pruning significantly improves the pruning rate compared
Table 8.3: Latency and Power Efficiency Comparison
Abbr. | Work | #PE | Freq. (MHz) | Latency (µs) | Power (Watt) | Power Eff. (k-frames/W) | Power Eff. Improv.
MT1 | BBS [Cao et al., 2019] | 1518 | 200 | 1.30 | 19 | 40.49 | 1×
MT1 | CSB-RNN | 512 | 200 | 0.79 | 8.9 | 142.72 | 3.53×
SR1 | C-LSTM [Wang et al., 2018a] | 2680 | 200 | 8.10 | 22 | 5.61 | 19.35×
SR1 | E-RNN [Li et al., 2019b] | 2660 | 200 | 7.40 | 24 | 5.63 | 19.41×
SR1 | ESE [Han et al., 2017] | 1504 | 200 | 82.70 | 41 | 0.29 | 1×
SR1 | CSB-RNN | 512 | 200 | 6.58 | 8.9 | 17.08 | 58.89×
SR2 | E-RNN [Li et al., 2019b] | 2280 | 200 | 6.70 | 29 | 5.15 | 1×
SR2 | CSB-RNN | 512 | 200 | 5.18 | 8.9 | 21.69 | 4.21×
to existing hardware-friendly pruning schemes. Meanwhile, an architecture-compilation
co-design is proposed that fully exploits the benefits of the CSB pruned model. The
experiments show that the entire CSB-RNN acceleration framework delivers
faster-than-realtime performance on a wide range of RNN models, and dramatically reduces the
latency and improves the power efficiency compared with the existing works.
Future work: We are extending the CSB technique to other neural network layers. In
particular, transformer models are composed of more complex dataflows but use the
same MVM primitive as RNNs. With improvements to the dataflow abstraction, the proposed
CSB pruning and CSB-Engine will contribute to realtime transformer inference.
Furthermore, this chapter demonstrates that proper model regularization can effectively reduce
RNNs’ irregularity without causing significant accuracy loss.
To summarize, this dissertation suggests that both the model regularization approach used in
this chapter and the dynamic hardware approaches used in Chapters 3-7 are important and
useful in NN acceleration. They should be selected properly for different applications and
scenarios. However, research on and exploration of hardware-only solutions remain
important for the following reasons: (1) for safety- and mission-critical applications
that cannot use model regularization, hardware-only solutions for irregular models
help realize the potential of adopting NNs in these applications; (2) for the
applications which are friendly to model regularization, hardware-only solutions help relax the




Conclusions and Future Work
9.1 Conclusions
In this dissertation, we characterize common problems in NN processing and classify them
into four types of irregularities: data-structure-level, operation-level, bit-level, and
model-level. We have created a set of novel HW architectures that improve performance by
handling each type of irregularity so that the (original) optimal models, with significant
irregularities, can be accelerated with almost perfect hardware efficiency and without loss
of accuracy. We also demonstrate that the performance improvement due to the proposed
methods is compelling. Our thesis is that high-accuracy and high-performance NN
inference and training can be achieved by creating a series of novel irregularity-aware
architectures for FPGAs.
Starting with data-structure-level irregularity, we propose the AWB-GCN architecture,
which is able to execute Sparse Matrix Multiplications (SpMM) with over 90% system
efficiency and efficient off-chip memory access. AWB-GCN is equipped with hardware-based
runtime workload rebalancing techniques to address the workload imbalance issue in
multiplying sparse matrices with significantly irregular non-zero distributions. We
evaluate the efficiency of the AWB-GCN architecture with Graph Convolutional Networks (GCNs),
which are believed to have more irregular, sparser, and larger sparse matrices than
traditional DNNs. Experiments show that AWB-GCN can provide considerable speedups
over CPUs (3255×), GPUs (80.3×), and a prior GCN accelerator (5.1×).
We then propose the O3BNN-R architecture to address the operation-level irregularity of
DNNs. The Out-of-Order-BNN (O3BNN-R) architecture is able to detect the redundant
operations in Quantized Neural Network (QNN) inference and prune them at runtime. The
redundancy of operations is data-dependent; therefore, the operational pruning
opportunities are unpredictable and irregular. We evaluate our design on an embedded FPGA
using networks that include VGG-16, AlexNet for ImageNet, and a VGG-like network for
Cifar-10. Results show that O3BNN-R can prune, on average, 30% of the operations without
any accuracy loss, bringing a 2.1× inference speedup and, on average, a 34× energy-efficiency
improvement over state-of-the-art BNN implementations on FPGAs.
CQNN is proposed to address the bit-level irregularity of DNNs. CQNN uses a
Coarse-Grained Reconfigurable Architecture (CGRA) which is composed of basic components
for binary functions. By programming CQNN at runtime according to the target DNN
model, these basic components are integrated to support DNN functions with any
data-width and hyper-parameter requirements. Experimental results show that CQNN can complete
the inference of AlexNet and VGG-16 within 0.13ms and 2.63ms, respectively.
Finally, FPDeep is proposed to deliver high-performance and large-scale DNN training.
FPDeep uses a hybrid of model and layer parallelism to configure distributed reconfigurable
clusters to train DNNs. In order to address model-level irregularity, novel model partition-
ing schemes are proposed to balance workloads and storage among nodes. By using hybrid
parallelism in DNN training, FPDeep avoids the large gaps between training and testing ac-
curacy due to the improper convergence to sharp minimizers; these are inevitable given the
usual approaches that use large training batches. With 250 Gb/s bidirectional bandwidth
per FPGA, which is easily supported by current generation FPGAs, FPDeep performance
shows complete scalability up to 100 FPGAs.
9.2 Future Work
We briefly discuss future work. In this dissertation, we create four new architectures to
handle the four types of irregularity in NN processing and demonstrate their efficiency
using four different types of NN models and tasks. Each of these models and tasks has
at least one type of significant irregularity. Currently, these architectures have not been
fully integrated to handle multiple types of irregularities at the same time. However, these
architectures have the potential to be integrated efficiently in a hierarchical way. As future
work, we can build an FPGA-based acceleration framework for irregular models that can
freely integrate several proposed architectures according to the types of irregularities in the
target models.
Figure 9·1 shows a model of this framework. The software stack consists of 4 parts: (1)
a PyTorch-based model description, which includes the hyper-parameters and configurations of
the target NNs; (2) an accelerator design space explorer (Accelerator DSE), which explores
the design space, decides which architectures need to be used for the target model, and
determines the parameters of the selected architectures; (3) a cycle-accurate simulator, which
evaluates whether the performance of the design given by the accelerator DSE meets the
acceleration target given by users and, if not, sends feedback to the accelerator
DSE for architecture tuning; (4) an RTL code generator, which, once the simulator verifies
that the current design meets the requirement, generates a well-optimized FPGA implementation
and maps the bitstreams onto the FPGA nodes.
The hardware design is based on the architectures proposed in Chapters 3-7. FPGA
nodes work collaboratively with the methods of FPDeep proposed in Chapter 7 to handle
model-level irregularity. Each FPGA node is equipped with a CGRA-based architecture,
CQNN, proposed in Chapter 6, to handle the bit-level irregularity. The architecture of
O3BNN-R, discussed in Chapter 5, is used as one of the basic function units in the CGRA
array. To handle operation-level irregularity, each O3BNN-R unit is equipped with a large
number of PEs which are connected in a circular communication network for efficient
processing and removal of operational redundancy. The PEs in the O3BNN-R engine can be
replaced with the AWB-GCN architecture proposed in Chapter 3. The AWB-GCN engine can perform
SpMMs with almost perfect hardware efficiency no matter how much data-structure-level
irregularity the target models have.
Figure 9·1: FPGA-based acceleration framework for irregular models and
the integration of the architectures of AWB-GCN, FPDeep, O3BNN-R and
CQNN.
References
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis,
A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I. J., Harp, A., Irving, G., Isard,
M., Jia, Y., Józefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga,
R., Moore, S., Murray, D. G., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever,
I., Talwar, K., Tucker, P. A., Vanhoucke, V., Vasudevan, V., Viégas, F. B., Vinyals, O.,
Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. (2016). TensorFlow:
large-scale machine learning on heterogeneous distributed systems. Computing Re-
search Repository (CoRR) in arXiv, abs/1603.04467.
Abou-Rjeili, A. and Karypis, G. (2006). Multilevel algorithms for partitioning power-
law graphs. In Proceedings 20th IEEE International Parallel Distributed Processing
Symposium, pages 10 pp.–. doi: 10.1109/IPDPS.2006.1639360.
Abu-El-Haija, S., Perozzi, B., Al-Rfou, R., and Alemi, A. (2018). Watch your step:
Learning node embeddings via graph attention. In Proceedings of the 32nd Interna-
tional Conference on Neural Information Processing Systems, page 9198–9208. doi:
10.5555/3327546.3327592.
Adamic, L. A., Lukose, R. M., Puniyani, A. R., and Huberman, B. A. (2001). Search in
power-law networks. Physical Review E. doi: 10.1103/PhysRevE.64.046135.
Aiello, W., Chung, F., and Lu, L. (2001). A random graph model for power law graphs.
Experimental Mathematics, 10(1):53–66. doi: 10.1080/10586458.2001.10504428.
Akhlaghi, V., Yazdanbakhsh, A., Samadi, K., Gupta, R. K., and Esmaeilzadeh, H. (2018).
SnaPEA: Predictive early activation for reducing computation in deep convolutional neu-
ral networks. In 2018 ACM/IEEE 45th Annual International Symposium on Computer
Architecture (ISCA), pages 662–673. doi: 10.1109/ISCA.2018.00061.
Albericio, J., Judd, P., Hetherington, T., Aamodt, T., Jerger, N. E., and Moshovos, A.
(2016). Cnvlutin: ineffectual-neuron-free deep neural network computing. In Proceed-
ings of the 43rd International Symposium on Computer Architecture, page 1–13. IEEE
Press. doi: 10.1109/ISCA.2016.11.
Alwani, M., Chen, H., Ferdman, M., and Milder, P. (2016). Fused-layer CNN accelerators.




Andri, R., Cavigelli, L., Rossi, D., and Benini, L. (2018). YodaNN: an architecture for
ultralow power binary-weight CNN acceleration. IEEE Transactions on Computer-
Aided Design of Integrated Circuits and Systems, 37(1):48–60. doi: 10.1109/T-
CAD.2017.2682138.
Asgari, B., Hadidi, R., Krishna, T., Kim, H., and Yalamanchili, S. (2020). ALRESCHA: a
lightweight reconfigurable sparse-computation accelerator. In 2020 IEEE International
Symposium on High Performance Computer Architecture (HPCA), pages 249–260. doi:
10.1109/HPCA47549.2020.00029.
Ashari, A., Sedaghati, N., Eisenlohr, J., Parthasarathy, S., and Sadayappan, P. (2014). Fast
sparse matrix-vector multiplication on GPUs for graph applications. In Proceedings
of the International Conference for High Performance Computing, Networking, Storage
and Analysis, page 781–792. IEEE Press. doi: 10.1109/SC.2014.69.
Bell, N. and Garland, M. (2008). Efficient sparse matrix-vector multiplication on CUDA.
Technical report, Nvidia Technical Report NVR-2008-004, Nvidia Corporation. url:
https://www.nvidia.com/docs/IO/66889/nvr-2008-004.pdf.
Bell, N. and Garland, M. (2009). Implementing Sparse Matrix-Vector Multiplication on
throughput-oriented processors. In Proceedings of the Conference on High Performance
Computing Networking, Storage and Analysis. Association for Computing Machinery.
doi: 10.1145/1654059.1654078.
Ben-Nun, T. and Hoefler, T. (2019). Demystifying parallel and distributed deep learn-
ing: An in-depth concurrency analysis. ACM Computing Surveys, 52(4). doi:
10.1145/3320060.
Benkrid, K. and Vanderbauwhede, W., editors (2013). High Performance Computing Us-
ing FPGAs. Springer Verlag. doi: 10.1007/978-1-4614-1791-0_4.
Bethge, J., Bartz, C., Yang, H., Chen, Y., and Meinel, C. (2020). MeliusNet: can binary
neural networks achieve MobileNet-level accuracy? Computing Research Repository
(CoRR) in arXiv, abs/2001.05936.
Blott, M., Preußer, T. B., Fraser, N. J., Gambardella, G., O’brien, K., Umuroglu, Y., Leeser,
M., and Vissers, K. (2018). FINN-R: an end-to-end deep-learning framework for fast
exploration of quantized neural networks. ACM Transactions on Reconfigurable Tech-
nology and Systems, 11(3). doi: 10.1145/3242897.
Blott, M., Preußer, T. B., Fraser, N., Gambardella, G., O’Brien, K., Umuroglu, Y., and
Leeser, M. (2017). Scaling neural network performance through customized hardware
architectures on reconfigurable logic. In 2017 IEEE International Conference on Com-
puter Design, pages 419–422. doi: 10.1109/ICCD.2017.73.
216
Boku, T., Kobayashi, R., Fujita, N., Amano, H., Sano, K., Hanawa, T., and Yamaguchi, Y.
(2019). Cygnus: GPU meets FPGA for HPC. In International Conference on Supercom-
puting. https://www.r-ccs.riken.jp/labs/lpnctrt/assets/img/ lspanc2020jan_boku_light
.pdf.
Bolaria, J. and Byrne, J. (2009). A Guide to FPGAs for Communications. The Linley
Group.
Bottou, L. (2010). Large-scale machine learning with stochastic gradient descent. In
Proceedings of COMPSTAT’2010, pages 177–186. Springer.
Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J., et al. (2011). Distributed op-
timization and statistical learning via the alternating direction method of multipliers.
Foundations and Trends® in Machine learning, 3(1):1–122.
Bruna, J., Zaremba, W., Szlam, A., and LeCun, Y. (2014). Spectral networks and lo-
cally connected networks on graphs. In the 2nd International Conference on Learning
Representations.
Canziani, A., Paszke, A., and Culurciello, E. (2016). An analysis of deep neural network
models for practical applications. Computing Research Repository (CoRR) in arXiv,
abs/1605.07678.
Cao, S., Zhang, C., Yao, Z., Xiao, W., Nie, L., Zhan, D., Liu, Y., Wu, M., and Zhang, L.
(2019). Efficient and effective sparse LSTM on FPGA with bank-balanced sparsity. In
Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable
Gate Arrays, pages 63–72.
Caulfield, A. M., Chung, E. S., Putnam, A., Angepat, H., Fowers, J., Haselman, M., Heil,
S., Humphrey, M., Kaur, P., Kim, J.-Y., Lo, D., Massengill, T., Ovtcharov, K., Pa-
pamichael, M., Woods, L., Lanka, S., Chiou, D., and Burger, D. (2016). A cloud-scale
acceleration architecture. In The 49th Annual IEEE/ACM International Symposium on
Microarchitecture. IEEE Press. doi: 10.5555/3195638.3195647.
Chen, J., Monga, R., Bengio, S., and Józefowicz, R. (2016). Revisiting distributed syn-
chronous SGD. Computing Research Repository (CoRR) in arXiv, abs/1604.00981.
Chen, J., Zhu, J., and Song, L. (2018). Stochastic training of graph convolutional net-
works with variance reduction. In Proceedings of the 35th International Conference on
Machine Learning, volume 80, pages 942–950.
Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C.,
and Zhang, Z. (2015). MXNet: a flexible and efficient machine learning library for
heterogeneous distributed systems. Computing Research Repository (CoRR) in arXiv,
abs/1512.01274.
217
Chen, Y., Krishna, T., Emer, J. S., and Sze, V. (2017). Eyeriss: an energy-efficient reconfig-
urable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State
Circuits, 52(1):127–138. doi: 10.1109/JSSC.2016.2616357.
Chen, Y., Li, K., Yang, W., Xiao, G., Xie, X., and Li, T. (2019). Performance-aware
model for sparse matrix-matrix multiplication on the Sunway TaihuLight supercom-
puter. IEEE Transactions on Parallel and Distributed Systems, 30(4):923–938. doi:
10.1109/TPDS.2018.2871189.
Chiu, M. and Herbordt, M. (2009). Efficient filtering for molecular dynamics simulations.
In 2009 International Conference on Field Programmable Logic and Applications. doi:
10.1109/ FPL15426.2009.
Chiu, M. and Herbordt, M. (2010). Molecular Dynamics simulations on high performance
reconfigurable computing systems. ACM Transactions on Reconfigurable Technology
and Systems, 3(4):1–37. doi: 10.1145/1862648.1862653.
Chiu, M., Herbordt, M., and Langhammer, M. (2008). Performance potential of Molec-
ular Dynamics simulations on high performance reconfigurable computing systems. In
2008 Second International Workshop on High-Performance Reconfigurable Computing
Technology and Applications. doi: 10.1109/ HPRCTA.2008.4745685.
Chiu, M., Khan, M., and Herbordt, M. (2011). Efficient calculation of pairwise nonbonded
forces. In 2011 IEEE 19th Annual International Symposium on Field-Programmable
Custom Computing Machines. doi: 10.1109/ FCCM.2011.34.
Cho, K., van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H.,
and Bengio, Y. (2014). Learning phrase representations using rnn encoder-decoder for
statistical machine translation. In Proceedings of the 2014 Conference on Empirical
Methods in Natural Language Processing (EMNLP), pages 1724–1734.
Chung, E., Fowers, J., Ovtcharov, K., Papamichael, M., Caulfield, A., Massengill, T., Liu,
M., Lo, D., Alkalay, S., Haselman, M., Abeydeera, M., Adams, L., Angepat, H., Boehn,
C., Chiou, D., Firestein, O., Forin, A., Gatlin, K. S., Ghandi, M., Heil, S., Holohan,
K., El Husseini, A., Juhasz, T., Kagi, K., Kovvuri, R. K., Lanka, S., van Megen, F.,
Mukhortov, D., Patel, P., Perez, B., Rapsang, A., Reinhardt, S., Rouhani, B., Sapek, A.,
Seera, R., Shekar, S., Sridharan, B., Weisz, G., Woods, L., Yi Xiao, P., Zhang, D., Zhao,
R., and Burger, D. (2018). Serving DNNs in real time at datacenter scale with project
Brainwave. IEEE Micro, 38(2):8–20. doi: 10.1109/MM.2018.022071131.
Chung, F., Lu, L., and Vu, V. (2004). The spectra of random graphs with given expected
degrees. Internet Mathematics, 1(3):257–275. doi: 10.1080/15427951.2004.
10129089.
218
Coley, C. W., Jin, W., Rogers, L., Jamison, T. F., Jaakkola, T. S., Green, W. H., Barzilay, R.,
and Jensen, K. F. (2019). A graph-convolutional neural network model for the prediction
of chemical reactivity. Chemical Science, 10:370–377. doi: 10.1039/C8SC04228D.
Cong, J., Sarkar, V., Reinman, G., and Bui, A. (2011). Customizable domain-specific
computing. IEEE Design Test of Computers, 28(2):6–15. doi: 10.1109/MDT.2010.
141.
Courbariaux, M., Bengio, Y., and David, J.-P. (2015). BinaryConnect: Training deep
neural networks with binary weights during propagations. In Proceedings of the 28th
International Conference on Neural Information Processing Systems, page 3123–3131.
MIT Press. doi: 10.5555/2969442.2969588.
Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., and Bengio, Y. (2016). Binarized
neural networks: Training deep neural networks with weights and activations constrained
to +1 or -1. Computing Research Repository (CoRR) in arXiv, abs/1602.02830.
Dai, H., Kozareva, Z., Dai, B., Smola, A., and Song, L. (2018). Learning steady-states of
iterative algorithms over graphs. In Proceedings of the 35th International Conference
on Machine Learning, volume 80, pages 1106–1114.
De Moura, L. and Bjørner, N. (2008). Z3: An efficient smt solver. In International
conference on Tools and Algorithms for the Construction and Analysis of Systems, pages
337–340. Springer.
Defferrard, M., Bresson, X., and Vandergheynst, P. (2016). Convolutional neural networks
on graphs with fast localized spectral filtering. In Proceedings of the 30th Interna-
tional Conference on Neural Information Processing Systems, page 3844–3852. doi:
10.5555/3157382.3157527.
Deveci, M., Trott, C., and Rajamanickam, S. (2017). Performance-portable sparse matrix-
matrix multiplication for many-core architectures. In 2017 IEEE International Par-
allel and Distributed Processing Symposium Workshops, pages 693–702. IEEE. doi:
10.1109/IPDPSW.2017.8.
Ding, C., Liao, S., Wang, Y., Li, Z., Liu, N., Zhuo, Y., Wang, C., Qian, X., Bai, Y., Yuan, G.,
et al. (2017). CirCNN: accelerating and compressing deep neural networks using block-
circulant weight matrices. In Proceedings of the 50th Annual IEEE/ACM International
Symposium on Microarchitecture, pages 395–408. doi: 10.1145/3123939.3124552.
Eran, H., Zeno, L., Tork, M., Malka, G., and Silberstein, M. (2019). NICA: An Infras-
tructure for Inline Acceleration of Network Applications. In USENIX Annual Technical
Conference.
Fey, M. and Lenssen, J. E. (2019). Fast graph representation learning with pytorch geo-
metric. Computing Research Repository (CoRR) in arXiv, abs/1903.02428.
219
Fortin, M. and Glowinski, R. (2000). Augmented Lagrangian methods: applications to the
numerical solution of boundary-value problems. Elsevier.
Fowers, J., Ovtcharov, K., Papamichael, M., Massengill, T., Liu, M., Lo, D., Alkalay,
S., Haselman, M., Adams, L., Ghandi, M., Heil, S., Patel, P., Sapek, A., Weisz, G.,
Woods, L., Lanka, S., Reinhardt, S. K., Caulfield, A. M., Chung, E. S., and Burger,
D. (2018). A configurable cloud-scale DNN processor for real-time AI. In 2018
ACM/IEEE 45th Annual International Symposium on Computer Architecture, pages 1–
14. doi: 10.1109/ISCA.2018.00012.
Fujii, T., Sato, S., and Nakahara, H. (2018). A threshold neuron pruning for a binarized
deep neural network on an FPGA. IEICE Transactions on Information and Systems,
E101.D(2):376–386. doi: 10.1587/transinf.2017RCP0013.
Gao, C., Neil, D., Ceolini, E., Liu, S.-C., and Delbruck, T. (2018a). Deltarnn: A power-
efficient recurrent neural network accelerator. In Proceedings of the 2018 ACM/SIGDA
International Symposium on Field-Programmable Gate Arrays, pages 21–30. ACM.
Gao, H., Wang, Z., and Ji, S. (2018b). Large-scale learnable graph convolutional networks.
In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Dis-
covery & Data Mining, pages 1416–1424. Association for Computing Machinery. doi:
10.1145/3219819.3219947.
Gao, J., Ji, W., Tan, Z., and Zhao, Y. (2020). A systematic survey of general sparse
matrix-matrix multiplication. Computing Research Repository (CoRR) in arXiv,
abs/2002.11273.
Garofolo, J. S., Lamel, L. F., Fisher, W. M., Fiscus, J. G., and Pallett, D. S. (1993). DARPA
TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1.
NASA STI/Recon Technical Report NISTIR 4930, 93.
Geng, T., Diken, E., Wang, T., Jozwiak, L., and Herbordt, M. (2018a). An access-pattern-
aware on-chip vector memory system with automatic loading for SIMD Architectures.
In 2018 IEEE High Performance Extreme Computing Conference. doi: 10.1109/H-
PEC.2018.8547551.
Geng, T., Li, A., Shi, R., Wu, C., Wang, T., Li, Y., Haghi, P., Tumeo, A., Che, S.,
Reinhardt, S., and Herbordt, M. C. (2020). AWB-GCN: a Graph Convolutional Net-
work accelerator with runtime workload rebalancing. In 2020 53rd Annual IEEE/ACM
International Symposium on Microarchitecture, pages 922–936. doi: 10.1109/MI-
CRO50266.2020.00079.
Geng, T., Li, A., Wang, T., Song, S., and Herbordt, M. (2018b). Binarized ImageNet
inference in 29µs. In International Conference for High Performance Computing, Net-
working, Storage and Analysis.
220
Geng, T., Waeijen, L., Peemen, M., Corporaal, H., and He, Y. (2016). MacSim:
a MAC-enabled high-performance low-power SIMD architecture. In 2016 Eu-
romicro Conference on Digital System Design (DSD), pages 160–167. IEEE. doi:
10.1109/DSD.2016.27.
Geng, T., Wang, T., Sanaullah, A., Yang, C., Patel, R., and Herbordt, M. (2018). A frame-
work for acceleration of CNN training on deeply-pipelined FPGA clusters with work and
weight load balancing. In 2018 28th International Conference on Field Programmable
Logic and Applications, pages 394–402. doi: 10.1109/FPL.2018.00074.
Geng, T., Wang, T., Sanaullah, A., Yang, C., Xuy, R., Patel, R., and Herbordt, M. (2018).
FPDeep: acceleration and load balancing of CNN training on FPGA clusters. In 2018
IEEE 26th Annual International Symposium on Field-Programmable Custom Computing
Machines, page 81–84. doi: 10.1109/ FCCM.2018. 00021.
Geng, T., Wang, T., Wu, C., Li, Y., Yang, C., Wu, W., Li, A., and Herbordt, M. (2021).
O3BNN-R: an out-of-order architecture for high-performance and regularized BNN in-
ference. IEEE Transactions on Parallel and Distributed Systems, 32(1):199–213. doi:
10.1109/TPDS.2020.3013637.
Geng, T., Wang, T., Wu, C., Yang, C., Li, A., Song, S., and Herbordt, M. (2019a). LP-
BNN: Ultra-low-latency BNN inference with layer parallelism. In 2019 IEEE 30th
International Conference on Application-specific Systems, Architectures and Processors,
volume 2160, pages 9–16. doi: 10.1109/ASAP.2019.00-43.
Geng, T., Wang, T., Wu, C., Yang, C., Wu, W., Li, A., and Herbordt, M. (2019b).
O3BNN: an out-of-order architecture for high-performance Binarized Neural Network
inference with fine-grained pruning. ACM International Conference on Supercomput-
ing, 2160:461–472. doi: 10.1145/ 3330345. 3330386.
Geng, T., Wu, C., Tan, C., Fang, B., Li, A., and Herbordt, M. (2020). CQNN: a CGRA-
based QNN framework. In 2020 IEEE High Performance Extreme Computing Confer-
ence, pages 1–8. doi: 10.1109/HPEC43674.2020.9286194.
George, A. D., Herbordt, M. C., Lam, H., Lawande, A. G., Sheng, J., and Yang, C. (2016).
Novo-G#: large-scale reconfigurable computing with direct and programmable intercon-
nects. In 2016 IEEE High Performance Extreme Computing Conference (HPEC), pages
1–7. doi: 10.1109/HPEC.2016.7761639.
Ghasemzadeh, M., Samragh, M., and Koushanfar, F. (2018). ReBNet: residual bi-
narized neural network. In 2018 IEEE 26th Annual International Symposium on
Field-Programmable Custom Computing Machines (FCCM), pages 57–64. doi:
10.1109/FCCM.2018.00018.
Gokhale, M. and Graham, P. (2005). Reconfigurable Computing: Accelerating Computa-
tion with Field Programmable Gate Arrays. Springer.
221
Gonzalez, J. E., Low, Y., Gu, H., Bickson, D., and Guestrin, C. (2012). PowerGraph:
distributed graph-parallel computation on natural graphs. In Proceedings of the 10th
USENIX Conference on Operating Systems Design and Implementation, pages 17–30.
USENIX Association. doi: 10.5555/2387880.2387883.
Gori, M., Monfardini, G., and Scarselli, F. (2005). A new model for learning in graph
domains. In Proceedings of 2005 IEEE International Joint Conference on Neural Net-
works, volume 2, pages 729–734 vol. 2. doi: 10.1109/IJCNN.2005.1555942.
Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A.,
Jia, Y., and He, K. (2017). Accurate, large minibatch SGD: training ImageNet in 1 hour.
Computing Research Repository (CoRR) in arXiv, abs/1706.02677.
Greathouse, J. L. and Daga, M. (2014). Efficient sparse matrix-vector multiplication on
GPUs using the CSR storage format. In Proceedings of the International Conference for
High Performance Computing, Networking, Storage and Analysis, pages 769–780. doi:
10.1109/SC.2014.68.
Guan, Y., Liang, H., Xu, N., Wang, W., Shi, S., Chen, X., Sun, G., Zhang, W., and Cong,
J. (2017). FP-DNN: An automated framework for mapping deep neural networks onto
FPGAs with RTL-HLS hybrid templates. In 2017 IEEE 25th Annual International Sym-
posium on Field-Programmable Custom Computing Machines, pages 152–159. IEEE.
doi: 10.1109/FCCM.2017.25.
Haghi, P., Geng, T., Guo, A., Wang, T., and Herbordt, M. (2020a). FP-AMG: FPGA-based
acceleration framework for Algebraic Multigrid Solvers. In 28th IEEE International
Symposium on Field-Programmable Custom Computing Machines. DOI: 10.1109/
FCCM48280.2020.00028.
Haghi, P., Geng, T., Guo, A., Wang, T., and Herbordt, M. (2020b). A reconfig-
urable compute-in-the-network FPGA assistant for high-level collective support with
distributed matrix multiply case study. In IEEE Conference on Field Programmable
Technology.
Haghi, P., Guo, A., Xiong, Q., Patel, R., Yang, C., Geng, T., Broaddus, J., Marshall, R.,
Skjellum, A., and Herbordt, M. (2020c). FPGAs in the network and novel communicator
support accelerate MPI collectives. In IEEE High Performance Extreme Computing
Conference.
Ham, T. J., Wu, L., Sundaram, N., Satish, N., and Martonosi, M. (2016). Graphicionado:
a high-performance and energy-efficient accelerator for graph analytics. In 2016 49th
Annual IEEE/ACM International Symposium on Microarchitecture, pages 1–13. IEEE.
doi: 10.1109/MICRO.2016.7783759.
222
Hamilton, W., Ying, Z., and Leskovec, J. (2017). Inductive representation learning on large
graphs. In Proceedings of the 31st International Conference on Neural Information
Processing Systems, pages 1024–1034. doi:10.5555/3294771.3294869.
Han, S., Kang, J., Mao, H., Hu, Y., Li, X., Li, Y., Xie, D., Luo, H., Yao, S., Wang, Y.,
et al. (2017). ESE: Efficient speech recognition engine with sparse LSTM on FPGA. In
Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable
Gate Arrays, pages 75–84. ACM.
Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M. A., and Dally, W. J. (2016).
EIE: efficient inference engine on compressed deep neural network. In Proceedings
of the 43rd International Symposium on Computer Architecture, pages 243–254. IEEE.
doi: 10.1109/ISCA.2016.30.
Han, S., Mao, H., and Dally, W. J. (2015). Deep compression: Compressing deep neural
networks with pruning, trained quantization and huffman coding. Computing Research
Repository (CoRR) in arXiv, abs/1510.00149.
Hauck, S. and DeHon, A. (2008). Reconfigurable Computing: The Theory and Practice of
FPGA-Based Computing. Morgan Kaufmann.
He, Y., Peemen, M., Waeijen, L., Diken, E., Fiumara, M., Rauwerda, G., Corporaal, H.,
and Geng, T. (2016). A configurable SIMD architecture with explicit datapath for in-
telligent learning. In 2016 International Conference on Embedded Computer Systems:
Architectures, Modeling and Simulation (SAMOS), pages 156–163.
He, Y., Zhang, X., and Sun, J. (2017). channel pruning for accelerating very deep neural
networks. In 2017 IEEE International Conference on Computer Vision, pages 1398–
1406. doi: 10.1109/ICCV.2017.155.
Hegde, G. and Kapre, N. (2017). CaffePresso: accelerating convolutional networks on
embedded SoCs. ACM Transactions on Embedded Computing Systems, 17(1):1–26.
doi: 10.1145/3105925.
Henaff, M., Bruna, J., and LeCun, Y. (2015). Deep convolutional networks on graph-
structured data. Computing Research Repository (CoRR) in arXiv, abs/1506.05163.
Hennessy, J. L. and Patterson, D. A. (2011). Computer Architecture: A quantitative ap-
proach, 5th Edition. Morgan Kaufmann Publishers Inc.
Herbordt, M. (2019). Advancing OpenCL for FPGAs: Boosting performance with Intel
FPGA SDK for OpenCL technology. In The Parallel Universe, pages 17–32.
Herbordt, M., Gu, Y., VanCourt, T., Model, J., Sukhwani, B., and Chiu, M. (2008). Com-
puting models for FPGA-based accelerators with case studies in molecular modeling.
Computing in Science and Engineering, 10(6):35–45. doi: 10.1109/ MCSE.2008.143.
223
Herbordt, M., VanCourt, T., Gu, Y., Sukhwani, B., Conti, A., Model, J., and DiSabello, D.
(2007). Achieving high performance with FPGA-based computing. IEEE Computer,
40(3):42–49.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural computation,
9(8):1735–1780.
Hu, Y., Zhai, J., Li, D., Gong, Y., Zhu, Y., Liu, W., Su, L., and Jin, J. (2018). Bit-
Flow: exploiting vector parallelism for binary neural networks on CPU. In 2018 IEEE
International Parallel and Distributed Processing Symposium, pages 244–253. doi:
10.1109/IPDPS.2018.00034.
Huang, Y., Cheng, Y., Bapna, A., Firat, O., Chen, D., Chen, M., Lee, H., Ngiam, J., Le,
Q. V., Wu, Y., et al. (2019). GPipe: efficient training of giant neural networks using
pipeline parallelism. In Advances in Neural Information Processing Systems, pages
103–112.
Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., and Bengio, Y. (2016). Binarized
neural networks. In Proceedings of the 30th International Conference on Neural Infor-
mation Processing Systems, pages 4107–4115. doi: 10.5555/3157382.3157557.
Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., and Bengio, Y. (2017). Quan-
tized neural networks: Training neural networks with low precision weights and ac-
tivations. The Journal of Machine Learning Research, 18(1):6869–6898. doi:
10.5555/3122009.3242044.
Huo, Z., Gu, B., and Huang, H. (2018a). training neural networks using features replay.
In Proceedings of the 32nd International Conference on Neural Information Processing
Systems, pages 6660–6669. doi: 10.5555/3327757.3327772.
Huo, Z., Gu, B., Yang, Q., and Huang, H. (2018b). Decoupled parallel backpropaga-
tion with convergence guarantee. Proceedings of the 35th International Conference on
Machine Learning, pages 2098–2106.
Ji, S., Xu, W., Yang, M., and Yu, K. (2013). 3D convolutional neural networks for human
action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence,
35(1):221–231. doi: 10.1109/TPAMI.2012.59.
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S.,
and Darrell, T. (2014). Caffe: convolutional architecture for fast feature embedding. In
Proceedings of the 22nd ACM International Conference on Multimedia, pages 675–678.
doi: 10.1145/2647868.2654889.
Jia, Z., Lin, S., Qi, C. R., and Aiken, A. (2018). Exploring hidden dimensions in paralleliz-
ing convolutional neural networks. Computing Research Repository (CoRR) in arXiv,
abs/1802.04924.
224
Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia,
S., Boden, N., Borchers, A., Boyle, R., Cantin, P.-l., Chao, C., Clark, C., Coriell, J.,
Daley, M., Dau, M., Dean, J., Gelb, B., Ghaemmaghami, T. V., Gottipati, R., Gulland,
W., Hagmann, R., Ho, C. R., Hogberg, D., Hu, J., Hundt, R., Hurt, D., Ibarz, J., Jaffey,
A., Jaworski, A., Kaplan, A., Khaitan, H., Killebrew, D., Koch, A., Kumar, N., Lacy,
S., Laudon, J., Law, J., Le, D., Leary, C., Liu, Z., Lucke, K., Lundin, A., MacKean, G.,
Maggiore, A., Mahony, M., Miller, K., Nagarajan, R., Narayanaswami, R., Ni, R., Nix,
K., Norrie, T., Omernick, M., Penukonda, N., Phelps, A., Ross, J., Ross, M., Salek, A.,
Samadiani, E., Severn, C., Sizikov, G., Snelham, M., Souter, J., Steinberg, D., Swing, A.,
Tan, M., Thorson, G., Tian, B., Toma, H., Tuttle, E., Vasudevan, V., Walter, R., Wang,
W., Wilcox, E., and Yoon, D. H. (2017). In-datacenter performance analysis of a tensor
processing unit. Proceedings of the 44th Annual International Symposium on Computer
Architecture, page 1–12. doi: 10.1145/3140659.3080246.
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., and Fei-Fei, L. (2014).
Large-scale video classification with convolutional neural networks. In IEEE conference
on Computer Vision and Pattern Recognition, pages 1725–1732.
Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. (2016). On
large-batch training for deep learning: Generalization gap and sharp minima. Comput-
ing Research Repository (CoRR) in arXiv, abs/1609.04836.
Khoram, S. and Li, J. (2018). Adaptive quantization of neural networks. In International
Conference on Learning Representations.
Kim, D., Ahn, J., and Yoo, S. (2017). A novel zero weight/activation-aware hardware
architecture of convolutional neural network. In Design, Automation & Test in Europe
Conference & Exhibition (DATE), 2017, pages 1462–1467. IEEE.
Kipf, T. N. and Welling, M. (2016). Semi-supervised classification with graph convolu-
tional networks. Computing Research Repository (CoRR) in arXiv, abs/1609.02907.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification with deep
convolutional neural networks. In Advances in neural information processing systems,
pages 1097–1105.
Kung, H., McDanel, B., and Zhang, S. Q. (2019). Packing sparse convolutional neural
networks for efficient systolic array implementations: Column combining under joint
optimization. In Proceedings of the Twenty-Fourth International Conference on Archi-
tectural Support for Programming Languages and Operating Systems, pages 821–834.
ACM.
Kuon, I. and Rose, J. (2007). Measuring the Gap Between FPGAs and ASICs. IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Systems, 26(2):203–
215.
225
Kwon, H., Chatarasi, P., Pellauer, M., Parashar, A., Sarkar, V., and Krishna, T. (2019).
Understanding reuse, performance, and hardware cost of dnn dataflow: A data-centric
approach. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on
Microarchitecture, pages 754–768.
Kwon, H., Chatarasi, P., Sarkar, V., Krishna, T., Pellauer, M., and Parashar, A. (2020).
Maestro: A data-centric approach to understand reuse, performance, and hardware cost
of dnn mappings. IEEE Micro, 40(3):20–29.
Lam, M. (1988). Software pipelining: An effective scheduling technique for vliw ma-
chines. In Proceedings of the ACM SIGPLAN 1988 conference on Programming Lan-
guage design and Implementation, pages 318–328.
Lam, M., Yedidia, Z., Banbury, C., and Reddi, V. J. (2020). Quantized neural network
inference with precision batching. Computing Research Repository (CoRR) in arXiv,
abs/2003.00822.
Latapy, M. (2008). Main-memory triangle computations for very large (sparse (power-
law)) graphs. Theoretical computer science, 407(1-3):458–473.
LeaderGPU (2018a). Tensorflow Alexnet benchmark. https://www.leadergpu.com/
articles/428-tensorflow-alexnet-benchmark. [Online; accessed 19-July-2018].
LeaderGPU (2018b). Tensorflow VGG-16 benchmark. https://www.leadergpu.com/
articles/430-tensorflow-vgg16-benchmark. [Online; accessed 19-July-2018].
Leonard, R., Doddington, G., and Consortium, L. D. (1993). TIDIGITS. Linguistic Data
Consortium.
Li, A., Geng, T., Wang, T., Herbordt, M., Song, S., and Barker, K. (2019a). BSTC: A novel
binarized-soft-tensor-core design for accelerating bit-based approximated neural nets. In
International Conference for High Performance Computing, Networking, Storage and
Analysis (SC). doi: 10.1145/ 3295500.3356169.
Li, M., Andersen, D. G., Park, J. W., Smola, A. J., Ahmed, A., Josifovski, V., Long, J.,
Shekita, E. J., and Su, B.-Y. (2014). Scaling distributed machine learning with the
parameter server. In 11th USENIX Symposium on Operating Systems Design and Imple-
mentation (OSDI 14), pages 583–598.
Li, Y., Tarlow, D., Brockschmidt, M., and Zemel, R. (2015). Gated graph sequence neural
networks. Computing Research Repository (CoRR) in arXiv, abs/1511.05493.
Li, Z., Ding, C., Wang, S., Wen, W., Zhuo, Y., Liu, C., Qiu, Q., Xu, W., Lin, X., Qian,
X., et al. (2019b). E-RNN: Design optimization for efficient recurrent neural networks
in FPGAs. In 2019 IEEE International Symposium on High Performance Computer
Architecture (HPCA), pages 69–80. IEEE.
226
Lian, R. L. (2016). A framework for FPGA-based acceleration of neural network inference
with limited numerical precision via high-level synthesis with streaming functionality.
PhD thesis, University of Toronto.
Lian, X., Zhang, C., Zhang, H., Hsieh, C.-J., Zhang, W., and Liu, J. (2017). Can decen-
tralized algorithms outperform centralized algorithms? a case study for decentralized
parallel stochastic gradient descent. In Advances in Neural Information Processing Sys-
tems, pages 5330–5340.
Liang, S., Yin, S., Liu, L., Luk, W., and Wei, S. (2018). FP-BNN: Binarized neural network
on FPGA. Neurocomputing, 275:1072–1086.
Lin, X., Zhao, C., and Pan, W. (2017). Towards accurate binary convolutional neural
network. In Advances in Neural Information Processing Systems, pages 345–353.
Liu, W. and Vinter, B. (2014). An efficient GPU general sparse matrix-matrix multipli-
cation for irregular data. In 2014 IEEE 28th International Parallel and Distributed
Processing Symposium, pages 370–381. IEEE.
Liu, Y., Zhang, N., Wu, D., Botterud, A., Yao, R., and Kang, C. (2020). Guiding cascading
failure search with interpretable graph convolutional network. Computing Research
Repository (CoRR) in arXiv, abs/2001.11553.
Lu, L., Liang, Y., Xiao, Q., and Yan, S. (2017). Evaluating fast algorithms for convolu-
tional neural networks on FPGAs. In IEEE 25th Annual International Symposium on
Field-Programmable Custom Computing Machines, pages 101–108.
Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., and Potts, C. (2011). Learning
word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of
the association for computational linguistics: Human language technologies-volume 1,
pages 142–150. Association for Computational Linguistics.
Mao, H., Han, S., Pool, J., Li, W., Liu, X., Wang, Y., and Dally, W. J. (2017). Exploring
the regularity of sparse structure in convolutional neural networks. Computing Research
Repository (CoRR) in arXiv, abs/1705.08922.
Marcus, M., Santorini, B., and Marcinkiewicz, M. A. (1993). Building a large annotated
corpus of english: The penn treebank. Computational Linguistics, 19(2):313–330. doi:
10.5555/972470.972475.
Micheli, A. (2009). Neural network for graphs: A contextual constructive approach. IEEE
Transactions on Neural Networks, 20(3):498–511.
Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen, E., Garcia, D., Ginsburg, B.,
Houston, M., Kuchaiev, O., Venkatesh, G., et al. (2017). Mixed precision training.
Computing Research Repository (CoRR) in arXiv, abs/1710.03740.
227
Mikolov, T., Karafiát, M., Burget, L., Černocký, J., and Khudanpur, S. (2010). Recurrent
neural network based language model. In Eleventh Annual Conference of the International
Speech Communication Association.
Mirhoseini, A., Pham, H., Le, Q. V., Steiner, B., Larsen, R., Zhou, Y., Kumar, N., Norouzi,
M., Bengio, S., and Dean, J. (2017). Device placement optimization with reinforcement
learning. In 34th International Conference on Machine Learning, pages 2430–2439.
Miyashita, D., Lee, E. H., and Murmann, B. (2016). Convolutional neural networks us-
ing logarithmic data representation. Computing Research Repository (CoRR) in arXiv,
abs/1603.01025.
Molchanov, P., Tyree, S., Karras, T., Aila, T., and Kautz, J. (2016). Pruning convolutional
neural networks for resource efficient transfer learning. Computing Research Repository
(CoRR) in arXiv, abs/1611.06440.
Moss, D. J., Nurvitadhi, E., Sim, J., Mishra, A., Marr, D., Subhaschandra, S., and Leong,
P. H. (2017). High performance binary neural networks on the Xeon+FPGA™ platform.
In 2017 27th International Conference on Field Programmable Logic and Applications
(FPL), pages 1–4. IEEE.
Narang, S., Undersander, E., and Diamos, G. (2017). Block-sparse recurrent neural net-
works. Computing Research Repository (CoRR) in arXiv, abs/1711.02782.
Narayanan, D., Harlap, A., Phanishayee, A., Seshadri, V., Devanur, N. R., Ganger, G. R.,
Gibbons, P. B., and Zaharia, M. (2019). PipeDream: generalized pipeline parallelism
for DNN training. In 27th ACM Symposium on Operating Systems Principles, pages
1–15.
Nguyen, H.-T., Ngo, Q.-D., and Le, V.-H. (2018). IoT botnet detection approach based on
PSI graph and DGCNN classifier. In 2018 IEEE International Conference on Informa-
tion Communication and Signal Processing (ICICSP), pages 118–122. IEEE.
Nurvitadhi, E., Sheffield, D., Sim, J., Mishra, A., Venkatesh, G., and Marr, D. (2016).
Accelerating binarized neural networks: comparison of FPGA, CPU, GPU, and ASIC.
In International Conference on Field-Programmable Technology, pages 77–84.
Ovtcharov, K., Ruwase, O., Kim, J.-Y., Fowers, J., Strauss, K., and Chung, E. S. (2015).
Accelerating deep convolutional neural networks using specialized hardware. Microsoft
Research Whitepaper 2015, 2(11).
Ozdal, M. M., Yesil, S., Kim, T., Ayupov, A., Greth, J., Burns, S., and Ozturk, O. (2016).
Energy efficient architecture for graph analytics accelerators. ACM SIGARCH Computer
Architecture News, 44(3):166–177.
Pal, S., Beaumont, J., Park, D.-H., Amarnath, A., Feng, S., Chakrabarti, C., Kim, H.-S.,
Blaauw, D., Mudge, T., and Dreslinski, R. (2018). OuterSPACE: An outer product based
sparse matrix multiplication accelerator. In 2018 IEEE International Symposium on
High Performance Computer Architecture (HPCA), pages 724–736. IEEE.
Pang, B. and Lee, L. (2005). Seeing stars: Exploiting class relationships for sentiment
categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting
of the Association for Computational Linguistics, pages 115–124. Association for
Computational Linguistics.
Parashar, A., Rhu, M., Mukkara, A., Puglielli, A., Venkatesan, R., Khailany, B., Emer,
J., Keckler, S. W., and Dally, W. J. (2017). SCNN: an accelerator for compressed-
sparse convolutional neural networks. In 2017 ACM/IEEE 44th Annual International
Symposium on Computer Architecture (ISCA), pages 27–40. IEEE.
Park, E., Kim, D., and Yoo, S. (2018a). Energy-efficient neural network accelerator based
on outlier-aware low-precision computation. In 2018 ACM/IEEE 45th Annual Interna-
tional Symposium on Computer Architecture (ISCA), pages 688–698. IEEE.
Park, E., Yoo, S., and Vajda, P. (2018b). Value-aware quantization for training and in-
ference of neural networks. In Proceedings of the European Conference on Computer
Vision (ECCV), pages 580–595.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin,
Z., Gimelshein, N., Antiga, L., et al. (2019). PyTorch: An imperative style,
high-performance deep learning library. In Advances in Neural Information Processing
Systems, pages 8024–8035.
Patel, R., Wolfe, P.-F., Munafo, R., Varia, M., and Herbordt, M. (2020). Arithmetic and
Boolean Secret Sharing MPC on FPGAs in the Data Center. In HPExC. doi: TBD.
Plessl, C. (2018). Bringing FPGAs to HPC Production Systems and Codes. In H2RC’18
workshop at Supercomputing (SC’18). doi: 10.13140/RG.2.2.34327.42407.
Putnam, A. (2014). A Reconfigurable Fabric for Accelerating Large-Scale Datacenter
Services. In International Symposium on Computer Architecture, pages 13–24. doi:
10.1109/ISCA.2014.6853195.
Qin, E., Samajdar, A., Kwon, H., Nadella, V., Srinivasan, S., Das, D., Kaul, B., and
Krishna, T. (2020). SIGMA: A sparse and irregular GEMM accelerator with flexible
interconnects for DNN training. In 2020 IEEE International Symposium on High
Performance Computer Architecture (HPCA), pages 58–70. IEEE.
Rastegari, M., Ordonez, V., Redmon, J., and Farhadi, A. (2016). XNOR-Net: ImageNet
classification using binary convolutional neural networks. In European Conference on
Computer Vision, pages 525–542.
Ravanelli, M., Brakel, P., Omologo, M., and Bengio, Y. (2018). Light gated recurrent
units for speech recognition. IEEE Transactions on Emerging Topics in Computational
Intelligence, 2(2):92–102.
Sak, H., Senior, A., and Beaufays, F. (2014). Long short-term memory based recurrent
neural network architectures for large vocabulary speech recognition. Computing Re-
search Repository (CoRR) in arXiv, abs/1402.1128.
Sanaullah, A., Yang, C., Alexeev, Y., Yoshii, K., and Herbordt, M. (2018a). Application
aware tuning of reconfigurable multi-layer perceptron architectures. In IEEE High Per-
formance Extreme Computing Conference.
Sanaullah, A. and Herbordt, M. (2018a). An empirically guided optimization framework
for FPGA OpenCL. In 2018 International Conference on Field Programmable Technol-
ogy (FPT), pages 46–53. doi: 10.1109/FPT.2018.00018.
Sanaullah, A. and Herbordt, M. (2018b). Unlocking performance-programmability by
penetrating the Intel FPGA OpenCL toolflow. In 2018 IEEE High Performance extreme
Computing Conference (HPEC). doi: 10.1109/HPEC.2018.8547646.
Sanaullah, A., Sachdeva, V., and Herbordt, M. (2018b). SimBSP: Enabling RTL Simu-
lation for Intel FPGA OpenCL Kernels. In Proc. Heterogeneous High Performance
Reconfigurable Computing.
Sanaullah, A., Yang, C., Alexeev, Y., Yoshii, K., and Herbordt, M. (2018c). Real-time
data analysis for medical diagnosis using FPGA accelerated neural networks. BMC
Bioinformatics, 19 Supplement 18. doi: 10.1186/s12859-018-2505-7.
Sanaullah, A., Yang, C., Alexeev, Y., Yoshii, K., and Herbordt, M. C. (2018d). Application
aware tuning of reconfigurable multi-layer perceptron architectures. In 2018 IEEE High
Performance Extreme Computing Conference, pages 1–9.
Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and Monfardini, G. (2008). The
graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80.
Sheng, J., Xiong, Q., Yang, C., and Herbordt, M. (2017a). Collective communication on
FPGA clusters with static scheduling. ACM SIGARCH Computer Architecture News,
44(4). doi: 10.1145/3039902.3039904.
Sheng, J., Yang, C., Caulfield, A., Papamichael, M., and Herbordt, M. (2017b). HPC
on FPGA clouds: 3D FFTs and implications for Molecular Dynamics. In 27th Inter-
national Conference on Field Programmable Logic and Applications. doi: 10.23919/
FPL.2017.8056853.
Sheng, J., Yang, C., and Herbordt, M. (2015). Towards low-latency communication on
FPGA clusters with 3D FFT case study. In International Symposium on Highly Efficient
Accelerators and Reconfigurable Technologies.
https://pdfs.semanticscholar.org/832d/c69145f5ba0ed6a951583201b1b20dd2096e.pdf.
Sheng, J., Yang, C., and Herbordt, M. (2016). Application-aware collective communication
on FPGA clusters. In IEEE 24th Annual International Symposium on
Field-Programmable Custom Computing Machines (FCCM). doi: 10.1109/FCCM.2016.55.
Sheng, J., Yang, C., and Herbordt, M. C. (2018a). High performance communication on
reconfigurable clusters. In 2018 28th International Conference on Field Programmable
Logic and Applications.
Sheng, J., Yang, C., Wang, T., and Herbordt, M. (2018b). High performance dynamic
communication on reconfigurable clusters. In 2018 IEEE 26th Annual International
Symposium on Field-Programmable Custom Computing Machines.
Shi, R., Dong, P., Geng, T., Ding, Y., Ma, X., So, H. K.-H., Herbordt, M., Li, A., and Wang,
Y. (2020). CSB-RNN: A faster-than-realtime RNN acceleration framework with com-
pressed structured blocks. In Proceedings of the 34th ACM International Conference on
Supercomputing. doi: 10.1145/3392717.3392749.
Simonyan, K. and Zisserman, A. (2015). Very deep convolutional networks for large-scale
image recognition. In International Conference on Learning Representations.
Song, L., Zhuo, Y., Qian, X., Li, H., and Chen, Y. (2018). GraphR: Accelerating graph pro-
cessing using ReRAM. In 2018 IEEE International Symposium on High Performance
Computer Architecture (HPCA), pages 531–543. IEEE.
Song, M., Zhao, J., Hu, Y., Zhang, J., and Li, T. (2018). Prediction based execution on
deep neural networks. In 2018 ACM/IEEE 45th Annual International Symposium on
Computer Architecture (ISCA), pages 752–763. doi: 10.1109/ISCA.2018.00068.
Sukhwani, B. and Herbordt, M. (2008). Acceleration of a production rigid molecule docking
code. In 2008 International Conference on Field Programmable Logic and Applications,
pages 341–346. doi: 10.1109/FPL.2008.4629955.
Sukhwani, B. and Herbordt, M. (2009). GPU acceleration of a production molecular
docking code. In Workshop on General Purpose Processing on Graphics Processing
Units (GPGPU).
Sun, X., Xu, Z., Meng, N., Lam, E. Y., and So, H. K.-H. (2016). Data-driven light field
depth estimation using deep convolutional neural networks. In 2016 International Joint
Conference on Neural Networks (IJCNN), pages 367–374. IEEE.
Tang, W., Hua, G., and Wang, L. (2017). How to train a compact binary neural network
with high accuracy? In Thirty-First AAAI Conference on Artificial Intelligence.
Tomasulo, R. M. (1967). An efficient algorithm for exploiting multiple arithmetic units.
IBM Journal of Research and Development, 11(1):25–33.
Umuroglu, Y., Fraser, N. J., Gambardella, G., Blott, M., Leong, P., Jahre, M., and Vissers,
K. (2017). FINN: A framework for fast, scalable binarized neural network inference.
In 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays,
pages 65–74.
VanCourt, T., Gu, Y., and Herbordt, M. (2004). FPGA acceleration of rigid molecule inter-
actions. In 12th Annual IEEE Symposium on Field-Programmable Custom Computing
Machines, pages 300–301. doi: 10.1109/FCCM.2004.33.
VanCourt, T. and Herbordt, M. (2004). Families of FPGA-based algorithms for ap-
proximate string matching. In Proceedings. 15th IEEE International Conference on
Application-Specific Systems, Architectures and Processors, 2004., pages 354–364. doi:
10.1109/ASAP.2004.1342484.
VanCourt, T. and Herbordt, M. (2005a). LAMP: A tool suite for families of FPGA-based
application accelerators. In International Conference on Field Programmable Logic and
Applications, 2005. doi: 10.1109/FPL.2005.1515797.
VanCourt, T. and Herbordt, M. (2005b). Three dimensional template correlation: Ob-
ject recognition in 3D voxel data. In Seventh International Workshop on Computer
Architecture for Machine Perception (CAMP’05), pages 153–158. doi: 10.1109/
CAMP.2005.52.
VanCourt, T. and Herbordt, M. (2006a). Application-dependent memory interleaving en-
ables high performance in FPGA-based grid computations. In IEEE Conference on Field
Programmable Logic and Applications, pages 395–401. doi: 10.1109/FCCM.2006.25.
VanCourt, T. and Herbordt, M. (2006b). Rigid molecule docking: FPGA reconfiguration
for alternative force laws. Journal on Applied Signal Processing, v2006:1–10. doi:
10.1155/ASP/2006/97950.
VanCourt, T. and Herbordt, M. (2006c). Sizing of processing arrays for FPGA-based
computation. In 2006 International Conference on Field Programmable Logic and
Applications, pages 755–760. doi: 10.1109/FPL.2006.311307.
VanCourt, T. and Herbordt, M. (2007). Families of FPGA-based accelerators for ap-
proximate string matching. Microprocessors and Microsystems, 31(2):135–145. doi:
10.1016/j.micpro.2006.04.001.
VanCourt, T. and Herbordt, M. (2009). Elements of high performance reconfigurable
computing. In Zelkowitz, M., editor, Advances in Computers, volume 75, pages 113–
157. Elsevier. doi: 10.1016/S0065-2458(08)00802-4.
Vasilache, N., Zinenko, O., Theodoridis, T., Goyal, P., DeVito, Z., Moses, W. S., Ver-
doolaege, S., Adams, A., and Cohen, A. (2018). Tensor comprehensions: Framework-
agnostic high-performance machine learning abstractions. Computing Research Repos-
itory (CoRR) in arXiv, abs/1802.04730.
Venkataramani, S., Ranjan, A., Banerjee, S., Das, D., Avancha, S., Jagannathan, A., Durg,
A., Nagaraj, D., Kaul, B., Dubey, P., et al. (2017). ScaleDeep: a scalable compute ar-
chitecture for learning and evaluating deep networks. SIGARCH Computer Architecture
News, 45(2):13–26.
Wang, J., Lou, Q., Zhang, X., Zhu, C., Lin, Y., and Chen, D. (2018). Design flow of
accelerating hybrid extremely low bit-width neural network in embedded FPGA. In 2018
28th International Conference on Field Programmable Logic and Applications (FPL),
pages 163–1636. doi: 10.1109/FPL.2018.00035.
Wang, K., Liu, Z., Lin, Y., Lin, J., and Han, S. (2019a). HAQ: hardware-aware automated
quantization with mixed precision. In IEEE Conference on Computer Vision and Pattern
Recognition, pages 8612–8620.
Wang, S., Li, Z., Ding, C., Yuan, B., Qiu, Q., Wang, Y., and Liang, Y. (2018a). C-LSTM:
Enabling efficient LSTM using structured compression techniques on FPGAs. In Pro-
ceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable
Gate Arrays, pages 11–20. ACM.
Wang, S., Lin, P., Hu, R., Wang, H., He, J., Huang, Q., and Chang, S. (2019b). Accelera-
tion of LSTM with structured pruning method on FPGA. IEEE Access, 7:62930–62937.
Wang, T., Geng, T., Jin, X., and Herbordt, M. (2019c). Accelerating AP3M-based com-
putational astrophysics simulations with reconfigurable clusters. In 2019 IEEE 30th
International Conference on Application-specific Systems, Architectures and Processors
(ASAP), pages 181–184. doi: 10.1109/ASAP.2019.000-5.
Wang, T., Geng, T., Jin, X., and Herbordt, M. (2019d). FP-AMR: a reconfigurable fabric
framework for block-structured Adaptive Mesh Refinement Applications. In 2019 IEEE
27th Annual International Symposium on Field-Programmable Custom Computing
Machines (FCCM), pages 245–253. doi: 10.1109/FCCM.2019.00040.
Wang, T., Geng, T., Li, A., Jin, X., and Herbordt, M. (2020). FPDeep: Scalable accel-
eration of CNN training on deeply-pipelined FPGA clusters. IEEE Transactions on
Computers, 69(08):1143–1158. doi: 10.1109/TC.2020.3000118.
Wang, T., Li, A., Geng, T., and Herbordt, M. (2018b). Energy efficiency of realtime
reconfigurable caches on FPGAs. In International Conference for High Performance
Computing, Networking, Storage and Analysis (SC18).
Wei, X., Yu, C. H., Zhang, P., Chen, Y., Wang, Y., Hu, H., Liang, Y., and Cong, J.
(2017). Automated systolic array architecture synthesis for high throughput CNN in-
ference on FPGAs. In Proceedings of the 54th Annual Design Automation Conference
2017, page 29.
Wen, W., Chen, Y., Li, H., He, Y., Rajbhandari, S., Zhang, M., Wang, W., Liu, F., and
Hu, B. (2018). Learning intrinsic sparse structures within long short-term memory. In
International Conference on Learning Representations.
Wen, W., Wu, C., Wang, Y., Chen, Y., and Li, H. (2016). Learning structured sparsity in
deep neural networks. In Advances in Neural Information Processing Systems, pages
2074–2082.
Weston, J., Bordes, A., Chopra, S., Rush, A. M., van Merriënboer, B., Joulin, A., and
Mikolov, T. (2015). Towards AI-complete question answering: A set of prerequisite toy
tasks. Computing Research Repository (CoRR) in arXiv, abs/1502.05698.
Winter, P. and Smith, J. M. (1992). Path-distance heuristics for the Steiner problem in
undirected networks. Algorithmica, 7(1-6):309–327.
Wolfe, P.-F., Patel, R., Munafo, R., Varia, M., and Herbordt, M. (2020). Secret sharing
MPC on FPGAs in the datacenter. In IEEE Conference on Field Programmable Logic
and Applications.
Wu, C., Geng, T., Sachdeva, V., Sherman, W., and Herbordt, M. (2020a). A
communication-efficient multi-chip design for range-limited Molecular Dynamics. In
2020 IEEE High Performance extreme Computing Conference (HPEC).
Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao,
Y., Gao, Q., Macherey, K., et al. (2016). Google’s neural machine translation sys-
tem: Bridging the gap between human and machine translation. Computing Research
Repository (CoRR) in arXiv, abs/1609.08144.
Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., and Yu, P. S. (2020b). A comprehensive
survey on graph neural networks. IEEE Transactions on Neural Networks and Learning
Systems.
Xiang, Z., Wang, T., Geng, T., Xiang, T., Jin, X., and Herbordt, M. (2018). Soft-core,
multiple-lane, FPGA-based ADCs for a liquid helium environment. In 2018 IEEE
High Performance extreme Computing Conference (HPEC), pages 1–6. doi: 10.1109/
HPEC.2018.8547550.
Xie, C., Yan, L., Li, W.-J., and Zhang, Z. (2014). Distributed power-law graph computing:
Theoretical and empirical analysis. In Advances in Neural Information Processing
Systems, pages 1673–1681.
Xie, T. and Grossman, J. C. (2018). Crystal graph convolutional neural networks for an
accurate and interpretable prediction of material properties. Physical Review Letters,
120(14):145301.
Xiong, Q., Yang, C., Haghi, P., Skjellum, A., and Herbordt, M. (2020). Accelerating
MPI collectives with FPGAs in the network and novel communicator support. In 2020
IEEE 28th Annual International Symposium on Field-Programmable Custom Computing
Machines (FCCM).
Xiong, Q., Yang, C., Patel, R., Geng, T., Skjellum, A., and Herbordt, M. (2019). GhostSZ:
A transparent SZ lossy compression framework with FPGAs. In 2019 IEEE 27th An-
nual International Symposium on Field-Programmable Custom Computing Machines
(FCCM), pages 258–266. doi: 10.1109/FCCM.2019.00042.
Xu, K., Hu, W., Leskovec, J., and Jegelka, S. (2018). How powerful are graph neural
networks? Computing Research Repository (CoRR) in arXiv, abs/1810.00826.
Yan, M., Deng, L., Hu, X., Liang, L., Feng, Y., Ye, X., Zhang, Z., Fan, D., and Xie, Y.
(2020). HyGCN: A GCN Accelerator with Hybrid Architecture. Computing Research
Repository (CoRR) in arXiv, abs/2001.02514.
Yang, C., Geng, T., Wang, T., Patel, R., Xiong, Q., Sanaullah, A., Lin, C., Sachdeva, V.,
Sherman, W., and Herbordt, M. (2019a). Fully Integrated FPGA Molecular Dynamics
Simulations. In International Conference for High Performance Computing, Networking,
Storage and Analysis, pages 1–31. doi: 10.1145/3295500.3356179.
Yang, C., Geng, T., Wang, T., Sheng, J., Lin, C., Sachdeva, V., Sherman, W., and Herbordt,
M. (2019b). Molecular Dynamics range-limited force evaluation optimized for FPGA.
In 2019 IEEE 30th International Conference on Application-specific Systems, Architectures
and Processors (ASAP), pages 263–271. doi: 10.1109/ASAP.2019.00016.
Yang, C., Sheng, J., Patel, R., Sanaullah, A., Sachdeva, V., and Herbordt, M. (2017a).
OpenCL for HPC with FPGAs: Case study in molecular electrostatics. In 2017 IEEE
High Performance Extreme Computing Conference (HPEC), pages 1–8. doi: 10.1109/
HPEC.2017.8091078.
Yang, H. (2019). AliGraph: A comprehensive graph neural network platform. In
Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery
& Data Mining, pages 3165–3166. ACM.
Yang, T.-J., Chen, Y.-H., and Sze, V. (2017b). Designing energy-efficient convolutional
neural networks using energy-aware pruning. In IEEE Conference on Computer Vision
and Pattern Recognition, pages 5687–5695.
You, J., Ying, R., Ren, X., Hamilton, W. L., and Leskovec, J. (2018). GraphRNN: A
deep generative model for graphs. Computing Research Repository (CoRR) in arXiv,
abs/1802.08773.
Yun, S., Jeong, M., Kim, R., Kang, J., and Kim, H. J. (2019). Graph transformer networks.
In Advances in Neural Information Processing Systems, pages 11960–11970.
Zhang, C., Wu, D., Sun, J., Sun, G., Luo, G., and Cong, J. (2016a). Energy-efficient
CNN implementation on a deeply pipelined FPGA cluster. In Proceedings of the 2016
International Symposium on Low Power Electronics and Design, pages 326–331.
Zhang, M., Zhuo, Y., Wang, C., Gao, M., Wu, Y., Chen, K., Kozyrakis, C., and Qian,
X. (2018). GraphP: Reducing communication for PIM-based graph processing with
efficient data partition. In 2018 IEEE International Symposium on High Performance
Computer Architecture (HPCA), pages 544–557. IEEE.
Zhang, S., Du, Z., Zhang, L., Lan, H., Liu, S., Li, L., Guo, Q., Chen, T., and Chen, Y.
(2016b). Cambricon-X: An accelerator for sparse neural networks. In The 49th Annual
IEEE/ACM International Symposium on Microarchitecture, page 20. IEEE Press.
Zhao, R., Niu, X., Wu, Y., Luk, W., and Liu, Q. (2017a). Optimizing CNN-based object
detection algorithms on embedded FPGA platforms. In International Symposium on
Applied Reconfigurable Computing, pages 255–267. Springer.
Zhao, R., Song, W., Zhang, W., Xing, T., Lin, J.-H., Srivastava, M., Gupta, R., and
Zhang, Z. (2017b). Accelerating binarized convolutional neural networks with software-
programmable FPGAs. In 2017 ACM/SIGDA International Symposium on Field-
Programmable Gate Arrays, pages 15–24.
Zhao, W., Fu, H., Luk, W., Yu, T., Wang, S., Feng, B., Ma, Y., and Yang, G. (2016). F-
CNN: An FPGA-based framework for training convolutional neural networks. In 2016
IEEE 27th International Conference on Application-specific Systems, Architectures and
Processors (ASAP), pages 107–114. IEEE.
Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., and Zou, Y. (2016). DoReFa-Net: Training
low bitwidth convolutional neural networks with low bitwidth gradients. Computing
Research Repository (CoRR) in arXiv, abs/1606.06160.
Zhuang, C. and Ma, Q. (2018). Dual graph convolutional networks for graph-based semi-
supervised classification. In Proceedings of the 2018 World Wide Web Conference,
pages 499–508.
Zhuo, L. and Prasanna, V. K. (2005). Sparse matrix-vector multiplication on FPGAs.
In Proceedings of the 2005 ACM/SIGDA 13th International Symposium on
Field-Programmable Gate Arrays, pages 63–74. ACM.
Zitnik, M., Agrawal, M., and Leskovec, J. (2018). Modeling polypharmacy side effects
with graph convolutional networks. Bioinformatics, 34(13):i457–i466.
CURRICULUM VITAE
