DeepPump: Multi-Pumping Deep Neural Network
Ruizhe Zhao, Tim Todman, Wayne Luk, Xinyu Niu
Department of Computing, Imperial College London, United Kingdom
{ruizhe.zhao15, timothy.todman, w.luk, niu.xinyu10}@imperial.ac.uk
I. INTRODUCTION
This paper presents DeepPump, a novel approach for generating
and optimising hardware designs of deep Convolutional Neural
Networks (CNNs) with multi-pumping on FPGA platforms.
Multi-pumping [1] is a promising technique for saving hardware
resources by replacing M parallel units with one unit clocked
at M times the global clock rate. DeepPump aims to apply
multi-pumping automatically when generating hardware designs
for CNNs. It has three components: a parameterised CNN
accelerator architecture that supports multi-pumping, a design
model for trade-off analysis related to multi-pumping, and an
optimisation flow for improving the architecture based on the
design model.
II. DEEPPUMP FRAMEWORK
DeepPump contains a multi-pumped streaming architecture, in
which an input stream of feature maps is buffered and
computed, and the computation units are multi-pumped.
Multi-pumping works effectively in this architecture because
it contains many parallel computing units that can achieve
high maximum clock frequencies. We characterise this
architecture with three parameters: N is the number of
parallel blocks before multi-pumping, M is the multi-pumping
factor, and F is the global clock frequency. We also devise a
parameterised implementation of this architecture, described
as follows. First, the streaming interface of the multi-pumped
blocks, i.e. the width and frequency of streams, remains the
same. Second, the number of parallel blocks is reduced by a
factor of M, and both their clock frequency and their number
of cycles are increased by a factor of M. Third, additional
logic that controls the behaviour at each multi-pumped cycle,
such as multiplexers and counters, is attached to the original
design. In this way, we build a parameterised multi-pumped
architecture.
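The three transformation rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Design` record, its field names, the example numbers and the requirement that M divides the block count are hypothetical and do not come from the DeepPump implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Design:
    """Parameters of one group of computation blocks (illustrative names)."""
    blocks: int          # number of parallel blocks
    block_clock: float   # clock frequency of each block, in MHz
    cycles: int          # block cycles needed for a fixed workload
    stream_clock: float  # clock frequency of the streaming interface, in MHz


def multi_pump(base: Design, m: int) -> Design:
    """Apply multi-pumping factor m to a design with N parallel blocks."""
    # Assumption of this sketch: m must divide the block count exactly.
    if m < 1 or base.blocks % m != 0:
        raise ValueError("m must be a positive divisor of the block count")
    return Design(
        blocks=base.blocks // m,           # rule 2: N / M blocks remain
        block_clock=base.block_clock * m,  # rule 2: blocks clocked M times faster
        cycles=base.cycles * m,            # rule 2: M times as many block cycles
        stream_clock=base.stream_clock,    # rule 1: interface is unchanged
    )


def run_time_us(d: Design) -> float:
    """Workload time in microseconds: block cycles divided by clock in MHz."""
    return d.cycles / d.block_clock


# Hypothetical numbers chosen only to exercise the function.
base = Design(blocks=8, block_clock=150.0, cycles=3000, stream_clock=150.0)
pumped = multi_pump(base, 2)
print(pumped)
print(run_time_us(base), run_time_us(pumped))
```

In this sketch the run time is the same before and after the transformation, because the cycle count and the block clock scale by the same factor. The control logic of the third rule (multiplexers and counters) is not modelled here.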
Based on this architecture, we derive a design model for
trade-off analysis when applying multi-pumping, which predicts
latency, resource usage and power consumption. Latency
(O(NF)) depends only on N and F; multi-pumping normally does
not affect speed. Resource usage (O(N/M) + O(NM)) has
components that are inversely and directly proportional to M,
respectively. Power consumption consists of static power
(O(N/M)) and dynamic power (O(NMF)) components.
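To make the resource trade-off concrete, the two resource terms can be written with explicit constants. The constants a and b below are assumptions introduced for illustration (a for the logic that shrinks with M, b for the logic that grows with M); the source gives only the asymptotic form.

```latex
% Assumed concrete form of the resource model, with constants a, b > 0:
R(M) = a\,\frac{N}{M} + b\,N M

% Treating M as continuous, the stationary point is
\frac{dR}{dM} = -a\,\frac{N}{M^{2}} + b\,N = 0
\quad\Longrightarrow\quad
M^{*} = \sqrt{a/b}, \qquad R(M^{*}) = 2N\sqrt{ab}

% Comparison with the design that is not multi-pumped (M = 1):
R(1) - R(M) = N\,(M - 1)\left(\frac{a}{M} - b\right)
```

Under this assumed form, a factor M > 1 saves resources only while M < a/b, that is, while the saved computation logic outweighs the added control logic.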
TABLE I
EVALUATION RESULTS OF DEEPPUMP. RESOURCE EFFICIENCY IS
MEASURED IN GOP (GIGA OPERATIONS) PER SECOND PER SLICE.

                          [2]             [3]            DeepPump
FPGA                      Virtex VX485T   Zynq XC7045    Stratix V 5SGSDB
Technology                28 nm           28 nm          28 nm
Data Type                 32-bit float    16-bit fixed   16-bit fixed
Freq. (MHz)               100             150            150
Power (W)                 18.61           9.63           28.0
Perf. (GOP/s)             61.62           187.8          401.1
Efficiency (GOP/s/Slice)  0.81 × 10^-3    3.58 × 10^-3   6.11 × 10^-3

Moreover, we provide an optimisation flow for generating
multi-pumped designs with optimised parameters. It first
identifies constant parameters within the design model by
learning from real hardware builds. It then solves a
constrained optimisation problem with minimising latency as
objective, and with reduced resource usage and power
consumption as constraints.
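The two steps of this flow can be sketched as follows. The sketch makes several assumptions that are not in the source: the resource model is given the concrete form R = a·N/M + b·N·M, the constants are fitted by ordinary least squares, the build samples are synthetic, the power constants c and d are simply given, and all function names are hypothetical. Because the latency term in the design model does not depend on M, the sketch fixes N and F and searches only over M.

```python
def fit_resource_model(builds):
    """Least-squares fit of R = a*(N/M) + b*(N*M) from (N, M, R) samples."""
    s11 = s12 = s22 = t1 = t2 = 0.0
    for n, m, r in builds:
        x1, x2 = n / m, n * m  # the two model features
        s11 += x1 * x1
        s12 += x1 * x2
        s22 += x2 * x2
        t1 += x1 * r
        t2 += x2 * r
    det = s11 * s22 - s12 * s12  # determinant of the 2x2 normal equations
    if det == 0:
        raise ValueError("builds do not determine both constants")
    a = (t1 * s22 - t2 * s12) / det
    b = (s11 * t2 - s12 * t1) / det
    return a, b


def feasible_factors(n, f, a, b, c, d, max_resource, max_power, candidates):
    """Return (M, resource, power) for each candidate M within both budgets."""
    feasible = []
    for m in candidates:
        resource = a * n / m + b * n * m
        power = c * n / m + d * n * m * f  # static + dynamic terms
        if resource <= max_resource and power <= max_power:
            feasible.append((m, resource, power))
    return feasible


# Synthetic (N, M, resource) samples, NOT measurements from the paper.
builds = [(16, 1, 1680.0), (16, 2, 960.0), (16, 4, 720.0), (32, 2, 1920.0)]
a, b = fit_resource_model(builds)

# Hypothetical power constants and budgets, in arbitrary units.
options = feasible_factors(n=16, f=150.0, a=a, b=b, c=0.01, d=1e-5,
                           max_resource=1000.0, max_power=0.15,
                           candidates=[1, 2, 4, 8])
# Assumed tie-break: among feasible factors, prefer the lowest resource usage.
best = min(options, key=lambda option: option[1])
print(a, b)
print([m for m, _, _ in options], best[0])
```

With these synthetic samples the fit recovers constants close to a = 100 and b = 5, and the factors that meet both budgets are M = 2 and M = 4. A real flow would replace the exhaustive search with a proper constrained solver and would fit the power constants from builds as well.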
To evaluate DeepPump, we compare optimised designs generated
by DeepPump with designs from previous work [2], [3]. Our
implementation targets the Maxeler MAX4 dataflow engine and
processes multiple batches of 512×32×32 (channel × height ×
width) input feature maps with a convolution layer that has
512 output filters. Table I shows that the design generated by
DeepPump outperforms the other designs in both performance
(GOP/s) and resource efficiency (GOP/s/Slice). The number of
slices for a Stratix V device is estimated from the number of
resource groups with equivalent logic capacity.
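The size of the gap reported in Table I can be checked with simple arithmetic. The snippet below copies the performance and resource-efficiency figures from Table I and computes DeepPump's ratio against each earlier design; the dictionary layout and the function name are illustrative choices only.

```python
# Figures copied from Table I: (performance in GOP/s, efficiency in GOP/s/Slice).
designs = {
    "[2]": (61.62, 0.81e-3),
    "[3]": (187.8, 3.58e-3),
    "DeepPump": (401.1, 6.11e-3),
}


def ratios_vs(baseline, target="DeepPump"):
    """Return (performance ratio, efficiency ratio) of target over baseline."""
    perf_t, eff_t = designs[target]
    perf_b, eff_b = designs[baseline]
    return perf_t / perf_b, eff_t / eff_b


for name in ("[2]", "[3]"):
    perf_ratio, eff_ratio = ratios_vs(name)
    print(f"DeepPump vs {name}: {perf_ratio:.2f}x performance, "
          f"{eff_ratio:.2f}x resource efficiency")
```

The ratios come out at roughly 6.5x and 2.1x in performance, and roughly 7.5x and 1.7x in resource efficiency, against [2] and [3] respectively. The three designs use different devices, and [2] uses a different data type, so these ratios are indicative rather than like-for-like.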
III. CONCLUSION
This paper presents DeepPump, an approach that generates CNN
hardware designs with multi-pumping, which have competitive
performance when compared with previous designs. Future work
includes integrating DeepPump with other optimisations, and
providing further evaluations on various FPGA platforms.
ACKNOWLEDGMENT
The support of UK EPSRC (EP/L00058X/1, EP/L016796/1
and EP/N031768/1), the European Horizon 2020 Research and
Innovation Programme under grant agreement number 671653,
Maxeler and Intel is gratefully acknowledged.
REFERENCES
[1] A. Canis, J. H. Anderson, and S. D. Brown, “Multi-pumping for resource
reduction in FPGA high-level synthesis,” in DATE, 2013.
[2] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing
FPGA-based accelerator design for deep convolutional neural networks,”
in FPGA, 2015.
[3] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu,
S. Song, Y. Wang, and H. Yang, “Going deeper with embedded FPGA
platform for convolutional neural network,” in FPGA, 2016.
978-1-5090-4825-0/17/$31.00 ©2017 IEEE
