Lupulus: A Flexible Hardware Accelerator for Neural Networks by Kristensen, Andreas Toftegaard et al.
LUPULUS: A FLEXIBLE HARDWARE ACCELERATOR FOR NEURAL NETWORKS
Andreas Toftegaard Kristensen∗, Robert Giterman∗, Alexios Balatsoukas-Stimming†, and Andreas Burg∗
∗Telecommunication Circuits Laboratory, École polytechnique fédérale de Lausanne, Switzerland
†Department of Electrical Engineering, Eindhoven University of Technology, Netherlands
ABSTRACT
Neural networks have become indispensable for a wide range
of applications, but they suffer from high computational
and memory requirements, requiring optimizations from the
algorithmic description of the network to the hardware
implementation. Moreover, the high rate of innovation in ma-
chine learning makes it important that hardware implemen-
tations provide a high level of programmability to support
current and future requirements of neural networks. In this
work, we present a flexible hardware accelerator for neural
networks, called Lupulus, supporting various methods for
scheduling and mapping of operations onto the accelerator.
Lupulus was implemented in a 28 nm FD-SOI technology
and demonstrates a peak performance of 380 GOPS/GHz
with latencies of 21.4 ms and 183.6 ms for the convolutional
layers of AlexNet and VGG-16, respectively.
I. INTRODUCTION
Neural networks (NNs) achieve state-of-the-art performance
for a wide range of applications, including speech
recognition [1], time-series forecasting [2], and computer
vision [3]. However, this performance comes at the cost of
high computational complexity and storage, as billions of
multiply-accumulate (MAC) operations and many megabytes
of memory for the NN parameters are required [4]. This is a
problem in resource- and energy-constrained devices, which
necessitates optimizations from the algorithmic description
of the NN down to the hardware.
Several hardware accelerators (HwAs) have recently been
proposed to optimize the execution of NNs [5]–[13]. These
HwAs achieve high performance by exploiting the inher-
ent parallelism of NNs, splitting the computations across
hundreds of processing elements (PEs) with maximum data-
reuse within and across PEs using small local memories.
While these HwAs are relatively simple at a high level,
the large number of design parameters and requirements for
different NNs lead to complex design decisions. Moreover,
given the high cost and effort of FPGA/ASIC implementa-
tions and the rate of innovation in machine learning, HwAs
should provide a high level of programmability to support the
current and future requirements for NNs, while maintaining
a high utilization of the hardware resources.
Contribution: In this paper, we describe Lupulus, a flexible
HwA for NNs, which supports a variety of NN architectures
by applying different scheduling and operation mapping
strategies. We synthesize Lupulus using a 28 nm FD-SOI
technology with a 1 V operating voltage and a target
frequency of 1 GHz, providing a theoretical peak performance
of 380 GOPS. Results for NN execution time show that
Lupulus is capable of efficiently executing the different
layers of AlexNet and VGG-16, and outperforms a similar
accelerator on VGG-16 when on-chip resources and memory
interface bandwidth are matched.
Relation to Previous Work: Many existing accelerators,
such as [5]–[8], have small local memories in the PEs for the
weights, inputs, and partial sums, which may leave the local
memories for the partial sums underutilized when partial
sums are forwarded to a neighboring PE instead of being
stored in the PE itself. Moreover, the partial sums may have
to be read out to a high-level memory and then sent back
later if the local memories are too small. In our case, the
partial sums are stored for groups of PEs, making it easier to
fully utilize the memory for the partial sums. A similar archi-
tecture to Lupulus is [8], which also uses 3×3 blocks of PEs.
However, for 1×1 convolutions, only two PEs out of nine can
be turned on in [8], whereas our architecture can use all PEs.
II. BACKGROUND
In an NN, the raw input data is transformed into high-level
abstract representations to extract useful information in a
process called inference, involving multiple stages of non-
linear processing, each of which is referred to as a layer. In
feed-forward neural networks (FFNNs), the execution of a
layer is given as
$$a^{[L]} = f\left(W^{[L]} a^{[L-1]} + b^{[L]}\right), \qquad (1)$$
where $a^{[L-1]}$ is the output from layer $L-1$, $W^{[L]}$ are
the network weights, $b^{[L]}$ are the biases in layer $L$, and
$f(\cdot)$ is a non-linear activation function. Convolutional
neural networks (CNNs) are the most common deep neural
networks for image processing. For a CNN, the matrix-vector
product in (1) is replaced by the cross-correlation
operation, which for a 2-D input feature map $F$ is defined as
$$G[i, j] = \sum_{u=-k}^{k} \sum_{v=-k}^{k} h[u, v]\, F[i + u, j + v],$$
where $h$ is the kernel/filter of shape $(2k + 1) \times (2k + 1)$. When
CNNs are used with 3-D inputs, such as RGB images, the
input feature map consists of multiple 2-D planes, called
channels, and the weight matrix W is composed of multiple
3-D filters/kernels, individually applied to the input feature
map [14].
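As a minimal sketch of the 2-D cross-correlation defined above, the following Python function evaluates G[i, j] only where the kernel fits entirely inside the feature map. The function name `cross_correlate_2d`, the restriction to the valid region (no padding), and the storage of h[u, v] at list index [u + k][v + k] are assumptions made for illustration.

```python
def cross_correlate_2d(feature_map, kernel):
    """Valid-region 2-D cross-correlation of a feature map F with a kernel h.

    The kernel has shape (2k + 1) x (2k + 1); h[u, v] for u, v in [-k, k]
    is stored at kernel[u + k][v + k] (an indexing assumption).
    """
    k = len(kernel) // 2
    height, width = len(feature_map), len(feature_map[0])
    out = []
    for i in range(k, height - k):
        row = []
        for j in range(k, width - k):
            acc = 0
            for u in range(-k, k + 1):
                for v in range(-k, k + 1):
                    acc += kernel[u + k][v + k] * feature_map[i + u][j + v]
            row.append(acc)
        out.append(row)
    return out


# Hypothetical 4 x 4 feature map and 3 x 3 kernel (k = 1).
F = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
h = [[0, 0, 0],
     [0, 1, 0],
     [0, 0, 1]]
G = cross_correlate_2d(F, h)
```

Because the kernel is not flipped, each output is the centre pixel plus its lower-right neighbour, which is what distinguishes cross-correlation from convolution.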
arXiv:2005.01016v1 [eess.SP] 3 May 2020
Algorithm 1 Cross-correlation algorithm for CNNs, where
I is the input feature map, W are the weights, and O is
the output feature map. The height, width, and number of
channels of a data-structure A are denoted by hA, wA, and
cA, respectively, and s is the stride across I.
Require: I[cI][hI][wI], W[cO][cI][hF][wF], O[cO][hO][wO].
for gO ← 1 to cO do
    F ← W[gO]
    for iO ← 1 to hO do
        for jO ← 1 to wO do
            for gI ← 1 to cI do
                for iF ← 1 to hF do
                    for jF ← 1 to wF do
                        O[gO][iO][jO] ← O[gO][iO][jO]
                            + I[gI][s×iO+iF][s×jO+jF] × F[gI][iF][jF]
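The loop nest of Alg. 1 can be sketched in plain Python as follows. This is an illustrative software rendering, not the hardware schedule: the function name `conv_layer` is hypothetical, indices are zero-based, no padding is applied, and the output size is assumed to be (hI − hF)/s + 1 rounded down (likewise for the width).

```python
def conv_layer(inputs, weights, stride=1):
    """Cross-correlation of inputs[cI][hI][wI] with weights[cO][cI][hF][wF].

    Zero-based, unpadded variant of the six nested loops of Alg. 1.
    """
    c_in = len(inputs)
    h_in, w_in = len(inputs[0]), len(inputs[0][0])
    c_out = len(weights)
    h_f, w_f = len(weights[0][0]), len(weights[0][0][0])
    h_out = (h_in - h_f) // stride + 1  # assumed output height (no padding)
    w_out = (w_in - w_f) // stride + 1  # assumed output width (no padding)
    out = [[[0] * w_out for _ in range(h_out)] for _ in range(c_out)]
    for g_o in range(c_out):                      # output channels (filters)
        filt = weights[g_o]
        for i_o in range(h_out):                  # output rows
            for j_o in range(w_out):              # output columns
                for g_i in range(c_in):           # input channels
                    for i_f in range(h_f):        # filter rows
                        for j_f in range(w_f):    # filter columns
                            out[g_o][i_o][j_o] += (
                                inputs[g_i][stride * i_o + i_f][stride * j_o + j_f]
                                * filt[g_i][i_f][j_f]
                            )
    return out


# Hypothetical example: two 3 x 3 input channels and one 2 x 2 filter.
example_inputs = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
]
example_weights = [[[[1, 0], [0, 1]], [[1, 1], [1, 1]]]]
result = conv_layer(example_inputs, example_weights)
```

Reordering or tiling these six loops changes which operands stay local between iterations, which is the degree of freedom exploited by the reuse strategies discussed in Section II-A.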
II-A. Design Challenges and Optimizations
While NNs consist mainly of MAC operations, modern
NNs come in many shapes and sizes [15]–[17]. These
variations in layer size and parameters present a problem for
efficiently executing different networks on a custom HwA.
The full sequence of operations for a CNN can also be
represented as multiple nested loops, as shown in Alg. 1.
For a given number of PEs and some amount of local
memory, this loop structure can be optimized to better utilize
the available resources and reduce the required memory
access bandwidth, posing significant challenges on operation
scheduling and mapping to maximize data-reuse [18]. Input
reuse relates to the reuse of the input feature map I. Given
a filter of shape hF × wF and cO filters, each pixel in the
input feature map can be reused approximately hF×wF×cO
times, depending on the padding and stride. In partial sum
reuse, the partial sums from the inner-loop in Alg. 1 can be
shared between multiple PEs operating in parallel and stored
locally until the final result can be delivered. Furthermore,
given an output feature map of shape hO×wO, convolutional
reuse allows each filter to be reused hO×wO times for the
same input feature map channel as the filter is slid across it.
Finally, multiple input sets are usually grouped into batches
of size N , providing weight reuse as the same weights can
be used for all batches. Weight reuse is only beneficial for
increasing the throughput, and experience has shown that
most users prefer reduced latency when deploying NNs [19].
Therefore, we only consider input, partial sum, and con-
volutional reuse. Since different NNs (and even different
layers) favor different types of reuse under a set of hardware
constraints, the challenge is to provide efficient support for
all or most of them in an accelerator.
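To make the reuse counts above concrete, the following sketch evaluates the approximate input reuse (hF × wF × cO) and the convolutional reuse (hO × wO) for a given layer shape. The function name `reuse_factors` and the example layer shape are illustrative assumptions, and the input reuse ignores the effect of padding and stride, as noted in the text.

```python
def reuse_factors(h_f, w_f, c_out, h_out, w_out):
    """Approximate reuse counts for one convolutional layer.

    input_reuse: times each input pixel is used (ignores padding and stride).
    convolutional_reuse: times each filter is applied per input channel.
    """
    return {
        "input_reuse": h_f * w_f * c_out,
        "convolutional_reuse": h_out * w_out,
    }


# Hypothetical layer: 64 filters of shape 3 x 3 and a 224 x 224 output map.
factors = reuse_factors(h_f=3, w_f=3, c_out=64, h_out=224, w_out=224)
```

For this shape each input pixel can be reused up to 576 times and each filter 50176 times, which suggests why early layers with large feature maps favor convolutional reuse.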
III. LUPULUS: DESIGN AND IMPLEMENTATION
In this section, we present the design and implementation
of our proposed HwA, Lupulus, by providing a full system
overview. Then, we consider the scheduling and mapping
strategies supported by the HwA, which enable both data
reuse and the execution of different NN layers.
Fig. 1: Top-level view of Lupulus, with details of a single
PE in the top-right corner. The inputs, weights, and outputs
are stored in the input buffers, in the SPMs of the PEs, and
in the accumulators, respectively.
[Figure blocks: Input Buffers, Mesh Network, Processing Grid, Input Buffer Fetch Unit, PE SPM Fetch Unit, Grid Controller, Global Controller; PE detail: PE Controller, SPM, multiplier, adder; legend: Input, Weight, Output.]
III-A. High-Level Architecture
A top-level view of Lupulus is shown in Fig. 1. The HwA
is composed of controllers, a PE grid, input buffers, and a
mesh network connecting the input buffers to the PE grid.
The global controller is responsible for dispatching
offline-generated instructions to the fetch units, the processing
grid controller, and the PEs. The global controller also controls
which fetch units are running, and when the processing grid
can start processing the input feature map in the input buffers
and the weights in the PEs. The fetch units are programmed
with a loop structure, similar to Alg. 1, specifying when to
fetch different segments of the weights and inputs and how
to map these segments to the HwA. The input buffers contain
one SRAM instance per row of PEs. Extra memory banks for
double-buffering can be added to enable the overlapping of
fetching new inputs and the computation. The input buffers
connect to the PEs through a mesh network, which allows
for different ways of connecting the memories to the PEs.
Connections can be one-to-one as shown in Fig. 2b or one-
to-many as shown in Fig. 2a. The processing grid contains
the PE groups. Each PE group contains a compile-time
configurable number of PEs, which can be optimized for
a specific kernel size, and an accumulator module used to
accumulate values across the rows of PEs in the group and
store the partial sums. Small mux networks share the partial
sums between the PEs before reaching the accumulators
and allow for different groups of PEs to be merged. The
instructions received from the global controller configure
the PEs for different padding and striding options. Finally,
the processing grid controller configures the mesh network
between the input buffers and the PEs based on the NN
layer parameters and it indicates to the PEs which weights
to use from their local scratch-pad memories (SPMs). It also
communicates the status of the processing back to the global
controller to indicate when data can be freed from the input
buffers, the SPMs in the PEs, and the accumulators.
(a) Mapping and scheduling of four 3 × 3 kernels and one
channel of a 6 × 6 input feature map.
(b) Mapping and scheduling of 1 × 1 kernels and two channels
of a 6 × 6 input feature map.
Fig. 2: Mapping and scheduling of filters on a 6 × 6 PE grid.
III-B. Operation Scheduling and Mapping
We now consider the scheduling and mapping of the data
to the HwA, where scheduling refers to when different parts
of the data-structures are operated on and the mapping de-
scribes on which PE the associated operations are executed.
Fig. 2a provides an example with 3 × 3 kernels. In this
case, the kernel size matches the PE group and the weights
of each kernel are mapped directly to a single group of
PEs, indicated by the uniformly colored PEs in Fig. 2a.
The inputs are reused by all the filters that can be
scheduled simultaneously. If not all filters fit, the input
values must be loaded again later, lowering the input reuse. In
cases where the kernel is larger than the PE group, such as
a 5 × 5 kernel, the kernel is decomposed along the rows
into smaller chunks and extended horizontally into the next
PE group. Kernels can only be decomposed along the rows,
meaning that the PE grid width is the main limitation in terms
of mapping convolutional kernels. However, as small kernels
are the norm today, this is not a limiting issue.
Table I: Area for Lupulus synthesized for 28 nm at 1 GHz
and 1 V (TT).
Component        Area [µm²]   Percentage
Accumulators       387 440      32.54%
Controllers          3 564       0.30%
Input Buffers      238 110      20.00%
Fetch Units          9 528       0.80%
Networks            43 169       3.63%
PE Grid            488 325      41.02%
Total            1 190 600
Fig. 2b provides an example with 1 × 1 kernels, which
are typically applied later in the network to reduce the number
of channels. Therefore, they are usually quite deep, and
mapping single kernels to a full column of PEs across groups
provides a way of reducing the worst-case memory footprint
on the PEs, compared with mapping to a column in a single
PE group. The accumulators are then set to forward the
partial sums upwards to further accumulate with the partial
sums generated by the PE groups above. The same mapping
is used for FC layers, where the rows of the weight matrix
for an FC layer correspond to the filters and the number of
channels is equivalent to the width of the weight matrix.
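The FC mapping described above can be sketched in software as follows: each row of the weight matrix is treated as one deep 1 × 1 filter, its channels are split into chunks, and the chunk results are accumulated in sequence, loosely mirroring how partial sums are forwarded from one accumulator to the next. The function name `fc_as_column_mapping`, the chunk size of three channels per PE group, and the accumulation order are assumptions for illustration and are not taken from the Lupulus implementation.

```python
def fc_as_column_mapping(weight_matrix, activations, rows_per_group=3):
    """Compute W @ a by treating each row of W as a 1 x 1 filter.

    The channel dimension (width of W) is split into chunks of
    rows_per_group values; each chunk stands for one PE group (assumption).
    """
    outputs = []
    for filt in weight_matrix:                      # one row of W = one filter
        partial = 0
        for start in range(0, len(filt), rows_per_group):
            chunk_w = filt[start:start + rows_per_group]
            chunk_a = activations[start:start + rows_per_group]
            group_sum = sum(w * a for w, a in zip(chunk_w, chunk_a))
            partial += group_sum                    # forward and accumulate
        outputs.append(partial)
    return outputs


# Hypothetical FC layer with two outputs and six inputs.
W = [[1, 2, 3, 4, 5, 6],
     [1, 0, 1, 0, 1, 0]]
a = [1, 1, 2, 2, 3, 3]
fc_out = fc_as_column_mapping(W, a)
```

The result equals the ordinary matrix-vector product, independent of the chunk size, since the chunked sums are only a reassociation of the same additions.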
IV. RESULTS
In this section, we first present the results of synthesizing
Lupulus. Then, we consider the performance of the HwA on
a set of benchmarks and compare Lupulus with Eyeriss [5]
and similar HwAs in terms of performance.
IV-A. Synthesis Results
Table I shows the area results for the HwA, synthesized
using a 28 nm FD-SOI technology with a 1 V operating
voltage and a target frequency of 1 GHz. The input data
is quantized to 8 bits and partial sums to 16 bits. A grid
of 15 × 12 PEs with 3 × 3 PE groups is used to match
Eyeriss [5] in terms of maximum parallelism, with 32 B for
each PE, 256 B for each input buffer, and 2048 B for each
output buffer, adding up to 60.16 kB of memory with double
buffering enabled for the input buffers and the SPMs in the
PEs. The PEs, accumulators, and input buffers use most of
the area, with the control logic and networks of muxes taking
up an insignificant portion.
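The 60.16 kB total can be reproduced with the short calculation below. It rests on assumptions that are not stated explicitly in the text: the grid has 15 rows and 12 columns, there is one input buffer per PE row, there is one output buffer per 3 × 3 PE group, and double buffering doubles the SPM and input buffer capacity. All constant names are hypothetical.

```python
PE_ROWS, PE_COLS = 15, 12          # assumed orientation of the 15 x 12 grid
GROUP_ROWS, GROUP_COLS = 3, 3      # 3 x 3 PE groups
PE_SPM_BYTES = 32                  # per PE
INPUT_BUFFER_BYTES = 256           # per input buffer
OUTPUT_BUFFER_BYTES = 2048         # per output buffer

num_pes = PE_ROWS * PE_COLS
num_groups = (PE_ROWS // GROUP_ROWS) * (PE_COLS // GROUP_COLS)
num_input_buffers = PE_ROWS        # assumption: one buffer per PE row

# Double buffering (factor 2) for the PE SPMs and the input buffers only.
spm_total = 2 * num_pes * PE_SPM_BYTES
input_total = 2 * num_input_buffers * INPUT_BUFFER_BYTES
output_total = num_groups * OUTPUT_BUFFER_BYTES

total_bytes = spm_total + input_total + output_total
total_kb = total_bytes / 1000      # decimal kilobytes
```

Under these assumptions the output buffers account for 40960 B of the 60160 B total, consistent with the accumulators taking a large share of the area in Table I.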
IV-B. Benchmarks
We now consider the result of benchmarking on the con-
volutional layers of AlexNet [15] and VGG-16 [16]. While
AlexNet is an older network, it contains layers with many
different kernel sizes, which helps assess the limitations of
the Lupulus architecture as it is optimized for the 3×3 kernel
size. The benchmark results are generated using a high-level
model of Lupulus. The model considers the time it takes
to fetch, process, and write out the results, given a memory
interface with some bandwidth. The hardware parameters are
used to calculate the sizes of the different loops, similar to
Alg. 1, which determines the time for fetching/writing. The
computation time inside Lupulus is determined by the loop
structure and the number of pipeline stages.
(a) AlexNet [15] results with a total latency of 21.4 ms.
(b) VGG-16 [16] results with a total latency of 183.6 ms.
Fig. 3: Per-layer latency and percentage of the total clock
cycles for different operations shown as bars and lines,
respectively, for convolutional layers of AlexNet and VGG-16.
[Axes: Layer (x), Latency [ms] (left y), Percentage (right y); legend: PE Fetch, Input Fetch, Write, Compute.]
Fig. 3 shows the latency of Lupulus for AlexNet and
VGG-16. The bars show the latency for each layer and the
lines the percentage of clock cycles the HwA spends on
fetching data to the PEs, fetching data to the input buffers,
writing the results out, and computing the results. For the
model, we assume that the external memory is connected
through a 32-bit interface running at 250 MHz, providing a
bandwidth of 1 GB/s. The on-chip frequency is set to the
same as for synthesis, 1 GHz.
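The interface bandwidth assumed by the model follows directly from the bus width and clock rate, as the sketch below shows. The helper `transfer_time_ms` and the example payload (a 3-channel 224 × 224 input at one byte per value, in line with the 8-bit quantization) are illustrative assumptions; the sketch covers only the raw transfer time and is not the authors' high-level model.

```python
INTERFACE_BITS = 32
INTERFACE_CLOCK_HZ = 250_000_000

# 4 bytes per interface cycle at 250 MHz gives 1 GB/s.
bandwidth_bytes_per_s = (INTERFACE_BITS // 8) * INTERFACE_CLOCK_HZ


def transfer_time_ms(num_bytes, bandwidth=bandwidth_bytes_per_s):
    """Time in milliseconds to move num_bytes over the external interface."""
    return num_bytes / bandwidth * 1e3


# Hypothetical payload: 3 channels of 224 x 224 values, one byte each.
example_bytes = 3 * 224 * 224
example_ms = transfer_time_ms(example_bytes)
```

With the on-chip clock at 1 GHz, one 32-bit interface word arrives every four on-chip cycles, so the PEs can only stay busy if fetched values are reused many times.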
For both networks, the memory interface is used almost
100% of the time, as shown in Fig. 3. Initially, when the
filters are shallow, the memory interface is dominated by
fetching input feature map values. Deeper in the network,
the weight fetching dominates, as shown for the last layers
in AlexNet and VGG-16. The percentage of clock cycles
in which the PEs are active also varies significantly, with
80% in layer 2 of AlexNet and 40% in the remaining
convolutional layers. While the on-chip clock frequency can
be slightly lowered to increase the utilization, with only a
small increase in the total latency, the utilization can never
reach 100% with a single-channel memory interface. With
only a single channel, it is not possible to simultaneously
read and write to the external memory, meaning that when
the memories in the accumulators are full, the PEs stall.
Table II: Comparison of Lupulus HwA with Eyeriss [5].
Network    Eyeriss      Ours        Speedup
AlexNet      28.8 ms     91.6 ms    0.31
VGG-16     1436.5 ms    773.0 ms    1.86
Table III: Performance comparison with similar HwAs,
indicating the peak performance at the maximum frequency.
                    [5]     [8]     [20]    Lupulus
Process [nm]        65      65      65      28
Clk [GHz]           0.25    0.5     0.75    1.0
Gate Count [kGE]    1176    1300    697     799
Memory [kB]         181.5   112     43      60.61
Bit-width           16      16      8       8
GOPS                84      152     274     380
GOPS/GHz            336     304     365     380
While performance comparisons with other HwAs are
difficult due to different architectural optimizations, process
technologies, and methods of reporting results, we can
compare our benchmark results against those of Eyeriss [5]
due to similarities in the architectures and their level of com-
plexity. Table II shows the results for AlexNet and VGG-16
when model parameters such as on-chip clock-frequency
and speed of the external memory interface are adjusted
to match Eyeriss. For AlexNet, our HwA is approximately
3× slower, but close to twice as fast for VGG-16. The
difference for AlexNet is due to Eyeriss having two layers
of memory, a shared global memory and local memories in
the PEs for the inputs, weights, and outputs. This makes
Eyeriss able to better utilize the input feature maps for
larger kernel sizes, whereas Lupulus has lower input reuse
in this case. However, Lupulus processes the smaller filters
very efficiently as the inputs are streamed in and processed
directly, whereas Eyeriss first has to load these into the PE
memories. As VGG-16 represents a more modern workload,
with small kernel sizes, the results for the current version of
the HwA are satisfactory.
Finally, Table III compares similar accelerators, showing
the process, the maximum frequency, the logic area, the
amount of on-chip memory, the bit-width for inputs, and the
GOPS (at the reported frequency and normalized to 1 GHz).
In terms of gate count and performance, Lupulus achieves
the highest peak performance among the considered acceler-
ators, at a 32% and 39% lower gate count compared to [5]
and [8], respectively. However, we use 14% more gates
than [20] with only a 4% improvement in GOPS/GHz.
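The derived figures in Tables II and III can be recomputed from the reported raw values, as in the sketch below. The dictionary layout and the helper `relative_change` are illustrative assumptions; the numbers are copied from the two tables. Speedup is taken as the Eyeriss latency divided by the Lupulus latency, and GOPS/GHz as GOPS divided by the clock frequency.

```python
eyeriss_ms = {"AlexNet": 28.8, "VGG-16": 1436.5}
lupulus_ms = {"AlexNet": 91.6, "VGG-16": 773.0}
speedup = {net: eyeriss_ms[net] / lupulus_ms[net] for net in eyeriss_ms}

accelerators = {
    "[5]": {"clk_ghz": 0.25, "gops": 84, "kge": 1176},
    "[8]": {"clk_ghz": 0.5, "gops": 152, "kge": 1300},
    "[20]": {"clk_ghz": 0.75, "gops": 274, "kge": 697},
    "Lupulus": {"clk_ghz": 1.0, "gops": 380, "kge": 799},
}

# Peak performance normalized to a 1 GHz clock.
gops_per_ghz = {
    name: acc["gops"] / acc["clk_ghz"] for name, acc in accelerators.items()
}


def relative_change(new, old):
    """Percentage change of new relative to old (negative means lower)."""
    return (new - old) / old * 100


# Gate count of Lupulus relative to each of the other accelerators.
gate_change = {
    name: relative_change(accelerators["Lupulus"]["kge"], acc["kge"])
    for name, acc in accelerators.items()
    if name != "Lupulus"
}
```

Evaluated this way, the gate count is roughly 32% and 39% lower than [5] and [8], and the difference to [20] comes out at about 14.6%, which the text reports as 14%.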
V. CONCLUSION
In this paper, we described Lupulus, a flexible hardware
architecture supporting different types of NN architectures.
Lupulus provides the capability of merging different groups
of PEs or overlapping small convolutional kernels inside groups
to improve the utilization of the local memories and the PEs
depending on the type of network. Lupulus can be optimized
for a specific kernel size to maximize the performance, while
still supporting the execution of different NNs efficiently,
with just a single layer of memory. Lupulus was implemented
in a 28 nm FD-SOI technology utilizing 60 kB of
on-chip memory and demonstrating a peak performance of
380 GOPS/GHz with latencies of 21.4 ms and 183.6 ms for
the convolutional layers of AlexNet and VGG-16, respectively.
VI. REFERENCES
[1] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed,
N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N.
Sainath, and B. Kingsbury, “Deep neural networks for
acoustic modeling in speech recognition: The shared
views of four research groups,” IEEE Signal Processing
Magazine, vol. 29, no. 6, pp. 82–97, Nov. 2012.
[2] J. Schmidhuber, “Deep learning in neural networks:
An overview,” Neural Networks, vol. 61, pp. 85–117,
Jan. 2015. [Online]. Available: http://dx.doi.org/10.
1016/j.neunet.2014.09.003
[3] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,”
Nature, vol. 521, no. 7553, pp. 436–444, May 2015.
[4] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual
learning for image recognition,” in IEEE Conference
on Computer Vision and Pattern Recognition (CVPR),
June 2016.
[5] Y. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss:
An energy-efficient reconfigurable accelerator for deep
convolutional neural networks,” IEEE Journal of Solid-
State Circuits, vol. 52, no. 1, pp. 127–138, Jan. 2017.
[6] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A.
Horowitz, and W. J. Dally, “EIE: Efficient inference
engine on compressed deep neural network,” in Annual
International Symposium on Computer Architecture
(ISCA), 2016, pp. 243–254. [Online]. Available:
https://doi.org/10.1109/ISCA.2016.30
[7] X. Wei, C. H. Yu, P. Zhang, Y. Chen, Y. Wang, H. Hu,
Y. Liang, and J. Cong, “Automated systolic array ar-
chitecture synthesis for high throughput CNN inference
on FPGAs,” in ACM/EDAC/IEEE Design Automation
Conference (DAC), June 2017, pp. 1–6.
[8] L. Du, Y. Du, Y. Li, J. Su, Y. Kuan, C. Liu, and M. F.
Chang, “A reconfigurable streaming deep convolutional
neural network accelerator for internet of things,” IEEE
Transactions on Circuits and Systems I: Regular Pa-
pers, vol. 65, no. 1, pp. 198–208, Jan. 2018.
[9] R. Andri, L. Cavigelli, D. Rossi, and L. Benini,
“YodaNN: An architecture for ultralow power binary-weight
CNN acceleration,” IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Sys-
tems, vol. 37, no. 1, pp. 48–60, Jan 2018.
[10] K. Hegde, J. Yu, R. Agrawal, M. Yan, M. Pellauer,
and C. W. Fletcher, “UCNN: Exploiting computational
reuse in deep neural networks via weight repetition,”
in Annual International Symposium on Computer Ar-
chitecture (ISCA), 2018, pp. 674–687.
[11] S. Yin, P. Ouyang, S. Tang, F. Tu, X. Li, S. Zheng,
T. Lu, J. Gu, L. Liu, and S. Wei, “A high energy ef-
ficient reconfigurable hybrid neural network processor
for deep learning applications,” IEEE Journal of Solid-
State Circuits, vol. 53, no. 4, pp. 968–982, 2017.
[12] G. Desoli, N. Chawla, T. Boesch, S.-p. Singh,
E. Guidetti, F. De Ambroggi, T. Majo, P. Zambotti,
M. Ayodhyawasi, H. Singh, and N. Aggarwal, “A 2.9
TOPS/W deep convolutional neural network SoC in FD-SOI
28nm for intelligent embedded systems,” in IEEE
International Solid-State Circuits Conference (ISSCC),
2017, pp. 238–239.
[13] T. Luo, S. Liu, L. Li, Y. Wang, S. Zhang, T. Chen,
Z. Xu, O. Temam, and Y. Chen, “DaDianNao: A
neural network supercomputer,” IEEE Transactions on
Computers, vol. 66, no. 1, pp. 73–88, Jan. 2016.
[14] Y. LeCun, K. Kavukcuoglu, and C. Farabet, “Con-
volutional networks and applications in vision,” in
IEEE International Symposium on Circuits and Systems
(ISCAS), May 2010, pp. 253–256.
[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton,
“ImageNet classification with deep convolutional
neural networks,” Communications of the ACM,
vol. 60, no. 6, pp. 84–90, May 2017. [Online].
Available: http://doi.acm.org/10.1145/3065386
[16] K. Simonyan and A. Zisserman, “Very deep convo-
lutional networks for large-scale image recognition,”
arXiv 1409.1556, Sep. 2014.
[17] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed,
D. Anguelov, D. Erhan, V. Vanhoucke, and A. Ra-
binovich, “Going deeper with convolutions,” in IEEE
Conference on Computer Vision and Pattern Recogni-
tion (CVPR), 2015, pp. 1–9.
[18] V. Sze, Y. Chen, T. Yang, and J. S. Emer, “Efficient
processing of deep neural networks: A tutorial and
survey,” Proceedings of the IEEE, vol. 105, no. 12, pp.
2295–2329, Dec 2017.
[19] N. P. Jouppi, C. Young, N. Patil, D. Patterson,
G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden,
A. Borchers et al., “In-datacenter performance analysis
of a tensor processing unit,” in Annual International
Symposium on Computer Architecture (ISCA), 2017,
pp. 1–12.
[20] L. Cavigelli and L. Benini, “Origami: A 803-GOp/s/W
convolutional network accelerator,” IEEE Transactions
on Circuits and Systems for Video Technology, vol. 27,
no. 11, pp. 2461–2475, 2016.
