NeuroMAX: A High Throughput, Multi-Threaded, Log-Based Accelerator for
  Convolutional Neural Networks by Qureshi, Mahmood Azhar & Munir, Arslan
Mahmood Azharreshi
Kansas State University
Manhaan, Kansas
mahmood102@ksu.edu
Arslan Munir
Kansas State University
Manhaan, Kansas
amunir@ksu.edu
ABSTRACT
Convolutional neural networks (CNNs) require high throughput
hardware accelerators for real time applications owing to their
huge computational cost. Most traditional CNN accelerators rely
on single core, linear processing elements (PEs) in conjunction with
1D dataows for accelerating convolution operations. is limits
the maximum achievable ratio of peak throughput per PE count
to unity. Most of the past works optimize their dataows to aain
close to a 100% hardware utilization to reach this ratio. In this paper,
we introduce a high throughput, multi-threaded, log-based PE core.
e designed core provides a 200% increase in peak throughput
per PE count while only incurring a 6% increase in area overhead
compared to a single, linear multiplier PE core with same output bit
precision. We also present a 2D weight broadcast dataow which
exploits the multi-threaded nature of the PE cores to achieve a
high hardware utilization per layer for various CNNs. e entire
architecture, which we refer to as NeuroMAX, is implemented
on Xilinx Zynq 7020 SoC at 200 MHz processing clock. Detailed
analysis is performed on throughput, hardware utilization, area and
power breakdown, and latency to show performance improvement
compared to previous FPGA and ASIC designs.
KEYWORDS
Convolutional neural networks (CNNs) , hardware accelerator,
multi-threaded, throughput, hardware utilization
1 INTRODUCTION
Convolutional neural networks (CNNs) enable embedding of AI
into devices for vision-based applications with unprecedented
accuracy. The early high accuracy CNNs [1–3] required tens
of millions of parameters and computations for one inference pass.
This computational complexity, along with high memory requirements,
greatly hampered their deployment on low energy, resource
constrained devices. In addition, many CNN architectures
used varying kernel sizes, which results in a reconfigurability
requirement as well as low hardware utilization in accelerator designs.
Separable convolution for CNNs was first introduced in
MobileNets [4, 5] to reduce the number of multiply and accumulates
(MACs). In addition, many modern CNNs use kernels of size 3×3 to
promote ease of accelerator design with high hardware utilization
and throughput.
Design of an ecient dataow for scheduling data into the ac-
celerator is equally important. An inecient dataow results in
reduced hardware utilization which causes a decrease in through-
put. Dataow should also promote the reusability of data since,
in most cases, the same kernels are being applied on the entire
input feature map. It has been shown previously that the move-
ment of data to/from DDR memory is 200× more costly in terms of
energy consumption than a standard MAC operation [6]. us, the
dataow design should not only optimize the throughput and area,
but also the data movement, in order to ensure reduced energy
expenditure.
Log-based accelerators have recently gained considerable traction
because of their simpler structure as compared to traditional
accelerators with linear processing elements (PEs). Each PE in
traditional CNN accelerator cores is essentially responsible for one
multiplication in the convolution operation. Log PEs replace the bulky
multiplier cores with low cost barrel shifters without incurring a
significant loss in accuracy. We clarify that cost here primarily refers to
the area cost, which is determined by the number of LookUp Tables
(LUTs) for field-programmable gate arrays (FPGAs) and gate count
for application-specific integrated circuits (ASICs). This area cost
is important because there are limited resources on-chip and thus
this area cost also translates to the monetary cost of the system-on-chip
(SoC). Many past approaches have designed log-based PEs
but have not exploited the low cost and overhead of such PEs. They
instead rely on already established spatial architectures and 1D
dataflows used for linear PEs. Our proposed NeuroMAX accelerator
core comprises 108 PEs arranged in a 6 × 3 × 6, 3D spatial
grid. The presented accelerator optimizes the most commonly used
3 × 3 and 1 × 1 kernel sizes to achieve high throughput and utilization.
It can also be used for larger kernel sizes because of its grid
structure and configurable 2D dataflow. Our main contributions
are as follows:
• We design a multi-threaded, low cost, log-based PE core.
Using this core, we generate a spatial grid of 108 PEs, capa-
ble of performing a wide variety of convolution operations
with high hardware utilization.
• We develop a 2D dataow which exploits the thread based
PE design to maximize the throughput and enhance data
reuse to minimize the DDR memory access.
• We implement the entire NeuroMAX architecture in an
FPGA and show improved performance in terms of area,
throughput, hardware utilization, latency and power e-
ciency compared to past approaches.
2 RELATED WORK
Many hardware accelerators have been proposed recently and in the
past. [7] proposed a non-systolic array, reconfigurable spatial
architecture along with a new dataflow scheme called row stationary to
maximize the data reuse. However, this design incurs high PE cost
owing to local storage and control in each PE. It also has low hardware
utilization which results in low throughput per PE. [8] proposes an
FPGA-based CNN accelerator with integrated depth-wise separable
mode of operation. This accelerator, however, has low throughput
because of the usage of 32-bit floating point format. [9] proposes an
FPGA-based CNN accelerator having a dedicated matrix multiplication
engine (MME) on Arria 10 SoC. It achieves a frame rate of 266
fps; however, its MME has a huge digital signal processing
(DSP) cost of 1200+ DSP blocks. [10] is the improved version of [7]
with higher hardware utilization and throughput. [11] introduced
the concept of logarithmic data representation for neural network
accelerator designs. It also gives an accuracy comparison between
linear and log quantization. [12] proposes an accelerator design
using an arbitrary log base. It, however, does not utilize the low
hardware overhead of the log-based PE and instead relies on linear PE
arrangements. [13] proposes a reconfigurable design for various
convolution kernels. It uses a propagated input data flow scheme
but incurs high latency and low hardware utilization. [14] proposes
a rescheduled dataflow for convolution to optimize the energy
efficiency. [15] proposes a vectorwise accelerator architecture with the
goal of maximizing the hardware utilization. It supports various
kernel sizes from 1 × 1 to 5 × 5.
arXiv:2007.09578v1 [cs.AR] 19 Jul 2020
Figure 1: Linear vs. Log Quantization (a) 1.5 bits linear VGG16 net (b) 5.0 bits log VGG16 (c) 5.1 bits log VGG16 (d) 1.5 bits linear
SqueezeNet (e) 5.0 bits log SqueezeNet (f) 5.1 bits log SqueezeNet
Although some of the recent designs achieve high hardware
utilization, they are not able to increase the peak throughput per
PE count beyond unity owing to the use of single core, linear PEs
with high area cost. This paper overcomes the limitations of prior
works by leveraging log PEs with multiple low cost threads within
each log PE, and designing a 2D dataflow which promises high
throughput by exploiting multi-level parallelism.
3 LOG MAPPING
Log mapping or log quantization maps an input value x to a
logarithmically quantized value x′. Many trained neural nets have
weights w and input activations a which are non-uniformly
distributed. Mapping these 32-bit floating point (fp32), non-uniformly
distributed values to fixed point, linearly quantized values
introduces a significant amount of quantization noise for small bit widths.
Most hardware platforms use fixed point arithmetic for data
manipulation where the fixed point number is represented in signed Qm.n
format. Here, m represents the integer part whereas n represents
the fractional part. The range of values which can be represented
is $range_{lin} = [-2^{m-1},\; 2^{m-1} - \epsilon]$ where $\epsilon = 2^{-n}$ is the step size.
A linear quantizer rounds the fp32 value to the nearest multiple
of $\epsilon$ and then clips it as follows:
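$$x_q = \mathrm{clip}\left[\mathrm{round}\left(\frac{x}{\epsilon}\right) \cdot \epsilon,\; -2^{m-1},\; 2^{m-1} - \epsilon\right] \tag{1}$$
where,
$$\mathrm{clip}(x, min, max) = \begin{cases} max, & x \geq max \\ x, & min < x < max \\ min, & \text{otherwise} \end{cases} \tag{2}$$
Figure 2: NeuroMAX System Architecture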
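As a rough illustration of equations (1) and (2), the following Python sketch quantizes a single value to signed Qm.n. The function names are ours, and Python's built-in round (which sends exact ties to the nearest even integer) stands in for the rounding mode of the hardware, which the text does not specify.

```python
def clip(x, lo, hi):
    """Equation (2): saturate x to the range [lo, hi]."""
    if x >= hi:
        return hi
    if x > lo:
        return x
    return lo


def linear_quantize(x, m, n):
    """Equation (1): round x to a multiple of eps = 2**-n, then clip it
    to the signed Qm.n range [-2**(m-1), 2**(m-1) - eps]."""
    eps = 2.0 ** -n
    lo = -(2.0 ** (m - 1))
    hi = 2.0 ** (m - 1) - eps
    # Built-in round() is an assumption; ties go to the nearest even integer.
    return clip(round(x / eps) * eps, lo, hi)


# Example with m = 1, n = 5 (step size 1/32, range [-1, 0.96875]).
for value in (0.3, 5.0, -5.0):
    print(value, "->", linear_quantize(value, 1, 5))
```

Values outside the representable range saturate at the range limits, which is the source of clipping error for weights with a long-tailed distribution.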
A log quantizer takes as input x and the quantization parameters
<m, n, b>, where b is the logarithmic base, and produces a log
quantized value x′ as output. The quantization process can be
written as:
$$x' = \mathrm{clip}\left[\mathrm{round}\left(\log_b(|x|)\right),\; -2^{m-1},\; 2^{m-1} - \epsilon\right] \tag{3}$$
$$x_q = \begin{cases} 0, & x = 0 \\ \mathrm{sign}(x) \cdot b^{x'}, & \text{otherwise} \end{cases} \tag{4}$$
Figure 1 shows some of the quantization results for the first
five convolution layers of VGG16 [2] and SqueezeNet [16]. Instead
of using log base-2 for quantization, we use log base-√2 for more
accurate mapping as shown in Figure 1(c) and (f). In fact, we observe
that VGG16, pretrained on the ImageNet dataset with fp32 data, after
base-√2 quantization, has its top-1 accuracy decrease by only ≈3.5%,
from 67.5% to 63.8%. This is opposed to log base-2 quantization,
which decreases the accuracy by ≈10%. This observation has also
been verified in [11].
Figure 3: (a) Compute read (b) Collection of reads to Make a PE (d) 6x3 PE Matrix 0 and Adder Net 0
Figure 4: Adder Net 0 psum Generation
4 HARDWARE ARCHITECTURE
4.1 Top-Level
Figure 2 shows the top level hardware architecture of the proposed
NeuroMAX CNN accelerator on the Zynq-7020 SoC. The CONV core
is the accelerator module containing a memory block, a state
controller, a PE grid, adder stages and a post processing module. The
memory block contains the weight, input and output SRAMs with
a total cumulative size of 3.8 Mb. The PE grid consists of 108 PEs
arranged in a 6×3×6 3D array. Figure 2 also shows the internal
structure of the PE grid containing PE matrices numbered from 0 to 5.
The PE matrices are all connected to their respective input, weight
and output SRAM blocks. Each PE matrix processes independent
channels in parallel for standard and separable convolutions to
maximize the throughput. The outputs from the PE matrices are
provided to their respective adder nets within the PE grid. A total
of six adder net 0s are present, corresponding to the six PE matrices.
The configuration of these adder nets remains constant regardless
of the type of convolution used or the filter size. The output from
the adder net 0 is provided to six configurable two-stage adders
whose input connections change based on the filter size and the
convolution type. The first adder stage is referred to as adder net 1
and the second stage is the channel accumulation stage.
To perform a convolution operation, a tile of log quantized input
fmap and weight data is loaded from the off-chip DDR memory
into the SRAMs in the CONV core by AXI DMA and interconnect.
The processor also sends the parameter information containing
the values for filter size, input width, input height, output width,
output height and total channels to the state controller inside the
CONV core. The state controller modifies the configurable adders
and determines the dataflow to be used for the convolution operation.
The linear convolution outputs are sent to the post processing block
which performs the ReLU operation and quantizes the results back into
log values using a pre-computed log table. These output log values
are loaded into the output SRAMs and sent back to the off-chip DDR
memory to be used for processing the next layer. No intermediate
outputs or partial sums are stored in the DDR memory and all the
intermediate processing is done within the CONV core to minimize
the off-chip traffic.
4.2 PE Matrix
Figure 3 shows the hierarchical design of a single PE matrix (PE
matrix 0) from left to right in a bottom-up view. Each PE receives
a 1D vector of weight values and one input value. It should be
noted that both the input (i0’) and the weight values (w00-2’) are log
quantized. The output vectors from the PEs are provided to the adder
net 0 which generates 18 psums (o1-o18). This adder net works by
summing the same color coded values generated within a row of
PEs as shown in Figure 4.
Figure 3(b) shows the internal structure of a single PE element
(PE0 0 0). There are three compute cores or threads, each processing
a single weight value and the input value, which in turn produce three
outputs (p11, p12, p13). The lowest level of the PE matrix is a thread
within an individual PE as shown in Figure 3(a). The basic log-based
multiplication operation is performed in a single thread. Assuming
we have two log quantized values, $w'_q$ and $a'_q$, representing the
original weight ($w_q$) and the activation input ($a_q$) respectively, the
multiplication of these values in the log domain can be carried out as:
$$w_q a_q = \mathrm{sign}(w_q) \cdot 2^{g'_q} \tag{5}$$
where,
$$g'_q = w'_q + a'_q \tag{6}$$
[12] showed a method to implement the exponential in equation
(5) in hardware by decomposing the exponent into its integer and
fractional parts as:
$$w_q a_q = \mathrm{sign}(w_q) \cdot 2^{\mathrm{INT}(g'_q)} \cdot 2^{\mathrm{FRAC}(g'_q)} \tag{7}$$
The integer part $2^{\mathrm{INT}(g'_q)}$ can be implemented by a shift
operation, whereas the fractional part can be pre-computed and stored
within the thread. The total number of fractional computations
depends on the total number of fractional bits (n) used. In our case,
we have n = 1 and thus store $2^n = 2$ values in the thread memory.
Equation (7) can now be rewritten as:
$$w_q a_q = \mathrm{sign}(w_q) \cdot \left(\mathrm{LUT}(\mathrm{FRAC}(g'_q)) \gg \neg\,\mathrm{INT}(g'_q)\right) \tag{8}$$
The hardware implementation of equation (8) is shown in Figure
3(a). Since weights can have negative values, which is not accounted
for in the log computations, we use an additional bit to represent the
log weight data with the most significant bit $w'_q[6]$ representing the
sign of the weight before quantization. This is not required for the
input fmap values since most modern CNNs use ReLU activations
which eliminate the negative outputs.
Figure 5: A Convolution Example with 12×6 Input, 3×3 Filter for Stride 1 and 2
Figure 6: State Controller Operation (a) Input, Stride 1 (b) Filter Weights (c) Input, Stride 2
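The shift-and-lookup multiplication of equations (6) to (8) can be modelled in software as follows. This is a behavioural sketch under our own assumptions, not the circuit of Figure 3(a): exponents are stored as signed integers in units of 0.5 (one fractional bit), the LUT entries use a hypothetical 16-bit fixed-point format, and a plain negated shift amount replaces the bitwise ¬ of equation (8). The names thread_multiply and pe_forward are ours.

```python
FRAC_BITS = 16  # hypothetical fixed-point precision of the LUT entries

# n = 1 fractional bit, so 2**n = 2 pre-computed values: 2**0 and 2**0.5.
LUT = [round(2.0 ** (f / 2) * (1 << FRAC_BITS)) for f in range(2)]


def thread_multiply(w_code, w_sign, a_code):
    """One thread: multiply a log weight by a log activation.

    w_code and a_code are log2 exponents in units of 0.5; w_sign is +1 or -1.
    Returns the product as an integer with FRAC_BITS fractional bits.
    """
    g = w_code + a_code   # equation (6): exponents add in the log domain
    int_part = g >> 1     # INT(g'): floor of the exponent
    frac_part = g & 1     # FRAC(g'): selects one of the two LUT entries
    mantissa = LUT[frac_part]
    if int_part >= 0:
        magnitude = mantissa << int_part
    else:
        magnitude = mantissa >> -int_part
    return w_sign * magnitude


def pe_forward(weights, a_code):
    """One PE: three threads share one input and use one weight each."""
    return [thread_multiply(w_code, w_sign, a_code) for w_code, w_sign in weights]


# Weights 1, -2 and 0.5 applied to the activation 1 (exponent code 0).
print([p / (1 << FRAC_BITS) for p in pe_forward([(0, 1), (2, -1), (-2, 1)], 0)])
```

No multiplier appears anywhere in the thread: the product needs one addition, one table lookup and one shift, which is why several threads fit in roughly the area of one linear multiplier.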
5 DATA FLOW AND PROCESSING
The main idea behind designing an efficient dataflow is to minimize
the data movement to/from the costly off-chip DDR memory. One
MAC operation typically requires three memory reads, corresponding
to the weight, ifmap and psum, and one memory write corresponding
to the updated psum. A neural net like AlexNet, with 724M MACs,
will need ≈3000M DDR memory accesses. Many efficient dataflows
have been presented in the literature to minimize this data movement.
Some of these include output stationary, weight stationary and row
stationary [17]. Since the convolution operation requires the reuse
of filter weights, inputs and psums in successive operations, the
dataflows are designed to optimize the reusability without accessing
the DDR memory. We introduce a 2D weight broadcast dataflow
for maximizing the reusability of the weights, inputs and psums.
Figure 7: 2D Weight Broadcast Dataflow
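The ≈3000M figure quoted above follows from counting four DDR accesses per MAC. A small sketch of that arithmetic, assuming no on-chip reuse at all (the function name is ours):

```python
def naive_ddr_accesses(num_macs, reads_per_mac=3, writes_per_mac=1):
    """DDR accesses if every operand of every MAC goes to off-chip memory."""
    return num_macs * (reads_per_mac + writes_per_mac)


alexnet_macs = 724_000_000  # MAC count for AlexNet quoted in the text
print(naive_ddr_accesses(alexnet_macs) / 1e6, "M accesses")
```

Every operand that a dataflow keeps on-chip between MACs removes accesses from this worst-case count.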
5.1 3 x 3 Convolution
Figure 5 shows a 3 × 3 convolution example. Here, a 12 × 6 input is
convolved with a 3 × 3 filter to produce a 10 × 4 output for stride 1
and a 6 × 3 output for stride 2. A total of 108 bits, corresponding
to the 6 × 3 input tile, are received from the AXI4 interconnect
and stored in the input SRAM. This input tile is modified by the
state controller and provided to the PE matrix in a row shifted
pattern as shown in Figure 6(a) and (c) for stride 1 and stride 2,
respectively. We also acquire a 2D weight array and broadcast it to
the PE matrix as shown in Figure 6(b). Figure 7 shows the dataflow
and the operation of the PE matrix for the first 6x3 input tile and
the weight matrix at time stamp t = 1. The entire input tile and
the 2D weight array are loaded into the PE matrix simultaneously.
Because of the multi-threaded structure of the PEs, each PE performs
three multiplication operations using three threads and the outputs
are row-wise summed to generate the psums (o1-o18) using adder
net 0 (Figure 4). The dataflow chart and the processing of the entire
12 × 6 input is shown in Figure 8. The output a0wa012, in Figure
8, represents the three outputs a0wa0, a0wa1, a0wa2 generated by
the three threads within a PE. The adder net 0 computes the partial
sum outputs (o1-o18) the same way as shown in Figure 4, where
p11 = a0wa0, p14 = b0wb0 and p17 = c0wc0 are the same colored
(green) outputs along the row.
The dark red outputs (rows 5 and 6) in Figure 5 for stride 1 and
green outputs (row 3) for stride 2 represent the boundary outputs.
The boundary condition occurs when the filter overlaps two
different column-wise input tile sectors. For clarity, we assume that
the first input tile at t = 1 is processed by the PE matrix. This
corresponds to the first six rows and the first three columns of the
input. The PE matrix will process the last row-wise input tile at t =
4, which corresponds to the first six rows and the last three columns
of the input as shown in Figure 6(a). The input tile will then jump
to the next column-wise 6x3 input tile which corresponds to the
last six rows and the first three columns at t = 5 as shown in Figure
6(a). However, it can be seen that rows 5 and 6 in the output
are dependent on the overlapping results from the two adjacent
column-wise input tile sectors (e.g. at t = 1 and t = 5, t = 2 and t = 6
and so on). To resolve this, the three dependent psums (o13, o17
and o16), generated from row 5 and row 6 of the first column-wise tile
sector of the 12x6 input, are passed through a variable length shift
register with the maximum length equal to the width of the input.
These psums are subsequently utilized when the next column-wise
6x3 input tile (a6 to a11) is being processed. Thus, rows 1 to 4
in the output are generated during the time interval t = 1 to t = 4,
whereas rows 5 to 10 are generated during the time interval t =
5 to t = 8.
Figure 8: Dataflow Chart for 3×3 Stride 1 Convolution in Figure 5
Figure 9: Adder Net 1 Configuration (a) Stride 1 (b) Stride 2
Figure 10: 1 × 1 Convolution Example
The output in Figure 5, for stride 1, is generated by alternate
colored, column-wise summation of the psums in the adder net 1 as
shown in Figure 9(a). Figure 9(b) shows the output generation for the
stride 2 case. The shift registers (VAR Len SR) for generating the
boundary outputs are also shown in Figure 9. It can be observed that
because of the optimized dataflow, only 2 out of 18 or 11% of the psums
require local storage as opposed to >50% of the psums requiring storage
(local or off-chip) in previously proposed dataflows. The throughput
for the above example is 45 OPS/cycle (total OPS/total cycles =
360/8 = 45), which results in an 83.3% overall thread utilization
((45/(3 × 6 × 3)) × 100). We will simply use thread utilization as hardware
utilization in this context.
5.2 1 x 1 Convolution
1 × 1 convolutions are very popular in modern CNNs. These
convolutions, along with depth-wise separable convolutions, are replacing
the normal 2-D convolutions because of the smaller number of MAC
operations [4]. The 1 × 1 CONV operation convolves 1 × 1 × C × P filters
with an M × N × C input to produce M × N × P outputs. Here, C is
the number of channels, P is the number of filters, M is the input
width and N is the input height.
Figure 10 shows a 1 × 1 CONV example where a 3 × 6 × 6 input is
convolved with six 1 × 1 × 6 filters to produce a 3 × 6 × 6 output. Since
this convolution generates the psums by channel accumulation, the
outputs from multiple PE matrices are utilized. For the example
in Figure 10, the state controller data scheduling for PE matrices 0
and 1 is shown in Figure 11. It can be seen that the first three
channels of the input are convolved with the first three channels of
all the filters in PE matrix 0, whereas the last three channels of the
input are convolved with the last three channels of all the filters in
PE matrix 1. The time stamps during specific processing of input
and weights in the PE matrices are also shown in Figure 11. It should
be noted that for an input with more channels, the rest of the PE
matrices will also be used. Thus, by using the dataflow in Figure 11,
the architecture can process 18 channels concurrently by using the
6 PE matrices, with each PE matrix processing 3 input and filter
channels.
Figure 11: State Controller Load Operation
Figure 12: Dataflow Chart for 1×1 Convolution in Figure 10
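The channel partitioning described above can be checked with a small functional model. The sketch below uses plain integers rather than log-quantized data and ignores tiling and timing. It only shows that splitting the channels into groups of three (one group per PE matrix) and accumulating the partial sums reproduces an ordinary 1 × 1 convolution. Function names and the test data are ours.

```python
def conv1x1(inp, filters):
    """Reference 1x1 convolution: inp[c][y][x], filters[p][c] -> out[p][y][x]."""
    channels, height, width = len(inp), len(inp[0]), len(inp[0][0])
    return [[[sum(f[c] * inp[c][y][x] for c in range(channels))
              for x in range(width)]
             for y in range(height)]
            for f in filters]


def conv1x1_grouped(inp, filters, group=3):
    """Same result, computed one channel group at a time and accumulated."""
    channels, height, width = len(inp), len(inp[0]), len(inp[0][0])
    out = [[[0] * width for _ in range(height)] for _ in filters]
    for start in range(0, channels, group):  # one iteration per PE matrix
        chans = range(start, min(start + group, channels))
        for p, f in enumerate(filters):
            for y in range(height):
                for x in range(width):
                    # Channel accumulation of this group's partial sum.
                    out[p][y][x] += sum(f[c] * inp[c][y][x] for c in chans)
    return out


# 6 input channels of 6 rows by 3 columns, and 6 filters of 6 channels each.
inp = [[[c + 2 * y - x for x in range(3)] for y in range(6)] for c in range(6)]
filters = [[(p + 1) * (c - 2) for c in range(6)] for p in range(6)]
print(conv1x1(inp, filters) == conv1x1_grouped(inp, filters))
```

In the accelerator the groups run in parallel on separate PE matrices, whereas the loop here visits them one after another.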
e dataow chart for the PE matrix 0 for the example in Figure
10 is shown in Figure 12. e same dataow chart can also be
generated for the PEmatrix 1. Asmentioned earlier, the psums in 1×
1 convolution are calculated using channel-wise accumulation. e
eighteen outputs (o1-o18) generated by the individual PE matrices
are summed in their respective adder net 1s. e input connections
for adder net 1 (AN 1 0) and the channel accumulator (CA 0) of PE
Figure 13: (a) Channel-wise accumulation for PE matrix 0
(b) All channel-wise accumulations
Figure 14: (a) 5 × 5 Convolution Example (b) Input Load Op-
eration (c) Weight Load Operation
matrix 0 are shown in Figure 13(a). Here, o10 is the psum output
from the PE matrix 0 and o15 is the psum output from the PE matrix
5. Since the example in Figure 10 is small, it only requires the rst
two PE matrices and their outputs, that is, only o10-180 and o11-
181 are active. e output in Figure 10 is generated by using all
six adder net 1s and the channel accumulators as shown in Figure
13(b). e throughput for the above example is 108 OPS/cycle (total
OPS/total cycles = (6 × 6 × 3 × 6/6 = 108), which results in a 100%
overall thread utilization (108/(3 × 6 × 3 × 2))×100.
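The throughput and utilization figures quoted for the 3 × 3 example of Section 5.1 and the 1 × 1 example above follow the same two-step calculation, sketched here. The function name is ours. The operation and cycle counts are taken from the text, and the peak assumes one operation per thread per cycle.

```python
def thread_utilization(total_ops, total_cycles, pes, threads_per_pe=3):
    """Return (ops per cycle, utilization in percent)."""
    ops_per_cycle = total_ops / total_cycles
    peak_ops_per_cycle = pes * threads_per_pe  # every thread busy every cycle
    return ops_per_cycle, 100.0 * ops_per_cycle / peak_ops_per_cycle


# 3x3, stride 1: 360 operations in 8 cycles on one 6x3 PE matrix.
print(thread_utilization(360, 8, pes=6 * 3))
# 1x1: 6*6*3*6 operations in 6 cycles on two 6x3 PE matrices.
print(thread_utilization(6 * 6 * 3 * 6, 6, pes=6 * 3 * 2))
```

The peak term counts only the PE matrices that the example actually uses, which is how the text arrives at 100% for a layer that occupies two of the six matrices.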
5.3 Higher Order Convolutions
The proposed NeuroMAX accelerator is designed to optimize 3 × 3
and 1 × 1 convolutions. It can, however, also be used to accelerate
larger kernel sizes. [18] proposed a kernel decomposition method
such that additional support for 4 × 4 and 5 × 5 filters is needed
to implement any filter size. Figure 14(a) gives an example of a 5 × 5
convolution. As the size of the PE matrix is 6 × 3, a filter of width
greater than 3 or height greater than 6 needs multiple cycles to
calculate the output value. This can be seen in Figure 14(b) and
(c) where the last two columns of the input matrix and the weight
matrix are loaded at time stamp t = 2. Figure 15 shows the dataflow
chart which accounts for this configuration. The generated psums
(o1-o18) are provided to the adder net 1 as shown in Figure 16. For
this convolution, the output values are calculated as:
$$V_{a0}, V_{a2} = \left((o1 + o5 + o9) + (o10 + o14)\right)_{old} + (o1 + o5 + o9)_{new} \tag{9}$$
$$V_{a1} = \left((o4 + o8 + o12) + (o13 + o17)\right)_{old} + (o4 + o8 + o12)_{new} \tag{10}$$
In equations (9) and (10), the old value corresponds to the
convolution output from the first three columns of the input and the
weight matrix at t = 1, whereas the new value corresponds to the
last two columns at t = 2. The adder net 1 and the channel
accumulator configuration for this convolution is shown in Figure 16. A
similar configuration and dataflow chart is used for implementing
a 4 × 4 convolution. In addition to this, the CONV core can also
perform the pooling operation by choosing the appropriate stride and
kernel.
Figure 15: Dataflow Chart for 5 × 5 Convolution in Figure 14(a)
Figure 16: Adder Configuration for 5 × 5 Convolution
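The old/new split in equations (9) and (10) relies on the fact that a 5 × 5 multiply-accumulate can be separated by columns. The sketch below checks this for a single output value using plain integers. It does not model the psum wiring of Figure 16, and the function names and data are ours.

```python
def partial_sum(window, kernel, cols):
    """Multiply-accumulate over all rows and the given columns only."""
    return sum(window[r][c] * kernel[r][c]
               for r in range(len(kernel)) for c in cols)


def conv5x5_two_pass(window, kernel):
    """One 5x5 output value computed in two passes over a 3-column array."""
    old = partial_sum(window, kernel, range(0, 3))  # t = 1: first three columns
    new = partial_sum(window, kernel, range(3, 5))  # t = 2: last two columns
    return old + new


window = [[r * 5 + c for c in range(5)] for r in range(5)]
kernel = [[r - c for c in range(5)] for r in range(5)]
full = sum(window[r][c] * kernel[r][c] for r in range(5) for c in range(5))
print(conv5x5_two_pass(window, kernel) == full)
```

The second pass only fills two of the three columns, so part of the array idles at t = 2 for 5 × 5 filters.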
6 IMPLEMENTATION AND RESULTS
This section discusses the implementation of the proposed NeuroMAX
accelerator and presents the area cost, power consumption,
performance, throughput, and hardware utilization results. The
accelerator has been implemented on the PL side of a Xilinx Zynq-7020
SoC operating at 200 MHz. Figure 17 shows the cost comparison
between our multi-threaded log PE core and an area optimized
linear multiplier core with equal output bit precision and latency.
It can be seen that by choosing a thread count of 3 (shown as log
(3) in Figure 17), the LUT and FF cost is only 1.05× and 1.14× that
of the linear PE. Thus, a total of 108 multi-threaded log PEs would be
equivalent, in cost, to ≈122 linear PEs. For fairness, we will use
the cost adjusted PE number for performance comparison.
Table 1: Resource Utilization
Property    | Accelerator | Utilization
#LUTs       | 20680       | 38%
#FFs        | 17207       | 16%
#36kB BRAMs | 108         | 77%
Power       | 2.727 W     | NA
Figure 17: Linear vs Log PE LUT and FF Cost at 16-bit Precision
Table 1 shows the resource utilization of the implemented
accelerator core as well as the total power consumption (static +
dynamic). Figure 18 (a), (b) and (c) show the breakdown of LUT cost,
FF cost and power consumption among the different modules of the
accelerator. The PE grid and the adder net 0 combined have the
highest LUT and FF count (81% and 91%, respectively). The post
processing block consumes negligible resources. The processing
system (ARM core) dominates the power consumption (57%), while
the PE grid and adder net 0 have the second highest consumption
(26% of the total).
Figure 19 shows layer by layer hardware utilization for various
CNN architectures. We achieve an average utilization of 95%, 84%
and 86% for VGG-16, MobileNet v1 and ResNet-34, respectively.
The dip in hardware utilization in some layers of MobileNet and
ResNet-34 is because of stride 2 convolutions which utilize only
50% of the available PE cores. The low utilization in the first layer of
VGG16 is because it only has 3 channels and, since each PE matrix
processes one channel, the last 3 PE matrices remain idle, which
gives an exact utilization of 50%.
Figure 18: Breakdown of (a) LUT cost (b) FF cost (c) Power
Consumption for NeuroMAX
Table 2: Comparison of NeuroMAX with Previous Designs
Property                   | NeuroMAX      | [7]      | [8]       | [9]          | [10]      | [12]       | [15]
Technology                 | Zynq-7020 SoC | 65nm     | Zynq-7100 | Arria 10 SoC | 65nm      | Virtex-7   | 40nm
Precision (bits)           | 6-bit log     | 16-bit   | 32fp      | 16-bit       | 8-20 bits | 5-bit log  | 16-bit
PE number                  | 122 (adjusted) | 168     | 1926      | 1278         | 192       | 256        | 168
Processing clock (MHz)     | 200           | 200      | 100       | 133          | 200       | Unreported | 500
Peak Throughput (GOPS)     | 324           | 84       | 17.11     | 170.6        | 153.6     | Unreported | 168
Peak Throughput/PE         | 2.7 (adjusted) | 0.5     | 0.008     | 0.13         | 0.8       | Unreported | 1
Cost (LUTs (a), gates (b)) | 20.6k (a)     | 1176k (b) | 142k (a) | 66k (a)      | 2695k (b) | 29k (a)    | 266k (b)
Power (W)                  | 2.72          | 0.278    | 4.083     | Unreported   | 0.460     | 3.756      | 0.155
Figure 19: Hardware Utilization of NeuroMAX for (a) VGG-16
(b) MobileNet v1 (c) ResNet-34
Figure 20: PE count vs Utilization vs Throughput Comparison
of NeuroMAX with [15]
Chang et al. [15] recently presented an accelerator design with a
1D broadcast dataflow which promises higher utilization and throughput
(GOPS) than all the previous designs. We, therefore, compare
our design against [15] in Figure 20 for various CNNs. [15] uses a
total of 168 PE cores and provides a utilization of 99% with a throughput
of 166.32 GOPS, 93.4% with a throughput of 156.91 GOPS and 90.2%
with a throughput of 151.54 GOPS for VGG16, ResNet-34 and MobileNet,
respectively. We use 122 PE cores (cost adjusted), a 28% decrease
from [15], and provide a throughput of 307.8 GOPS, an 85% increase,
281.8 GOPS, a 79.4% increase, and 268.92 GOPS, a 77.4% increase,
for the three CNNs, respectively. This increase in throughput with
a lower PE count is attributed to our low cost, multi-threaded
PE core design and an efficient 2D dataflow. We also achieve somewhat
similar hardware utilization, that is, 94% for VGG16, 87.3%
for ResNet-34 and 83% for MobileNet. It should be noted that [15]
implements the accelerator on an ASIC, whereas we use an FPGA;
thus, an accurate comparison of LUT count, FF count and power
consumption cannot be made. It is, however, evident that the design
in [15], when ported to an FPGA, will have ≈31% more LUTs and FFs
owing to the larger number of PEs used.
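The throughput comparison above can be recomputed from the quoted numbers. The sketch below assumes, as our reading of the figures, that achieved throughput is approximately the peak throughput of Table 2 scaled by the average utilization. All values are copied from the text and Table 2, and the variable names are ours.

```python
PEAK_NEUROMAX = 324  # peak throughput of NeuroMAX from Table 2
PEAK_REF15 = 168     # peak throughput of [15] from Table 2

# network: ([15] throughput, [15] utilization in %, NeuroMAX throughput)
reported = {
    "VGG16": (166.32, 99.0, 307.8),
    "ResNet-34": (156.91, 93.4, 281.8),
    "MobileNet": (151.54, 90.2, 268.92),
}


def percent_increase(new, old):
    return 100.0 * (new - old) / old


for net, (ref_tp, ref_util, our_tp) in reported.items():
    implied_ref_tp = PEAK_REF15 * ref_util / 100.0  # utilization times peak
    implied_our_util = 100.0 * our_tp / PEAK_NEUROMAX
    print(net,
          "| [15] implied throughput:", round(implied_ref_tp, 2),
          "| NeuroMAX implied utilization (%):", round(implied_our_util, 1),
          "| increase (%):", round(percent_increase(our_tp, ref_tp), 1))
```

Small last-digit differences from the percentages quoted in the text can arise from rounding of the underlying measurements.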
Table 3: VGG16 Latency Comparison
Layer   | NeuroMAX (ms) | [7] (ms) | [15] (ms)
CONV1_1 | 1.35          | 38.0     | 2.57
CONV1_2 | 28.9          | 810.6    | 55.04
CONV2_1 | 14.4          | 405.3    | 27.43
CONV2_2 | 29.26         | 810.8    | 55.7
CONV3_1 | 14.54         | 204      | 27.7
CONV3_2 | 28.6          | 408.1    | 54.5
CONV3_3 | 28.7          | 408.1    | 54.6
CONV4_1 | 14.4          | 105.1    | 27.42
CONV4_2 | 29            | 210.0    | 55.23
CONV4_3 | 29.5          | 210.0    | 56.19
CONV5_1 | 7.24          | 48.3     | 13.79
CONV5_2 | 7.23          | 48.5     | 13.77
CONV5_3 | 7.11          | 48.5     | 13.54
Total   | 240.23        | 3755.3   | 457.5
Table 2 shows the comparison of our accelerator with previous
state of the art ASIC and FPGA designs. We see an improved
performance in terms of PE number, peak throughput and peak
throughput/PE ratio. Only [15] has a peak throughput/PE ratio
equal to unity, with an average around 0.85. Our peak throughput/PE
is 3, with an average around 2.7 after cost adjustment. The power
comparison reveals that the FPGA-based designs inherently consume
more power compared to ASICs. We can, however, see that NeuroMAX
consumes significantly less power and has lower cost in
terms of LUT count compared to other FPGA designs.
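The peak throughput/PE row of Table 2 is the peak throughput row divided by the PE number row. The short sketch below recomputes it from the table values. The design in [12] is left out because its throughput is unreported, and the dictionary layout is ours.

```python
# design: (PE number, peak throughput) copied from Table 2
table2 = {
    "NeuroMAX": (122, 324),  # PE number is the cost adjusted value
    "[7]": (168, 84),
    "[8]": (1926, 17.11),
    "[9]": (1278, 170.6),
    "[10]": (192, 153.6),
    "[15]": (168, 168),
}

ratios = {name: peak / pes for name, (pes, peak) in table2.items()}
unadjusted_neuromax = 324 / 108  # physical PE count before cost adjustment

for name, ratio in ratios.items():
    print(name, round(ratio, 3))
print("NeuroMAX (unadjusted)", unadjusted_neuromax)
```

The unadjusted ratio of 3 corresponds to the three threads per PE, and the cost adjustment from 108 to 122 PEs lowers it to roughly 2.7.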
Table 3 gives a layer-by-layer processing latency comparison
for VGG16. Both [7] and [15] benchmark the latency of their
accelerators on this CNN; therefore, we also evaluate and compare
NeuroMAX's performance on VGG16. It should be noted, however,
that [15] uses a 500 MHz processing clock in their design. For a fair
comparison, we make suitable adjustments to their reported values.
Our proposed NeuroMAX accelerator has a 93% and a 47% decrease in
latency when compared to [7] and [15], respectively, at a 200 MHz
clock.
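The totals and the latency reductions can be recomputed directly from the per-layer values of Table 3, as in the sketch below. The layer order follows the table, the [15] column holds the clock-adjusted values given there, and the helper name is ours.

```python
# Per-layer VGG16 latency in ms, CONV1_1 to CONV5_3, copied from Table 3.
latency_ms = {
    "NeuroMAX": [1.35, 28.9, 14.4, 29.26, 14.54, 28.6, 28.7,
                 14.4, 29, 29.5, 7.24, 7.23, 7.11],
    "[7]": [38.0, 810.6, 405.3, 810.8, 204, 408.1, 408.1,
            105.1, 210.0, 210.0, 48.3, 48.5, 48.5],
    "[15]": [2.57, 55.04, 27.43, 55.7, 27.7, 54.5, 54.6,
             27.42, 55.23, 56.19, 13.79, 13.77, 13.54],
}

totals = {name: sum(values) for name, values in latency_ms.items()}


def percent_decrease(reference, new):
    return 100.0 * (reference - new) / reference


for ref in ("[7]", "[15]"):
    print(ref, "total (ms):", round(totals[ref], 2),
          "| NeuroMAX decrease (%):",
          round(percent_decrease(totals[ref], totals["NeuroMAX"]), 1))
```

The three middle blocks of VGG16 (CONV2 to CONV4) account for most of the total latency in all three designs.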
7 CONCLUSION
This paper proposes NeuroMAX, a high throughput accelerator
using multi-threaded, log-based PE cores. The designed PE cores
are capable of providing a 200% increase in peak throughput while
only increasing the area overhead by 6%, when compared to a
standard multiplier-based PE core. We also design an efficient
2D weight broadcast dataflow scheme which exploits the multi-level
parallelism of our processing engine and enables hardware
utilization close to 100%. The accelerator is capable of performing
a wide variety of convolutions including standard and separable
3 × 3 stride 1 and 2, 4 × 4, 5 × 5 and 1 × 1 depthwise, required in
modern CNN architectures. We have implemented NeuroMAX on a
Xilinx Zynq-7020 SoC and have evaluated various performance
parameters. The design can provide at least a throughput increase
of 77.4% and a latency decrease of 47% with a 28% decrease in PE
count against recently proposed accelerator designs for modern
CNNs. NeuroMAX also provides at least a 27% and a 29% decrease
in power consumption and LUT count, respectively, against prior
FPGA-based CNN accelerators.
REFERENCES
[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification
with deep convolutional neural networks. In Proceedings of the 25th International
Conference on Neural Information Processing Systems - Volume 1, NIPS'12, pages
1097–1105, Red Hook, NY, USA, 2012. Curran Associates Inc.
[2] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for
large-scale image recognition, 2014.
[3] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir
Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going
deeper with convolutions. In Computer Vision and Pattern Recognition (CVPR),
2015.
[4] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun
Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets:
Efficient convolutional neural networks for mobile vision applications, 2017.
[5] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and
Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks, 2018.
[6] Mark Horowitz. 1.1 computing’s energy problem (and what we can do about it).
volume 57, pages 10–14, 02 2014.
[7] Y. Chen, T. Krishna, J. S. Emer, and V. Sze. Eyeriss: An energy-efficient
reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of
Solid-State Circuits, 52(1):127–138, 2017.
[8] Bing Liu, Danyin Zou, Lei Feng, Shou Feng, Ping Fu, and Junbao Li. An fpga-
based cnn accelerator integrating depthwise separable convolution. Electronics,
8(3):281, 2019.
[9] L. Bai, Y. Zhao, and X. Huang. A cnn accelerator on fpga using depthwise
separable convolution. IEEE Transactions on Circuits and Systems II: Express
Briefs, 65(10):1415–1419, 2018.
[10] Yu-Hsin Chen, Tien-Ju Yang, Joel Emer, and Vivienne Sze. Eyeriss v2: A flexible
accelerator for emerging deep neural networks on mobile devices, 2018.
[11] Daisuke Miyashita, Edward H. Lee, and Boris Murmann. Convolutional neural
networks using logarithmic data representation, 2016.
[12] Sebastian Vogel, Mengyu Liang, Andre Guntoro, Walter Stechele, and Gerd
Ascheid. Efficient hardware acceleration of cnns using logarithmic data
representation with arbitrary log-base. In Proceedings of the International Conference
on Computer-Aided Design, ICCAD '18, New York, NY, USA, 2018. Association
for Computing Machinery.
[13] Y. Huan, J. Xu, L. Zheng, H. Tenhunen, and Z. Zou. A 3d tiled low power accel-
erator for convolutional neural network. In 2018 IEEE International Symposium
on Circuits and Systems (ISCAS), pages 1–5, 2018.
[14] J. Jo, S. Kim, and I. Park. Energy-efficient convolution architecture based on
rescheduled dataflow. IEEE Transactions on Circuits and Systems I: Regular Papers,
65(12):4196–4207, 2018.
[15] K. Chang and T. Chang. Vwa: Hardware efficient vectorwise accelerator for
convolutional neural network. IEEE Transactions on Circuits and Systems I:
Regular Papers, 67(1):145–154, 2020.
[16] Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J.
Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer
parameters and <0.5mb model size, 2016.
[17] V. Sze, Y. Chen, T. Yang, and J. S. Emer. Efficient processing of deep neural
networks: A tutorial and survey. Proceedings of the IEEE, 105(12):2295–2329,
2017.
[18] Y. Lin and T. S. Chang. Data and hardware efficient design for convolutional
neural network. IEEE Transactions on Circuits and Systems I: Regular Papers,
65(5):1642–1651, 2018.
