HCM: Hardware-Aware Complexity Metric for Neural Network Architectures by Karbachevsky, Alex et al.
Alex Karbachevsky †∗ Chaim Baskin †∗ Evgenii Zheltonozshkii †∗ Yevgeny Yermolin †
Freddy Gabbay ◦ Alex M. Bronstein † Avi Mendelson †
†Technion – Israel Institute of Technology, Haifa, Israel
◦Ruppin Academic Center, Haifa, Israel,
{alex.k, chaimbaskin, evgeniizh}@campus.technion.ac.il
{yevgeny.ye, bron, avi.mendelson}@cs.technion.ac.il
{freddyg}@ruppin.ac.il
ABSTRACT
Convolutional Neural Networks (CNNs) have become com-
mon in many fields including computer vision, speech recog-
nition, and natural language processing. Although CNN
hardware accelerators are already included as part of many
SoC architectures, the task of achieving high accuracy on
resource-restricted devices is still considered challenging,
mainly due to the vast number of design parameters that need
to be balanced to achieve an efficient solution. Quantization
techniques, when applied to the network parameters, lead
to a reduction of power and area and may also change the
ratio between communication and computation. As a result,
some algorithmic solutions may suffer from lack of memory
bandwidth or computational resources and fail to achieve the
expected performance due to hardware constraints. Thus, the
system designer and the micro-architect need to understand at
early development stages the impact of their high-level decisions
(e.g., the architecture of the CNN and the number of bits
used to represent its parameters) on the final product (e.g., the
expected power saving, area, and accuracy). Unfortunately,
existing tools fall short of supporting such decisions.
This paper introduces a hardware-aware complexity metric
that aims to assist the system designer of neural network
architectures throughout the entire project lifetime (especially
at its early stages) by predicting the impact of architectural and
micro-architectural decisions on the final product. We demonstrate
how the proposed metric can help evaluate different
design alternatives of neural network models on resource-restricted
devices such as real-time embedded systems, and
avoid design mistakes at early stages.
1. INTRODUCTION
Domain-specific systems were found to be very efficient,
in general, and when developing constrained devices such
as IoT, in particular. A system architect of such devices
must consider hardware limitations (e.g., bandwidth and local
memory capacity), algorithmic factors (e.g., accuracy and
representation of data), and system aspects (e.g., cost, power
∗Equal contribution.
Figure 1: Our 3×3 kernel 8-bit processing engine (PE) layout
using the TSMC 28nm technology. The carry-save adder can
fit 12-bit numbers, which is large enough to store the output
of the convolution.
envelope, battery life, and more). Many IoT and other
resource-constrained devices provide support for applications that use
convolutional neural networks (CNNs). Such algorithms can
achieve spectacular performance in various tasks covering a
wide range of domains such as computer vision, medicine,
autonomous vehicles, etc. Notwithstanding, CNNs contain a
vast number of parameters and require a significant amount of
computation during inference, thus monopolizing hardware
resources and demanding massively parallel computation
engines; see the example shown in Fig. 1.
These requirements have led to great interest in using
custom-designed hardware for efficient inference of CNNs
that would allow the promise of neural networks to be used in
real-life applications by deploying them on low-power edge
devices. Developing such systems requires a new set of de-
sign tools due to the tight entanglement between the algorith-
mic aspects, the chip architecture and the constraints the end
[arXiv:2004.08906v2 [cs.LG] 26 Apr 2020]
product needs to meet. In particular, great efforts were made
to develop low-resource CNN architectures [14, 24, 27, 33].
One example of such architectural changes is the splitting
of the regular 3×3 convolutions into a channel-wise 3×3
convolution followed by a 1×1 one. Another way to reduce
the computational burden is to quantize the CNN parameters,
weights and activations, employing low-bit integer repre-
sentation of the data instead of the expensive floating point
representation. Recent quantization-aware training schemes
[8, 10, 16, 34, 35] achieve near-baseline accuracy for as low
as 2-bit quantization. The benefit of quantizing the CNN is
twofold: both the number of gates required for each multiply-
accumulate (MAC) operation and the amount of routing are
reduced. The decision regarding which algorithm to choose
may depend on the architecture (e.g., FPGA or ASIC), the
accuracy requirements, and their impact on performance and
power. Thus, the architect needs to make these fundamental
decisions early in the development process, yet no existing tool
can help predict these design factors ahead of time.
The impact of the high-level structure of the accelerator,
e.g., the type of CNN levels and the representation of the
operands, on the power, the area and the performance of
the final product needs to be defined and predicted at an
early stage of the project. Recent research has shown that
ASIC-based architectures are the most efficient solution for
CNN accelerators both in datacenters [6, 17, 22] and in real-
time platforms [5, 11, 25]. Accordingly, we demonstrate the
proposed metric and design tool on an implementation of
a streaming [23] ASIC-based convolutional engine. Never-
theless, our methodology can be applied for the evaluation
of other types of architectures, such as FPGA-based accel-
erators [1, 2, 29]. In both cases, the development process
includes an important trade-off between the logical gates
area, their routing on the silicon and the performance of the
resulting system. Unfortunately, all these parameters also
depend on the representation of the data, and its impact on
both communication and computation. To date, there is no
quantitative metric for this trade-off available at the design
stage of the CNN accelerator and no tool exists that can assist
the architect to predict the impact of high level decisions on
the important design parameters. Ideally, the designer would
like to have an early estimation of the chip resources required
by the accelerator as well as the performance, accuracy and
power it can achieve.
A critical difficulty in trying to predict the design parame-
ters for CNN-based systems is the lack of a proper complexity
metric. Currently, the most common metric for calculating
the computational complexity of CNN algorithms is the num-
ber of MAC operations denoted as OPS (or FLOPS in case
of floating-point operations). This metric, however, does not
take into account the data format or additional operations per-
formed during the inference, such as memory accesses and
communication. For that reason, the number of FLOPS does
not necessarily correlate with runtime [18] or the required
amount of computational resources. This paper proposes to
use a different metric for assessing the complexity of CNN-
based architectures: the number of bit operations (BOPS)
as presented by Baskin et al. [3]. We show that BOPS is
well-suited to the task of comparing different architectures
with different weight and activation bitwidths.
Contribution.
This paper makes the following contributions: Firstly, we
study the impact of CNN quantization on the hardware imple-
mentation in terms of computational resources and memory
bandwidth considerations. Specifically, we study a single
layer in the neural network.
Secondly, we extend the previously proposed computation
complexity for quantized CNNs, termed BOPS [3], with a
communication complexity analysis to identify the perfor-
mance bottlenecks that may arise from the data movement.
Thirdly, we extend the roofline model [32] to accommodate
this new notion of complexity. We also demonstrate how this
tool can be used to assist architecture-level decisions at the
early design stages.
Lastly, we implement a basic quantized convolution block
with various bitwidths on 28nm processes to demonstrate
an accurate estimation of the power/area of the hardware
accelerator. This allows changing high-level design decisions
at early stages and saves the designer from major mistakes
that otherwise would be discovered too late. We compare our
estimations with previous approaches and show significant
improvement in accuracy of translation between algorithmic
complexity to hardware resource utilization.
The rest of the paper is organized as follows: Section 2
reviews the related work; Section 3 describes a proposed
hardware-aware complexity metric; Section 4 provides roof-
line analysis of CNN layer design using the proposed metric;
Section 5 provides the experimental results using common
CNN architecture and Section 6 concludes the paper.
2. RELATED WORK
In this section, we provide an overview of prior work
that proposed metrics for estimating the complexity and
power/energy consumption of different workloads, focus-
ing on neural networks. The most commonly used metric
for evaluating computational complexity is FLOPS [19]: the
amount of floating point operations required to perform the
computation. In the case of integer operations, the obvious
generalization of FLOPS is OPS, which is just the number
of operations. A fundamental limitation of these metrics is
the assumption that the same data representation is used for
all operations; otherwise, the calculated complexity does not
reflect the real one. Wang et al. [31] claim that FLOPS is an
inappropriate metric for estimating the performance of work-
loads executed in datacenters and proposed a basic operations
metric that uses a roofline-based model, taking into account
the computational and communication bottlenecks for more
accurate estimation of the total performance.
In addition to general-purpose metrics, other metrics were
developed specifically for evaluation of neural network com-
plexity. Mishra et al. [20] define the “compute cost” as the
product of the number of fused multiply-add (FMA) operations
and the sum of the width of the activation and weight
operands, without distinguishing between floating- and fixed-
point operations. Using this metric, the authors claimed to
have reached a 32× “compute cost” reduction by switch-
ing from FP32 to binary representation. Still, as we show
further in our paper, this is a rather poor estimate for the
hardware resources/area needed to implement the compu-
tational element. Jiang et al. [15] note that a single metric
cannot comprehensively reflect the performance of deep
learning (DL) accelerators. They investigate the impact of
various frequently used hardware optimizations on a typical
DL accelerator and quantify their effects on accuracy
and throughput under representative DL inference workloads.
Their major conclusion is that high hardware throughput is
not necessarily highly correlated with the end-to-end high
inference throughput of data feeding between host CPUs and
AI accelerators. Finally, Baskin et al. [3] propose to gener-
alize FLOPS and OPS by taking into account the bitwidth
of each operand as well as the operation type. The resulting
metric, named BOPS (binary operations), allows area estima-
tion of quantized neural networks including cases of mixed
quantization.
The aforementioned metrics do not provide any insight on
the amount of silicon resources needed to implement them.
Our work, accordingly, functions as a bridge between the
CNN workload complexity and the real power/area estima-
tion.
3. COMPLEXITY METRIC
In this section, we describe our hardware-aware complexity
metric (HCM), which takes into account the CNN topology,
and define the design rules of efficient implementation of
quantized neural networks. The HCM metric assesses two
elements: the computation complexity, which quantifies the
hardware resources needed to implement the CNN on silicon,
and the communication complexity, which defines the mem-
ory access pattern and bandwidth. We describe the changes
resulting from switching from a floating-point representation
to a fixed-point one, and then present our computation and
communication complexity metrics. All results for the fixed-
point multiplication presented in this section are based on the
Synopsys standard library multiplier using TSMC’s 28nm
process.
3.1 The impact of quantization on hardware
implementation
Currently, the most common representation of weights and
activations for training and inference of CNNs is either 32-
bit or 16-bit floating-point numbers. The fixed-point MAC
operation, however, requires significantly fewer hardware
resources, even for the same input bitwidth. To illustrate this
fact, we generated two multipliers: one for 32-bit floating-point
(footnote 1) and the other for 32-bit fixed-point operands. The
results in Table 1 show that a fixed-point multiplier uses
approximately eight times less area, gates, and power than the
floating-point counterpart. Next, we generated a convolution
with a $k \times k$ kernel, a basic operation in CNNs consisting of
$k^2$ MAC operations per output value. After switching from
floating-point to fixed-point, we explored the area of a single
processing engine (PE) with variable bitwidth. Note that the
accumulator size depends on the network architecture: the maximal
bitwidth of the output value is $b_w b_a + \log_2(k^2) + \log_2(n)$,
where $n$ is the number of input features. Since the extreme values
are very rare, however, it is often possible to reduce the
accumulator width without harming the accuracy of the network
[6].
1FPU100 from https://opencores.org/projects/fpu100
[Figure 2 plot: PE area vs. bitwidth (0 to 16 bits); series: PE area, quadratic fit]
Figure 2: Area (A) vs. bitwidth (b) for a 3×3 PE with a
single input and output channel. All weights and activations
use the same bitwidth and the accumulator width is 4 bits
larger, which is enough to store the result. The quadratic fit
is $A = 12.39b^2 + 86.07b - 14.02$ with goodness of fit
$R^2 = 0.9999877$.
Fig. 2 shows the silicon area of the PE as a function of
the bitwidth. We performed a polynomial regression and observed
a quadratic dependence of the PE area on the bitwidth,
with the coefficient of determination $R^2 = 0.9999877$. This
nonlinear dependency demonstrates that the impact of quantization
on a network's hardware resources is quadratic: reducing the bitwidth
of the operands by half reduces area and, by proxy, power
approximately by a factor of four (contrary to what is assumed
by, e.g., Mishra et al. [20]).
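The quadratic dependence can be checked numerically. The following is a minimal sketch that evaluates the fit reported in the caption of Fig. 2; the function names are illustrative and not part of the original work. Because the fit also has a sizeable linear term, the area ratio obtained by halving the bitwidth stays below four inside the fitted range and approaches four only as the quadratic term dominates; evaluating the fit beyond 16 bits is an extrapolation.

```python
def pe_area_quadratic(bitwidth: float) -> float:
    """Area of the 3x3 PE according to the quadratic fit from Fig. 2.

    Coefficients are copied from the figure caption; the area unit is
    whatever unit the figure uses.
    """
    return 12.39 * bitwidth**2 + 86.07 * bitwidth - 14.02


def halving_ratio(bitwidth: float) -> float:
    """Ratio between the PE area at `bitwidth` and at half that bitwidth."""
    return pe_area_quadratic(bitwidth) / pe_area_quadratic(bitwidth / 2)


if __name__ == "__main__":
    for b in (8, 16):
        print(f"b={b:2d}: area={pe_area_quadratic(b):8.1f}, "
              f"area(b)/area(b/2)={halving_ratio(b):.2f}")
```

A purely linear cost model would give a ratio of two for every bitwidth, so even the lower ratios at small bitwidths are well above what a linear estimate predicts.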
3.2 Computation
We now present the BOPS metric defined in Baskin et al.
[3] as our computation complexity metric. In particular, we
show that BOPS can be used as an estimator for the area
of the accelerator. The area, in turn, is found to be linearly
related to the power in case of the PEs.
The computation complexity metric describes the amount
of arithmetic “work” needed to calculate the entire network
or a single layer. BOPS is defined as the number of bit opera-
tions required to perform the calculation: the multiplication
of n-bit number by m-bit number requires n ·m bit operations,
while addition requires max(n,m) bit operations. In particular,
Baskin et al. [3] show that a $k \times k$ convolutional layer
with $b_a$-bit activations and $b_w$-bit weights requires
$$\mathrm{BOPS} = m n k^2 \left( b_a b_w + b_a + b_w + \log_2(n k^2) \right) \tag{1}$$
bit operations, where n and m are, respectively, the number
of input and output features of the layer. The formula takes
into account the width of the accumulator required to accom-
modate the intermediate calculations, which depends on n.
The BOPS of an entire network is calculated as a sum of
the BOPS of the individual layers. Creating larger accelera-
tors that can process more layers in parallel involves simply
replicating the same individual PE design.
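As an illustration, Eq. (1) and the per-network sum can be written as a short script. This is a sketch only: the function names and the dictionary-based layer description are assumptions made for the example, not an interface defined by the paper.

```python
import math


def conv_bops(n: int, m: int, k: int, b_a: int, b_w: int) -> float:
    """BOPS of a k x k convolutional layer following Eq. (1).

    n, m : number of input and output features
    b_a  : activation bitwidth
    b_w  : weight bitwidth
    """
    return m * n * k**2 * (b_a * b_w + b_a + b_w + math.log2(n * k**2))


def network_bops(layers) -> float:
    """BOPS of a network: the sum of the BOPS of its layers."""
    return sum(conv_bops(**layer) for layer in layers)


if __name__ == "__main__":
    # A single 3x3 PE with one input and one output channel, 8-bit operands.
    print(conv_bops(n=1, m=1, k=3, b_a=8, b_w=8))
    # Two hypothetical layers, used only to show the summation.
    layers = [
        dict(n=64, m=64, k=3, b_a=8, b_w=8),
        dict(n=64, m=128, k=3, b_a=4, b_w=4),
    ]
    print(network_bops(layers))
```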
In Fig. 3, we calculated BOPS values for the PEs from
Fig. 2 and plotted them against the area. We conclude that
for a single PE with variable bitwidth, BOPS can be used to
predict the PE area with high accuracy.
Table 1: 32-bit floating-point and 32-bit fixed-point multiplier hardware complexity in terms of the number of gates, area, and
power.
                                            Power [mW]
Multiplier      Gates  Cells  Area [µm²]  Internal  Switching  Leakage  Dynamic
Floating-Point  40090  17175  11786       2.76      1.31       0.43     10.53
Fixed-Point     5065   1726   1489        0.49      0.32       0.04     1.053
[Figure 3 plot: area vs. BOPS (0 to 3500) with linear fit; points for (bw, ba) pairs (4, 2), (6, 4), (8, 6), (10, 8), (12, 10), (16, 14), (4, 4), (6, 6), (8, 8), (10, 10), (12, 12), (16, 16)]
Figure 3: Area (A) vs. BOPS (B) for a 3×3 PE with a single
input and output channel and variable bitwidth. The linear fit
is $A = 1.694B + 153.46$ with goodness of fit $R^2 = 0.9989$.
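The linear fit from Fig. 3 can be combined with Eq. (1) to turn layer parameters into an early area estimate. The sketch below assumes the fit coefficients from the caption of Fig. 3, which were obtained for a single 3×3 PE with one input and one output channel; the helper names are hypothetical, and a different PE design would need its own coefficients.

```python
import math


def conv_bops(n: int, m: int, k: int, b_a: int, b_w: int) -> float:
    """BOPS of a k x k convolutional layer following Eq. (1)."""
    return m * n * k**2 * (b_a * b_w + b_a + b_w + math.log2(n * k**2))


def pe_area_from_bops(bops: float, slope: float = 1.694,
                      intercept: float = 153.46) -> float:
    """Area estimate from the linear fit A = slope * B + intercept (Fig. 3)."""
    return slope * bops + intercept


if __name__ == "__main__":
    for b_w, b_a in [(4, 4), (8, 8), (16, 16)]:
        bops = conv_bops(n=1, m=1, k=3, b_a=b_a, b_w=b_w)
        print(f"bw={b_w:2d}, ba={b_a:2d}: BOPS={bops:8.1f}, "
              f"estimated area={pe_area_from_bops(bops):8.1f}")
```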
Next, we tested the predictive power of BOPS scaling with
the size of the design. We generated several designs with
variable bitwidths, $b_w = b_a \in \{4, 6, 8\}$, and variable numbers
of PEs, $n = m \in \{4, 8, 16\}$, used to accommodate multidimensional
inputs and outputs that typically arise in real CNN
layers. Fig. 4 shows that the area depends linearly on
BOPS over two orders of magnitude of total area,
with goodness of fit $R^2 = 0.9980$. We conclude that since
BOPS provides a high-accuracy approximation of the area
and power required by the hardware, it can be used as an
[Figure 4 plot: area vs. BOPS (log-log, 10^4 to 10^6) with linear fit; series: 4 bit, 6 bit, 8 bit; n=m=4, n=m=8, n=m=16, n=m=32]
Figure 4: Area (A) vs. BOPS (B) for a 3×3 PE with variable
input (n) and output (m) feature dimensions, and variable
bitwidth. Weights and activations use the same bitwidth and
the accumulator width is set to $\log_2(9m) \cdot b_w \cdot b_a$.
early estimator. While the area of the accelerator depends on
the particular design of the PE, this only affects the slope of
the linear fit, since the area is still linearly dependent on the
number of PEs. An architect who works only at the algorithmic
level can use quantities such as the number of input and
output features and the kernel size to obtain an early estimate
of the power needed to run the network, without any prior
knowledge of VLSI constraints. The same quantities make it
possible to immediately assess the area the PEs occupy
on the silicon.
3.3 Communication
Another important aspect of hardware implementation of
CNN accelerators is memory communication. The transmis-
sion of data from the memory and back is often overlooked by
hardware implementation papers [1, 5, 28] that focus on the
raw calculation ability to determine the performance of their
hardware. In many cases, there is a difference between the
calculated performance and real-life performance, since real-
life implementations of accelerators are often memory-bound
[17, 21, 30].
For each layer, the total memory bandwidth is the sum of
the activation and weight sizes read and written from memory.
In typical CNNs used, e.g., in vision tasks, the first layers
consume most of their bandwidth for activations, whereas
in deeper layers that have smaller but higher-dimensional
feature maps (and, consequently, a bigger number of kernels),
weights are the main source of memory communication.
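The shift from activation-dominated to weight-dominated traffic can be illustrated with a small calculation. This is a sketch under simplifying assumptions that are not stated in the text: every weight and every activation crosses the memory bus exactly once, outputs are stored at the activation bitwidth, and biases are ignored. The function name and the dictionary layout are illustrative. The two layer shapes are the ResNet-18 layers used in Section 4.

```python
def layer_traffic_bits(n: int, m: int, k: int, height: int, width: int,
                       b_a: int, b_w: int) -> dict:
    """Bits moved over the memory bus for one k x k convolutional layer.

    Assumes each tensor is transferred exactly once (no re-fetching).
    """
    return {
        "weights": n * m * k * k * b_w,
        "input_activations": n * height * width * b_a,
        "output_activations": m * height * width * b_a,
    }


if __name__ == "__main__":
    early = layer_traffic_bits(n=64, m=64, k=3, height=56, width=56, b_a=8, b_w=8)
    deep = layer_traffic_bits(n=256, m=256, k=3, height=14, width=14, b_a=8, b_w=8)
    for name, traffic in (("early layer", early), ("deep layer", deep)):
        total = sum(traffic.values())
        share = traffic["weights"] / total
        print(f"{name}: total={total} bits, weight share={share:.0%}")
```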
We assume that each PE can calculate one convolution
result per clock cycle and the resulting partial sum is saved in
the cache. In Fig. 5, we show typical memory access progress
at the beginning of the convolutional layer calculation. In the
first stage, the weights and the first k rows of the activations
are read from memory at maximal possible speed to start
the calculations as soon as possible. After the initial data
are loaded, the unit reaches a “steady state”, in which it
needs to read from the memory only one new input value
per clock cycle (other values are already in the cache). We
assume the processed signals to be two-dimensional (images),
which additionally requires k new values to be loaded in the
beginning of each new row.
Note that until the weights and the first activations are
loaded, no calculations are performed. The overhead band-
width of the pre-fetch stage can be mitigated by doing work
in greater batch sizes, loading the weights once and reading
several inputs for the same weights. By doing this, we mini-
mize the penalty for reading the weights compared to reading
the actual input data to perform the calculation. In the case of
real-time processing, however, larger batches are not possible
[Figure 5 diagram: bandwidth vs. clock cycles; regions labelled "Weights and Activations" and "Steady state - Activations"]
Figure 5: Per-layer memory access pattern
because the stream of data needs to be completed on-the-fly.
We focus on the latter real-time streaming regime in this pa-
per because of its great importance in a range of applications
including automotive, security, and finance. The memory
access pattern depicted in Fig. 5 must be kept in mind when
designing the hardware, since it may limit the performance
of the accelerator and decrease its power efficiency.
4. ROOFLINE ANALYSIS
So far, we discussed the use of BOPS for the prediction
of the physical parameters of the final product, such as the
expected power and area. In this section, we extend the
BOPS model to a system level, by introducing the OPS-based
roofline model. The traditional roofline model, as introduced
by Williams et al. [32], suggests depicting the dependencies
between the performance (e.g., GFLOPS/second) and the
operation density (the average number of operations per in-
formation unit transferred over the memory bus). Now, for
each machine we can draw “roofs”: the horizontal line that
represents its computational bounds and the diagonal line that
represents its maximal memory bandwidth. An example of
the roofline for three applications assuming infinite compute
resources and memory bandwidth is shown in Fig. 6. The
maximum performance a machine can achieve for any appli-
cation is visualized by the area below both bounds, shaded in
green.
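The two roofs described above amount to taking the minimum of the compute peak and the product of memory bandwidth and operational intensity. The following sketch shows that rule in its generic form; the function names and the numbers in the example are illustrative and do not describe any accelerator from this paper.

```python
def attainable_performance(peak_ops_per_s: float, bandwidth_per_s: float,
                           intensity: float) -> float:
    """Roofline bound: min(compute roof, memory roof).

    intensity is the number of operations per unit of transferred data,
    expressed in the same data unit as bandwidth_per_s.
    """
    return min(peak_ops_per_s, bandwidth_per_s * intensity)


def limiting_resource(peak_ops_per_s: float, bandwidth_per_s: float,
                      intensity: float) -> str:
    """Report which roof limits a workload of the given intensity."""
    if bandwidth_per_s * intensity < peak_ops_per_s:
        return "memory"
    return "compute"


if __name__ == "__main__":
    peak, bandwidth = 100.0, 10.0  # hypothetical machine
    for intensity in (5.0, 20.0):
        print(intensity,
              attainable_performance(peak, bandwidth, intensity),
              limiting_resource(peak, bandwidth, intensity))
```

The intensity at which the two roofs meet, `peak / bandwidth`, separates memory-bound workloads (to its left) from compute-bound ones (to its right).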
OPS-based roofline model.
Since, as indicated in Section 3.1, FLOPS cannot be used
for efficient estimation of the complexity of quantized CNNs,
we introduce a new model that is based on the BOPS metric
presented in Section 3.2. This model, to which we refer
as the OPS-based roofline model, replaces the GFLOPS/s
axis of the roofline plot with a performance metric better
suited to neural networks, the number of operations per
second (OPS/s), and replaces the operational intensity axis
with the number of operations per bit (OPS/bit).
Using generic operations and bits allows plotting quantized
accelerators with different bitwidths on the same plot.
As an example of the proposed approach, we use two differ-
ent ResNet-18 layers (a deep layer, which is computationally-
intensive, and an early one, which is memory-intensive) on
four different accelerator designs: 32-bit floating-point, 32-
bit fixed-point, and quantized 8-bit and 4-bit fixed-point. The
accelerators were implemented using standard ASIC design
tools, as detailed in Section 5 and were built using the TSMC
28nm technology, using standard 2.4GHz DDR-4 memory
[Figure 6 plot: performance [GFLOPS/s] vs. operational intensity [FLOPS/byte] (log-log); lines: memory bound, computation bound; points: App 1, App 2, App 3]
Figure 6: Roofline example. In the case of App1, memory
bandwidth prevents the program from achieving its expected
performance. In the case of App2, the same happens due to
limited computational resources. Finally, App3 represents a
program that could achieve its maximum performance on a
given system.
with a 64-bit data bus.
The first example employs an accelerator with a silicon
area of 1 mm² and 800 MHz clock speed. The task is the 11th
layer of ResNet-18, which has a 3×3 kernel and 256 input
and output features of dimension 14×14 each. Looking at
Table 1, it is possible to fit only 85 32-bit floating-point
multipliers in 1 mm². That allows installation of 9 PEs (without
taking into account the area required for the accumulators
of the partial sums) and calculation of convolutions with the
3×3×3×3 kernel in a single clock. Using the known areas
of 4-bit, 8-bit and 16-bit PEs, we extrapolate the area of the
32-bit fixed-point PE to be 16676 µm². From these data, we
can place 60 PEs with 7×7×3×3 kernels, 220 PEs with
14×14×3×3 kernels and 683 PEs with 26×26×3×3 kernels,
for 32-bit, 16-bit and 8-bit fixed-point PEs, respectively,
on the given area.
To calculate the OPS/s required by the layer,
under the assumption that a full single pixel is produced every
clock, we need to calculate the number of MAC operations
required to calculate one output pixel, $n \times m \times (k^2 + 1)$, and
multiply it by the accelerator frequency. To calculate the
OPS/bit for each design, we divide the number of MAC operations
in the layer by the total number of bits transferred over
the memory bus, which includes the weights, the input and
the output activations. The layer requires 524.288 TOPS/s to
be calculated without stalling for memory access and computation.
The available performance of the accelerators is
summarized in Table 2 and visualised using the proposed
OPS-based roofline analysis in Fig. 7.
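The arithmetic behind these numbers can be sketched as follows. The function names are illustrative. Two details are assumptions chosen because they reproduce the values quoted in the text and in Table 2, not because the text states them explicitly: the layer total uses the same $(k^2 + 1)$ factor as the per-pixel count, and weights, input activations and output activations all use a single bitwidth and cross the bus once.

```python
def required_ops_per_second(n: int, m: int, k: int, freq_hz: float) -> float:
    """OPS/s needed to produce one full output pixel per clock."""
    return n * m * (k**2 + 1) * freq_hz


def provided_ops_per_second(pe_in: int, pe_out: int, k: int,
                            freq_hz: float) -> float:
    """OPS/s of an accelerator with pe_in x pe_out PEs of size k x k."""
    return pe_in * pe_out * (k**2 + 1) * freq_hz


def ops_per_bit(n: int, m: int, k: int, height: int, width: int,
                bits: int) -> float:
    """Operations in the layer divided by the bits moved over the bus."""
    ops = n * m * (k**2 + 1) * height * width
    values = n * m * k**2 + n * height * width + m * height * width
    return ops / (values * bits)


if __name__ == "__main__":
    # 11th layer of ResNet-18: 3x3 kernel, 256 features of 14x14, 800 MHz.
    print(required_ops_per_second(256, 256, 3, 800e6) / 1e12, "TOPS/s")
    for bits in (32, 16, 8):
        print(bits, "bit:", round(ops_per_bit(256, 256, 3, 14, 14, bits), 2))
    # PE arrays of 3x3, 7x7, 14x14 and 26x26 from the text.
    for side in (3, 7, 14, 26):
        print(side, provided_ops_per_second(side, side, 3, 800e6) / 1e9, "GOPS/s")
```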
In this example, the application’s requirements are out
of the scope of the product definition. On one hand, all
accelerators are computationally bound (all horizontal lines
are below the application’s requirements), indicating that
we do not have enough PEs to calculate the layer in one
run. On the other hand, even if we decide to increase the
computational density by using stronger quantization or by
increasing the silicon area (and the cost of the accelerator),
we would still hit the memory bound (represented by the
Table 2: The amount of computation (OPS/s) provided by the
accelerators and memory throughput (OPS/bit) required by
the 11th layer of ResNet-18.
         32-bit  32-bit  16-bit  8-bit
         float   fixed   quant.  quant.
GOPS/s   72.00   392.0   1568    5408
OPS/bit  5.82    5.82    11.63   23.26
diagonal line). In this case, the solution should be found at
the algorithmic level or by changing the product's targets;
e.g., we can calculate the layer in parts, increase the silicon
area while decreasing the frequency in order not to hit the
memory wall, or decide to use another algorithm.
Our second example explores the feasibility of implement-
ing the second layer of ResNet-18 that has a 3×3 kernel and
64 input and output features of dimension 56×56. For this
example, we increase the silicon area to 6 mm² and lower the
frequency to 100 MHz, as proposed earlier, and add a 4-bit
quantized accelerator for comparison purposes. The layer
requires 4.1 TOPS/s. The accelerator results are summarized in
Table 3 and visualised with the OPS-based roofline analysis
in Fig. 8.
From Fig. 8 we can see that our 32-bit and 16-bit accel-
erators are still computationally bound, while the 8-bit and
4-bit quantized accelerators meet the demands of the layer.
In particular, the 8-bit accelerator is located at the border of
computational ability, meaning this solution has nearly opti-
mal resource allocation, since the hardware is fully utilized.
Still, the final choice of the configuration depends on other
parameters such as the accuracy of the CNN.
Both examples demonstrate that decisions made at early
stages have a critical impact on the quality of the final prod-
uct. For example, applying an aggressive quantization to
the network or increasing the silicon size may not improve
the overall performance of the chip if its performance is
memory-bound. From the architect’s point of view, it is im-
portant to balance between the computation and data transfer.
[Figure 7 plot: OPS/s (10^11 to 10^15) vs. OPS/bit (log-log); lines: memory bound, 32 bits fp, 32 bits, 16 bits, 8 bits]
Figure 7: OPS roofline: 3×3 kernel, 256 input and output
14×14 features, 1 mm² accelerator with 800 MHz frequency,
with DDR of 2.4 GHz with 64-bit data bus.
Table 3: The amount of computation (OPS/s) provided by the
accelerators and memory throughput (OPS/bit) required by
the second layer of ResNet-18.
         32-bit  32-bit  16-bit  8-bit   4-bit
         float   fixed   quant.  quant.  quant.
GOPS/s   49.00   324.0   1296    3969    11236
OPS/bit  9.16    9.16    18.32   36.64   73.27
Nonetheless, this balance can be achieved in different ways:
at the micro-architecture level, at the algorithmic level or
by changing the data representation. The architect may also
consider (1) changing the hardware to provide faster communication
(requires more power and is more expensive), (2)
applying communication bandwidth compression algorithms
[4, 7], (3) using fewer bits to represent weights
and activations (using 3- or 4-bit representation may solve
the communication problem, at the cost of reducing the expected
accuracy), or (4) changing the algorithm to transfer
data slower (even though that solves the bandwidth issue,
the possible drawback is a reduced throughput of the whole
system). The proposed OPS-based roofline model helps the
architect to choose among these alternatives. After making major
architectural decisions, we can use BOPS to get an estimate of the
impact of different design choices on the final product, such
as the expected area, power, optimal operational point, etc.
The next section will examine these design processes from
the system design point of view.
5. HCM METRIC EVALUATION
After introducing the use of BOPS as a metric for the hard-
ware complexity of CNN-based algorithms and the use of the
OPS-based roofline model to help the architect understand
how the decisions at the algorithmic level may impact the
characterizations of the final product, this section aims to
provide a holistic view of the design process of systems with
CNN accelerators. We conducted an extensive evaluation of
the design and the implementation of a commonly used CNN
[Figure 8 plot: OPS/s (10^11 to 10^13) vs. OPS/bit (log-log); lines: memory bound, 32 bits fp, 32 bits, 16 bits, 8 bits, 4 bits]
Figure 8: OPS roofline: 3×3 kernel, 64 input and output
56×56 features, 6 mm² accelerator with 100 MHz frequency,
with DDR of 2.4 GHz with 64-bit data bus.
architecture for ImageNet [26] classification, ResNet-18 [12].
We also compared our metric to prior art [20] in terms of
correspondence between complexity score and hardware
utilization for CNN parameters with various bitwidths.
5.1 Experimental methodology
We start the evaluation of the HCM metric with a compre-
hensive review of the use of BOPS as part of the design and
implementation process of a CNN accelerator. This section
shows the trade-offs involved in the process and verifies the
accuracy of the proposed model. It focuses on the implemen-
tation of a single PE since PEs are directly affected by the
quantization process. The area of an individual PE depends
on the chosen bitwidth, while the change in the amount of
input and output features changes both the required number
of PEs and the size of the accumulator. The leading example
we use implements an all-to-all CNN accelerator that can
calculate n input features and m output features in parallel,
as depicted in Fig. 9. For simplicity, we choose an equal
number of input and output features. In this architecture, all
the input features are routed to each of the m blocks of PEs,
each calculating a single output feature. The implementation
[Figure 9 diagram: input features, PEs, output features]
Figure 9: All-to-all topology with n×m processing elements.
was done for an ASIC using the TSMC 28nm technology
library, with an 800 MHz system clock and in the nominal corner of
VDD = 0.81 V. For the power analysis, we used a value of 0.2
for both the input activity factor and the sequential activity
factor. The tool versions are listed in Table 4.
Table 4: CAD Design Tools
Language          Verilog HDL
Logic Simulation  ModelSim 19.1
Synthesis         Synopsys Design Compiler 2017.09-SP3
Place and route   Cadence Innovus 2019.11
For brevity, we present only the results of experiments at
800 MHz clock frequency. We performed additional experi-
ments at 600 MHz and 400 MHz (obviously, neither BOPS
nor the area of an accelerator depends on the chip frequency),
but do not show these results. As shown in Section 4, lower-
ing the frequency of the design can help to avoid the memory
bound, but incurs the penalty of slower solution time.
Our results show a high correlation between the area of the
design and BOPS. The choice of an all-to-all topology shown
in Fig. 9 was made because of an intuitive understanding
of how the accelerator calculates the outputs of the network.
This choice, however, makes the layout harder to route than
alternatives such as broadcast or systolic topologies [6].
For example, a systolic topology, a
popular choice for high-end NN accelerators [17], eases the
routing complexity by using a mesh architecture. Although
it reduces the routing effort and improves the flexibility of
the input/output feature count, it requires more complex
control of the data movement to the PEs.
To verify the applicability of BOPS to different topologies,
we also implemented a systolic array shown in Fig. 10, where
each 1× 1 PE is connected to 4 neighbors with the ability
to bypass any input to any output without calculations. The
input feature accumulator is located at the input of the PE.
This topology generates natural 4×1 PEs, but with proper
control, it is possible to create flexible accelerators. In the
Figure 10: Systolic array of processing elements
systolic design, we generated three square arrays, 4×4, 8×8,
and 16×16, with bw = ba ∈ {4, 6}. The systolic array
area was found to be in linear relation with BOPS, with the
goodness of fit R² = 0.9752.
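The linearity check reported above can be repeated on any set of (BOPS, area) pairs. The following is a minimal sketch, assuming an ordinary least-squares fit in linear space (the text does not state how the fit was computed). The function name `linear_fit` and the sample values are hypothetical placeholders, not the measured areas.

```python
def linear_fit(xs, ys):
    """Least-squares line y = slope * x + intercept and its R^2.

    Assumption: plain ordinary least squares in linear (not log) space.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Placeholder data only: substitute BOPS values and synthesized areas.
placeholder_bops = [1.0, 2.0, 3.0, 4.0]
placeholder_area = [2.1, 3.9, 6.2, 7.8]
slope, intercept, r2 = linear_fit(placeholder_bops, placeholder_area)
print(f"slope={slope:.3f} intercept={intercept:.3f} R^2={r2:.4f}")
```

An R² close to 1 indicates that a single line explains most of the variation in area across the tested array sizes and bitwidths.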
5.2 System-level design using HCM
In this section, we analyze the acceleration of ResNet-
18 using the proposed metrics and show the workflow for
early estimation of the hardware cost when designing an
Figure 11: Area (A) vs. BOPS (B) for a systolic array of 3×3
PEs with variable input (n) and output (m) feature dimensions,
and variable bitwidth. Weights and activations use the same
bitwidth and the accumulator width is set to log2(9m) ·bw ·ba.
Log-log axes (BOPS vs. area); the plot shows 4-bit and 6-bit
designs for n=m=4, n=m=8, and n=m=16, together with a linear fit.
Figure 12: ResNet-18 roofline analysis for all layers (OPS/s vs.
OPS/bit on log-log axes; one panel each for 2-, 3-, and 4-bit
quantization; legend: computation bound, memory bound, single-clock
calculation, serial calculation). Red dots are the performance
required by the layer, and green dots are the
equivalent performance using partial-sum computation. The curves
connecting the dots are linear segments distorted by
the log-log axes, and are displayed only for convenience.
accelerator. We start the discussion by targeting an ASIC
that runs at 800MHz, with 16×16 PEs and the same 2.4GHz
DDR-4 memory with a 64-bit data bus as used in Section 4.
The impact of changing these constraints is discussed at the
end of the section. For the first layer, we replace the 7× 7
convolution with three 3× 3 convolutions, as proposed by
He et al. [13]. This allows us to simplify the analysis by
employing universal 3×3 PEs for all layers.
We start the design process by comparing different alternatives
using the newly proposed OPS-based roofline analysis,
since it helps to explore the design trade-offs between the
multiple solutions. We calculate the amount of OPS/s provided
by 16×16 PEs at 800MHz and the requirements of
each layer. To acquire the roofline, we need to calculate the
OPS/bit, which depends on the quantization level. For ResNet-18,
the current state of the art [9] achieves 69.56% top-1 accuracy on
ImageNet for 4-bit weights and activations, which is only
0.34% less than the 32-bit floating-point baseline (69.9%). Thus,
we decided to focus on 2-, 3-, and 4-bit quantization both for
weights and activations, which can achieve 65.17%, 68.66%,
and 69.56% top-1 accuracy, respectively.
For a given bitwidth, OPS/bit is calculated by dividing the
total number of operations by the total number of bits trans-
ferred over the memory bus, consisting of reading weights
and input activations and writing output activations. Fig. 12
presents the OPS-based roofline for each quantization bitwidth.
Note that each layer is represented by two points: the
red dots are the performance required by the layer, and the
green dots are the equivalent performance using partial-sum
computation.
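The OPS/bit calculation described above can be sketched for a single convolutional layer. This is a minimal illustration under stated assumptions, not the exact accounting used for Fig. 12: one operation is counted per multiply-accumulate, the layer uses stride 1 with equal input and output spatial size, and every weight and activation crosses the memory bus exactly once. The function name `ops_per_bit` and its parameters are hypothetical.

```python
def ops_per_bit(n, m, k, height, width, b_a, b_w):
    """Operations per bit moved over the memory bus for one conv layer.

    n, m          : number of input / output feature maps
    k             : kernel size (k x k)
    height, width : spatial size (assumed equal for input and output)
    b_a, b_w      : activation / weight bitwidths
    """
    # Assumption: one operation per multiply-accumulate.
    ops = n * m * k * k * height * width
    weight_bits = n * m * k * k * b_w      # read weights
    input_bits = n * height * width * b_a  # read input activations
    output_bits = m * height * width * b_a # write output activations
    return ops / (weight_bits + input_bits + output_bits)


# Example layer: 3x3 kernel, 64 input and output 56x56 features.
for bits in (2, 3, 4):
    print(bits, "bit:", round(ops_per_bit(64, 64, 3, 56, 56, bits, bits), 1))
```

Under these assumptions, when weights and activations share one bitwidth, halving that bitwidth doubles OPS/bit, which is why lower-precision layers sit further to the right on the roofline plot.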
Fig. 12 clearly indicates that this accelerator is severely
limited both by computational resources and by insufficient
memory bandwidth. The system is compute-bound, which
can be inferred from the fact that it does not have enough
PEs to calculate all the features simultaneously. Nevertheless,
the system is also memory-bound for any quantization level,
meaning that adding more PE resources would not solve the
problem. It is crucial to make this observation at the early
stages of the design, since it means that micro-architecture
changes would not be sufficient to solve the problem.
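The two limits discussed above can be checked numerically. The sketch below is illustrative only and rests on several assumptions that are not taken from the text: each 3×3 PE performs 9 operations per clock, the memory bandwidth is the stated frequency times the bus width (one transfer per clock), the layer "requirement" is all n·m·k² operations of a 64-input, 64-output layer in every clock, and the OPS/bit value is a placeholder. The names `memory_roof` and `check_layer` are hypothetical.

```python
def memory_roof(bandwidth_bits_per_s, ops_per_bit):
    """Highest OPS/s that the memory bus can feed at a given OPS/bit."""
    return bandwidth_bits_per_s * ops_per_bit


def check_layer(required_ops_per_s, ops_per_bit, peak_ops_per_s,
                bandwidth_bits_per_s):
    """Report which roofs the required performance exceeds."""
    return {
        "exceeds_compute_roof": required_ops_per_s > peak_ops_per_s,
        "exceeds_memory_roof": required_ops_per_s
        > memory_roof(bandwidth_bits_per_s, ops_per_bit),
    }


FREQ_HZ = 800e6                    # ASIC clock
PEAK = 16 * 16 * 9 * FREQ_HZ       # assumption: 9 ops per 3x3 PE per clock
BANDWIDTH = 2.4e9 * 64             # assumption: one 64-bit transfer per clock
REQUIRED = 64 * 64 * 9 * FREQ_HZ   # assumption: whole layer in a single clock
OPS_PER_BIT = 65.9                 # placeholder value for a 4-bit layer

result = check_layer(REQUIRED, OPS_PER_BIT, PEAK, BANDWIDTH)
print(result)
```

A layer that exceeds both roofs cannot be fixed by adding PEs alone, which matches the qualitative observation made from the roofline plot.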
One possible solution, as presented in Section 4, is to
divide the channels of the input and output feature maps
into smaller groups, and use more than one clock cycle to
calculate each pixel. In this way, the effective amount of
the OPS/s required for the layer is reduced. In the case that
the number of feature maps is divisible by the number of
available PEs, the layer will fully utilize the computational
resources, which is the case for every layer except the first one.
Reducing the number of PEs, however, also reduces the data
efficiency, and thus the OPS/bit also decreases, shifting the
points to the left on the roofline plot. Thus, some layers still
require more bandwidth from the memory than what the latter
can supply. In particular, in the case of 4-bit quantization,
most of the layers are memory-bound. The only option that
properly utilizes the hardware is 2-bit quantization, for which
all the layers except one are within the memory bound of
the accelerator. Other options for solving the problem are
to change the neural network topology being used
or to add a data compression scheme on the way to and from
the memory [4, 7]. Adding compression will reduce the
effective memory bandwidth requirement and allow adding
more PEs in order to meet the performance requirements, at
the expense of cost and power.
At this point, BOPS can be used to estimate the power
and the area of each alternative for implementing the
accelerator using the PE micro-design. In addition, we can
explore other trade-offs, such as the influence of modifying
some parameters that were fixed at the beginning: lowering
the ASIC frequency will decrease the computational bound,
which reduces the cost and only hurts the performance if the
network is not memory-bound. An equivalent alternative
is to decrease the number of PEs. Both procedures will
reduce the power consumption of the accelerator as well
as the computational performance. The system architect may
also consider changing the parameters of the algorithm being
used, e.g., change the feature sizes, use different quantization
for the weights and for the activations, include pruning, and
more.
It is also possible to reverse the design order: start with a
BOPS estimate of the number of PEs that can fit into a given
area, and then calculate the ASIC frequency and memory
bandwidth that would allow full utilization of the accelerator.
This can be especially useful if the designer has a specific
area or power goal.
To summarize this section, from the architecture point of
view it is extremely important to be able to predict, at the
Figure 13: Comparison of BOPS and “compute cost” [20]
predictive power (log-log plots of area vs. BOPS and area vs.
compute cost, each with a linear fit). BOPS – 5% error,
“compute cost” – 15%.
early stages of the design, whether the proposed (micro)architecture
is going to meet the project targets. At the project exploration
stage, the system architect has plenty of alternatives to choose
from to make the right trade-offs (or even negotiate to change
the product definition and requirements). Introducing such
alternatives later may be painful or even nearly impossible.
5.3 Comparison with prior metrics
In this section, we compare the BOPS [3] metric to another
complexity metric, introduced by Mishra et al. [20]. A good
complexity metric should have a number of properties. First,
it should reflect the real cost of the design. Second, it should
be possible to calculate it from micro-designs or prior design
results, without needing to generate complete designs. Last,
it should generalize well, providing meaningful predictions
for a wide spectrum of possible design parameter values. We
compare our choice of computational complexity assessment,
BOPS, with the “compute cost” proposed by Mishra et al.
[20]. To analyze the metrics, we use our real accelerator area
results from Section 5 and error bands of linear extrapolation
of the measured values. To remind the reader, BOPS and
“compute cost” are defined as follows:
\[ \mathrm{BOPS} = m n k^2 \left( b_a b_w + b_a + b_w + \log_2(n k^2) \right) \tag{2} \]
\[ \mathrm{compute\_cost} = m n k^2 \left( b_a + b_w \right) \tag{3} \]
The error of predicting a new point with “compute cost” is
15% within 2 orders of magnitude, whereas with BOPS it is
only 5%. As shown in Fig. 13, “compute cost” introduces a
systematic error: each of the distinguishable groups of three
points corresponding to a single value of the number of input
and output features creates a separate prediction line. This
may lead to higher errors in case of extrapolation from a
single value of the number of input and output features or a
wide range of the considered bitwidth.
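Both definitions are simple enough to evaluate directly. The sketch below implements Eqs. (2) and (3) as written; the function names are hypothetical, and the printed configurations (k = 3, equal weight and activation bitwidths) are chosen only for illustration.

```python
import math


def bops(m, n, k, b_a, b_w):
    """Eq. (2): BOPS for n input and m output features, k x k kernel."""
    return m * n * k**2 * (b_a * b_w + b_a + b_w + math.log2(n * k**2))


def compute_cost(m, n, k, b_a, b_w):
    """Eq. (3): the "compute cost" metric of Mishra et al. [20]."""
    return m * n * k**2 * (b_a + b_w)


# Illustrative sweep over feature counts and bitwidths.
for features in (4, 8, 16):
    for bits in (4, 6):
        b = bops(features, features, 3, bits, bits)
        c = compute_cost(features, features, 3, bits, bits)
        print(f"n=m={features:2d} bits={bits}: "
              f"BOPS={b:10.1f} compute_cost={c:7d} ratio={b / c:.2f}")
```

The ratio between the two metrics changes with the bitwidth because only BOPS contains the product term b_a·b_w, so the two metrics cannot both be proportional to area across bitwidths.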
6. DISCUSSION AND CONCLUSIONS
CNN accelerators are commonly used in different systems,
starting from IoT and other resource-constrained devices,
and ending in datacenters and high-performance computers.
Designing accelerators that meet tight constraints is still a
challenging task, since the current EDA and design tools do
not provide enough information to the architect. To make the
right choice, the architects need to understand at the early
stages of the design the impact of high-level decisions they
make on the final product, and to be able to make a fair
comparison between different design alternatives.
In this paper, we showed that one of the fundamental short-
comings of the current design methodologies and tools is the
use of GFLOPS as a metric for estimating the complexity
of existing hardware solutions. The first contribution of this
paper is the definition of the HCM as a metric for hardware
complexity. We demonstrated its application to the prediction
of such product characteristics as power, performance, etc.
The second contribution of the paper is the introduction
of the OPS-based roofline model as a supporting tool for the
architect at the very early stages of the development. We
showed that this model allows the comparison of different
alternatives of the design and the determination of the opti-
mality and feasibility of the solution.
Lastly, we provided several examples of realistic designs,
using an actual implementation with standard design tools
and a mainstream process technology. By applying the pro-
posed metric, we could build a better system and indicate
to the system architect that certain CNN architectures may
better fit the constraints of a specific platform. In particular,
our metric confirmed that CNN accelerators are more likely
to be memory bound, rather than computationally bound [17, 30].
Although this paper is mainly focused on ASIC-based
architectures, the same methodology can be applied to many
other systems, including FPGA-based implementations and
other system-specific domains that allow trading off accuracy
and data representation with different physical parameters
such as power, performance, and area.
Acknowledgments
The research was funded by the Hyundai Motor Company
through the HYUNDAI-TECHNION-KAIST Consortium,
National Cyber Security Authority, and the Hiroshi Fujiwara
Technion Cyber Security Research Center.
References
[1] A. Ankit, I. E. Hajj, S. R. Chalamalasetti, G. Ndu, M. Foltin,
R. S. Williams, P. Faraboschi, W.-m. W. Hwu, J. P. Strachan,
K. Roy, and D. S. Milojicic, “PUMA: A programmable ultra-efficient
memristor-based accelerator for machine learning inference,” in
Proceedings of the Twenty-Fourth International Conference on
Architectural Support for Programming Languages and Operating
Systems, ser. ASPLOS ’19. New York, NY, USA: Association
for Computing Machinery, 2019, pp. 715–731. [Online]. Available:
https://doi.org/10.1145/3297858.3304049
[2] C. Baskin, N. Liss, E. Zheltonozhskii, A. M. Bronstein, and A. Mendel-
son, “Streaming architecture for large-scale quantized neural networks
on an fpga-based dataflow platform,” in 2018 IEEE International Par-
allel and Distributed Processing Symposium Workshops (IPDPSW).
IEEE, 2018, pp. 162–169.
[3] C. Baskin, E. Schwartz, E. Zheltonozhskii, N. Liss, R. Giryes, A. M.
Bronstein, and A. Mendelson, “UNIQ: Uniform noise injection for
non-uniform quantization of neural networks,” 2018.
[4] C. Baskin, B. Chmiel, E. Zheltonozhskii, R. Banner, A. M.
Bronstein, and A. Mendelson, “CAT: Compression-aware training
for bandwidth reduction,” arXiv preprint arXiv:1909.11481, 2019.
[Online]. Available: https://arxiv.org/abs/1909.11481
[5] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-
efficient reconfigurable accelerator for deep convolutional neural net-
works,” IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–
138, Jan 2017.
[6] Y.-H. Chen, T.-J. Yang, J. Emer, and V. Sze, “Eyeriss v2: A flexible
accelerator for emerging deep neural networks on mobile devices,”
IEEE Journal on Emerging and Selected Topics in Circuits and Systems,
vol. 9, no. 2, pp. 292–308, 2019.
[7] B. Chmiel, C. Baskin, R. Banner, E. Zheltonozhskii, Y. Yermolin,
A. Karbachevsky, A. M. Bronstein, and A. Mendelson, “Feature
map transform coding for energy-efficient cnn inference,” arXiv
preprint arXiv:1905.10830, 2019. [Online]. Available: https:
//arxiv.org/abs/1905.10830
[8] S. K. Esser, J. L. McKinstry, D. Bablani, R. Appuswamy, and
D. S. Modha, “Learned step size quantization,” in International
Conference on Learning Representations, 2020. [Online]. Available:
https://openreview.net/forum?id=rkgO66VKDS
[9] R. Gong, X. Liu, S. Jiang, T. Li, P. Hu, J. Lin, F. Yu, and
J. Yan, “Differentiable soft quantization: Bridging full-precision
and low-bit neural networks,” in The IEEE International
Conference on Computer Vision (ICCV), October 2019. [On-
line]. Available: http://openaccess.thecvf.com/content_ICCV_
2019/html/Gong_Differentiable_Soft_Quantization_Bridging_Full-
Precision_and_Low-Bit_Neural_Networks_ICCV_2019_paper.html
[10] P. Gysel, M. Motamedi, and S. Ghiasi, “Hardware-oriented ap-
proximation of convolutional neural networks,” arXiv preprint
arXiv:1604.03168, 2016.
[11] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J.
Dally, “EIE: efficient inference engine on compressed deep neural
network,” in 2016 ACM/IEEE 43rd Annual International Symposium
on Computer Architecture (ISCA). IEEE, 2016, pp. 243–254.
[12] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for
image recognition,” in 2016 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), 2016, pp. 770–778.
[13] T. He, Z. Zhang, H. Zhang, Z. Zhang, J. Xie, and M. Li,
“Bag of tricks for image classification with convolutional neural
networks,” in The IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), June 2019. [Online]. Available:
http://openaccess.thecvf.com/content_CVPR_2019/html/He_Bag_
of_Tricks_for_Image_Classification_with_Convolutional_Neural_
Networks_CVPR_2019_paper.html
[14] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan,
W. Wang, Y. Zhu, R. Pang, V. Vasudevan, Q. V. Le, and
H. Adam, “Searching for MobileNetV3,” in The IEEE International
Conference on Computer Vision (ICCV), October 2019. [Online].
Available: http://openaccess.thecvf.com/content_ICCV_2019/html/
Howard_Searching_for_MobileNetV3_ICCV_2019_paper.html
[15] Z. Jiang, J. Li, and J. Zhan, “The pitfall of evaluating performance on
emerging AI accelerators,” arXiv preprint arXiv:1911.02987, 2019.
[Online]. Available: https://arxiv.org/abs/1911.02987
[16] Q. Jin, L. Yang, and Z. Liao, “Towards efficient training for
neural network quantization,” arXiv preprint arXiv:1912.10207, 2019.
[Online]. Available: https://arxiv.org/abs/1912.10207
[17] N. P. Jouppi, C. Young, N. Patil, D. A. Patterson, G. Agrawal, R. Ba-
jwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P. Cantin,
C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V.
Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, R. C. Ho,
D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski,
A. Kaplan, H. Khaitan, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law,
D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Mag-
giore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni,
K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross,
A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter,
D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tut-
tle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon,
“In-datacenter performance analysis of a tensor processing unit,” in
2017 ACM/IEEE 44th Annual International Symposium on Computer
Architecture (ISCA). IEEE, 2017, pp. 1–12.
[18] J. Lee, T. Won, T. K. Lee, H. Lee, G. Gu, and K. Hong, “Compounding
the performance improvements of assembled techniques in a
convolutional neural network,” arXiv preprint arXiv:2001.06268, 2020.
[Online]. Available: https://arxiv.org/abs/2001.06268
[19] F. H. McMahon, “The livermore fortran kernels: A computer
test of the numerical performance range,” Lawrence Livermore
National Lab., CA (USA), Tech. Rep., 1986. [Online]. Available:
http://www.netlib.org/benchmark/livermore
[20] A. Mishra, E. Nurvitadhi, J. J. Cook, and D. Marr, “WRPN:
Wide reduced-precision networks,” in International Conference
on Learning Representations, 2018. [Online]. Available: https:
//openreview.net/forum?id=B1ZvaaeAZ
[21] R. Morcel, H. Hajj, M. A. R. Saghir, H. Akkary, H. Artail, R. Khanna,
and A. Keshavamurthy, “FeatherNet: An accelerated convolutional
neural network design for resource-constrained FPGAs,” ACM
Transactions on Reconfigurable Technology and Systems (TRETS),
vol. 12, no. 2, pp. 6:1–6:27, Mar. 2019. [Online]. Available:
http://doi.acm.org/10.1145/3306202
[22] M. A. Raihan, N. Goli, and T. M. Aamodt, “Modeling deep
learning accelerator enabled GPUs,” in 2019 IEEE International
Symposium on Performance Analysis of Systems and Software
(ISPASS). IEEE, 2019, pp. 79–92. [Online]. Available: https:
//doi.org/10.1109/ISPASS.2019.00016
[23] V. J. Reddi, C. Cheng, D. Kanter, P. Mattson, G. Schmuelling,
C.-J. Wu, B. Anderson, M. Breughe, M. Charlebois, W. Chou,
R. Chukka, C. Coleman, S. Davis, P. Deng, G. Diamos, J. Duke,
D. Fick, J. S. Gardner, I. Hubara, S. Idgunji, T. B. Jablin, J. Jiao,
T. S. John, P. Kanwar, D. Lee, J. Liao, A. Lokhmotov, F. Massa,
P. Meng, P. Micikevicius, C. Osborne, G. Pekhimenko, A. T. R. Rajan,
D. Sequeira, A. Sirasao, F. Sun, H. Tang, M. Thomson, F. Wei, E. Wu,
L. Xu, K. Yamada, B. Yu, G. Yuan, A. Zhong, P. Zhang, and Y. Zhou,
“MLPerf inference benchmark,” arXiv preprint arXiv:1911.02549,
2019. [Online]. Available: https://arxiv.org/abs/1911.02549
[24] T. Ridnik, H. Lawen, A. Noy, and I. Friedman, “TResNet:
high performance GPU-dedicated architecture,” arXiv preprint
arXiv:2003.13630, 2020. [Online]. Available: https://arxiv.org/abs/
2003.13630
[25] S. Rivas-Gomez, A. J. Pena, D. Moloney, E. Laure, and S. Markidis,
“Exploring the vision processing unit as co-processor for inference,”
2018 IEEE International Parallel and Distributed Processing
Symposium Workshops (IPDPSW), May 2018. [Online]. Available:
http://dx.doi.org/10.1109/IPDPSW.2018.00098
[26] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma,
Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and
L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” In-
ternational Journal of Computer Vision (IJCV), vol. 115, no. 3, pp.
211–252, 2015.
[27] M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, and L.-C.
Chen, “MobileNetV2: Inverted residuals and linear bottlenecks,” 2018
IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pp. 4510–4520, 2018.
[28] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P.
Strachan, M. Hu, R. S. Williams, and V. Srikumar, “ISAAC: A
convolutional neural network accelerator with in-situ analog arithmetic
in crossbars,” in Proceedings of the 43rd International Symposium
on Computer Architecture, ser. ISCA ’16. IEEE Press, 2016, pp.
14–26. [Online]. Available: https://doi.org/10.1109/ISCA.2016.12
[29] Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P. Leong,
M. Jahre, and K. Vissers, “FINN: a framework for fast, scalable
binarized neural network inference,” in Proceedings of the 2017
ACM/SIGDA International Symposium on Field-Programmable Gate
Arrays, ser. FPGA ’17. New York, NY, USA: Association
for Computing Machinery, 2017, pp. 65–74. [Online]. Available:
https://doi.org/10.1145/3020078.3021744
[30] E. Wang, J. J. Davis, P. Y. Cheung, and G. A. Constantinides,
“LUTNet: Rethinking inference in FPGA soft logic,” arXiv
preprint arXiv:1904.00938, 2019. [Online]. Available: https:
//arxiv.org/abs/1904.00938
[31] L. Wang, J. Zhan, W. Gao, Z. Jiang, R. Ren, X. He, C. Luo, G. Lu, and
J. Li, “BOPS, not FLOPS! a new metric and roofline performance
model for datacenter computing,” arXiv preprint arXiv:1801.09212,
2018. [Online]. Available: https://arxiv.org/abs/1801.09212
[32] S. Williams, A. Waterman, and D. Patterson, “Roofline: an insightful
visual performance model for multicore architectures,” Communica-
tions of the ACM, vol. 52, no. 4, pp. 65–76, 2009.
[33] B. Wu, X. Dai, P. Zhang, Y. Wang, F. Sun, Y. Wu, Y. Tian, P. Vajda,
Y. Jia, and K. Keutzer, “FBNet: Hardware-aware efficient convnet
design via differentiable neural architecture search,” 2019 IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), pp.
10 726–10 734, 2018.
[34] J. Yang, X. Shen, J. Xing, X. Tian, H. Li, B. Deng, J. Huang,
and X.-s. Hua, “Quantization networks,” in The IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), June 2019.
[Online]. Available: http://openaccess.thecvf.com/content_CVPR_
2019/html/Yang_Quantization_Networks_CVPR_2019_paper.html
[35] X. Zhao, Y. Wang, X. Cai, C. Liu, and L. Zhang, “Linear symmetric
quantization of neural networks for low-precision integer hardware,” in
International Conference on Learning Representations, 2020. [Online].
Available: https://openreview.net/forum?id=H1lBj2VFPS
