ChewBaccaNN: A Flexible 223 TOPS/W BNN Accelerator by Andri, Renzo et al.
Renzo Andri∗, Geethan Karunaratne∗†, Lukas Cavigelli∗§, Luca Benini∗‡
∗Integrated Systems Laboratory, ETH Zurich, Zurich, Switzerland †IBM Research, Zurich, Switzerland
§Huawei Research, Zurich, Switzerland ‡DEI, University of Bologna, Bologna, Italy
Abstract—Binary Neural Networks enable smart IoT devices,
as they significantly reduce the required memory footprint
and computational complexity while retaining a high network
performance and flexibility. This paper presents ChewBaccaNN,
a 0.7 mm² binary CNN accelerator designed in GlobalFoundries
22 nm technology. By exploiting efficient data re-use,
data buffering, latch-based memories, and voltage scaling, a
throughput of 233 GOPS is achieved while consuming just 1.2 mW
at 0.4 V/154 MHz for the inference of binary CNNs with 7×7
kernels, leading to a core energy efficiency of 223 TOPS/W. This
is up to 4.4× better than other specialized binary accelerators
while supporting full flexibility in kernel configurations. With as
little as 3.9 mJ, using an 8-fold ResNet-18, a Top-1 accuracy on
ImageNet of 67.5% can be achieved, which is just 1.8% less than
using the full-precision ResNet-18.
Keywords—Binary Neural Networks, Hardware Accelerators
I. INTRODUCTION
Convolutional neural networks have revolutionized the machine learning field in recent years, outperforming
humans in image recognition [1] and advancing the state-of-
the-art for a wide range of applications [2]. Most of these net-
works require billions of multiply-accumulate operations per
frame and millions of trained parameters. This is incompatible
with hundreds of kB of on-chip memory and the limited energy
available on battery-powered, low-cost IoT sensor nodes. A
common approach is to analyze the data in the cloud for ease of
deployment and to avoid the performance limitations of edge
devices. However, transmitting the data comes with a high
energy cost, introduces privacy concerns, requires expensive
infrastructure, and incurs high latency. Thus, the focus has
shifted to analyzing the data near the sensor, which has given rise
to research into hardware accelerators [3]. Besides data flow
optimizations and data reuse to improve the throughput and
energy efficiency of convolution operations, particular focus
has been put on reducing off-accelerator data transfers [4],
[5], memory accesses and the operation complexity through
architectural improvements, reduced-precision operands [6],
skipping zero-multiplications [7], and on-the-fly feature map
and model compression [8].
Furthermore, the research community has proposed several
algorithmic advances to enable inference at the edge by reducing
the number of operations: choosing smaller kernel sizes
[9], reducing the number of input channels [9], increasing the
number of skippable operations [10], sharing weights [11], decomposing
2D convolutional layers into 2 sequential 1D (depth-wise
and/or spatial) convolutional layers [12], limiting inter-channel
dependency by channel grouping [13], and using quantized
fixed-point arithmetic (e.g., [14]).
In particular, arithmetic precision scaling has shown sig-
nificant potential, due to its inherent reduction in memory
requirements (i.e., feature maps and weights), and compu-
tational complexity. CNNs have been proven to be robust
to quantization down to 16-bit without retraining [15], and
8 bits with retraining [16]. Binary Neural Networks push the
idea of quantization to the limit and quantize weights and
feature maps to a single bit representing the values -1 and 1.
XNOR-Net extends the stochastic gradient descent algorithm
(commonly used to train NNs) by quantizing the weights and
activations in the forward path and scaling the feature maps by the
ℓ1 matrix norm of the weight kernels [17]. On the challenging
ImageNet Large Scale Visual Recognition Challenge¹,
Rastegari et al. achieved 51.2% accuracy using a binarized ResNet-18,
which is a significant drop of 18.1%. Courbariaux et al.
then achieved state-of-the-art results with 99.04% on MNIST
(+0.34%), 97.47% on SVHN (-0.09%), and 89.85% on CIFAR-10
(-0.46%), but these datasets are much simpler than ImageNet
[18]. Recent research has focused mainly on minimizing
the quantization error (e.g., scaling feature maps in XNOR-
Net), improving the loss function, and reducing the gradient
error [19], [20]. Recently, the accuracy gap between BNNs
and their full-precision equivalents has been brought down to
12% (DoReFaNet on AlexNet [21]). This accuracy gap can be
reduced further by using multiple binary layers (i.e., weight bases)
in parallel and binarizing around multiple thresholds (i.e., activation
bases). Using 3 weight bases, Lin et al. [22] have achieved
61.0%/83.2% (Top-1/Top-5, –8.3%/–6.0% vs. ResNet-18) and
Zhuang et al. reached 72.8%/90.5% (Top-1/Top-5, –3.2%/–2.4%
vs. ResNet-50) using 8 weight bases [23]. Using multiple bases
directly impacts the throughput and energy per inference for
any given hardware accelerator, negating some of the benefits
of BNNs. However, it provides the opportunity to smoothly
scale from a highly efficient, less accurate network to almost
full accuracy inference.
Several BNN hardware accelerators have shown that an energy
efficiency gain of around two orders of magnitude is achievable
compared to quantized neural network accelerators, owing to
the avoidance of off-chip data transfers and to the extreme
reduction in arithmetic complexity, where the full-precision
multiply-accumulate becomes a binary xnor-popcount.
In [24], Conti et al. present a 46 TOPS/W BNN accelerator
tightly coupled to a general-purpose processor (without
considering off-accelerator memory and I/O costs). UNPU is a
stand-alone accelerator for flexible weights (i.e., 1-16 bits) and
feature maps and reaches 51 TOPS/W for fully-binary NNs [25].
BinarEye is a full-custom BNN accelerator that is fully
specialized for 64 channels and only 2×2 kernels, achieving
a peak core energy efficiency of 230 TOPS/W.
¹ Composed of more than 1 million images of 1000 different object classes.
arXiv:2005.07137v1 [eess.SP] 12 May 2020
In this paper, we present ChewBaccaNN, a binary CNN
accelerator designed in GlobalFoundries 22 nm FDX tech-
nology, exploiting the reduced arithmetic complexity using
xnor-gates and efficient popcount adder trees, energy-efficient
latch-based memories, and voltage scaling. ChewBaccaNN
supports the main primitives of CNNs such as pooling, (batch)
normalization, and ReLU activation and is configurable for
kernel sizes from 1×1 to 7×7. Due to the significant reduction
of the memory requirements, all feature maps can be stored on-
chip without incurring unnecessary accesses to external storage
devices. ChewBaccaNN is the 2nd generation of XNORBIN
[26] (in UMC 65 nm), exploiting the more advanced, lower
power 22 nm FD-SOI technology node. SRAMs have been
replaced with much more energy-efficient latch-based mem-
ories; furthermore, power-gating of unused memory banks,
silencing of unused compute units, and extending the voltage
range have been introduced. Additionally, we have extended
ChewBaccaNN to support pooling on the binary activations,
average pooling, and residual paths through the
Near-Memory Compute Unit (NMCU), enabling inference on
state-of-the-art (SoA) binary ResNets. The energy efficiency at 0.4 V has been
increased by 2.3× from 95 to 223 TOPS/W at roughly the
same throughput/frequency of 244 GOPS/156 MHz and 2.3×
lower power consumption (i.e., 2.6 to 1.13 mW) compared to
XNORBIN at its lowest-voltage operating point (0.8 V). Ad-
ditionally, we present in detail algorithmic optimizations that
can be used to exploit fully-binary neural network acceleration.
Then, we evaluate the top SoA networks on ChewBaccaNN,
demonstrating flexibility combined with leading-edge energy
efficiency.
II. BNN AND RELATED HW OPTIMIZATION
In BNNs, the weights and intermediate feature maps are
quantized to a single bit: $X \in \{-1, 1\}^{n_{in} \times h \times b}$ and
$W \in \{-1, 1\}^{n_{out} \times n_{in} \times k_y \times k_x}$. After the multiplication, which is
reduced to an xnor operation, these products and a bias value
are accumulated in full precision, followed by re-binarizing the
sum. This (re-)binarization replaces the activation function and
behaves like a signum function in which zero is mapped
to 1, i.e., $\forall x \geq 0: \mathrm{sgn}(x) = 1$ and $\forall x < 0: \mathrm{sgn}(x) = -1$.
The $k$-th output feature map $o_k$ is the sum of the convolutions of
every binarized input feature map $i_n$ with the corresponding
binary weights $w_{k,n}$ and the bias $C_k$:
\[
o_k = \mathrm{sgn}\left( C_k + \alpha \sum_{n=0}^{n_{in}-1} \mathrm{sgn}(i_n) \ast \mathrm{sgn}(w_{k,n}) \right) \tag{1}
\]
BNNs have much potential for optimization: First, the
memory footprint can be reduced by up to 32× (i.e., compared to
FP32). Second, multiplication can be simplified to xnor.
Training of BNNs is not trivial, as gradients are no longer smooth
due to the high non-linearity of the parameter space.
The most common approach is based on shadow weights in high
precision (e.g., FP32). These weights are binarized during the
forward-propagation. During back-propagation, the gradients
are applied to the shadow weights. Even though the binarization
itself is not differentiable, it can be modeled as the identity
function in the backward pass. This can be interpreted as propagating
the stochastic expected value of the gradient to the weights (i.e.,
straight-through estimator) [27]. Typically, the weights of the first and
last layer are not binarized, as this has a strong impact on the
network performance, while these layers contribute only a small part
of the total computational complexity [21].
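The shadow-weight scheme described above can be illustrated with a minimal sketch of one training step for a single binary neuron. The squared-error loss, the learning rate, the clipping of the shadow weights to [-1, 1], and the names `binarize` and `ste_step` are assumptions made for illustration only; they are not taken from the training setups cited here.

```python
def binarize(w):
    """Signum with zero mapped to +1, as used for (re-)binarization."""
    return [1 if v >= 0 else -1 for v in w]


def ste_step(shadow, x, target, lr=0.1):
    """One SGD step on the high-precision shadow weights.

    The forward pass uses the binarized weights. In the backward pass the
    binarization is treated as the identity (straight-through estimator),
    so the gradient w.r.t. the binary weights updates the shadow weights.
    """
    wb = binarize(shadow)
    y = sum(wi * xi for wi, xi in zip(wb, x))      # forward with binary weights
    grad_y = 2.0 * (y - target)                    # derivative of (y - target)^2
    grad_wb = [grad_y * xi for xi in x]            # gradient w.r.t. binary weights
    new_shadow = [w - lr * g for w, g in zip(shadow, grad_wb)]
    return [max(-1.0, min(1.0, w)) for w in new_shadow]   # keep shadow bounded


shadow = [0.1, -0.2]
updated = ste_step(shadow, x=[1, 1], target=2)
print(binarize(shadow), "->", binarize(updated))
```

In this toy case the second shadow weight crosses zero after the update, so its binarized value flips from -1 to +1 even though the forward pass only ever sees binary weights.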
Bipolar activations and weights $i_n, w_{k,n} \in \{-1, 1\}$ are
mapped to the binary representation $\hat{i}_n, \hat{w}_{k,n} \in \{0, 1\}$, enabling
the replacement of the multiplication with a binary
xnor operation. The bipolar values are mapped to the binary
values with the function $b(x) = \frac{1}{2}(x + 1)$, which needs to be
compensated after accumulation by multiplying by 2 and subtracting
the number of accumulated contributions. By merging
and rearranging, the formula can be turned into the same form
as in Eq. 1, where multiplications within the convolutions are
replaced by xnor operations, indicated by $\ast_{\overline{\oplus}}$:
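\[
\begin{aligned}
o_k &= \mathrm{sgn}\left( C_k + \alpha \sum_{n=0}^{n_{in}-1} \left( 2 \cdot \hat{i}_n \ast_{\overline{\oplus}} \hat{w}_{k,n} - k_y k_x \right) \right) \\
    &= \mathrm{sgn}\left( C_k + \alpha \sum_{n=0}^{\frac{n_{in}}{32}-1} \sum_{(\Delta x, \Delta y)} \left( 2\, \mathrm{popcount}\left( \hat{i}^{\,y+\Delta y,\, x+\Delta x}_{32n:+32} \;\overline{\oplus}\; \hat{w}^{\,\Delta y, \Delta x}_{k,\,32n:+32} \right) - 32 \right) \right)
\end{aligned}
\]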
The (−32)-term compensates for the 32 input channel contributions
that are calculated in parallel. Even though the weights and
feature maps stay binary, the accumulation itself and the
learned bias/scaling factors for batch normalization are still
represented in non-binary form:
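\[
\hat{o}_k = \mathrm{sgn}\left( \hat{C}_k + \hat{\alpha}_k \, \frac{\sum_{n=0}^{n_{in}-1} \hat{i}_n \ast_{\overline{\oplus}} \hat{w}_{k,n} - \mu_k}{\sigma_k} \right) \tag{2}
\]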
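The replacement of bipolar multiply-accumulate by xnor and popcount can be checked with a short sketch. It packs vectors with entries in {-1, +1} into integers using $b(x) = \frac{1}{2}(x+1)$ and compares the ordinary dot product with 2 · popcount(xnor) − n. The helper names and the bit ordering are assumptions of this sketch, not part of the accelerator description.

```python
import random


def dot_bipolar(a, b):
    """Reference: dot product of two vectors with entries in {-1, +1}."""
    return sum(x * y for x, y in zip(a, b))


def pack(v):
    """Map bipolar values to bits with b(x) = (x + 1) / 2 and pack into an int."""
    word = 0
    for pos, x in enumerate(v):
        word |= ((x + 1) // 2) << pos
    return word


def dot_xnor_popcount(a_bits, b_bits, n):
    """Same dot product computed as 2 * popcount(xnor(a, b)) - n."""
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask     # bit is 1 where both operands agree
    return 2 * bin(xnor).count("1") - n


random.seed(0)
n = 32  # word width used in the reformulated expression
a = [random.choice((-1, 1)) for _ in range(n)]
b = [random.choice((-1, 1)) for _ in range(n)]
assert dot_bipolar(a, b) == dot_xnor_popcount(pack(a), pack(b), n)
```

The subtraction of `n` in `dot_xnor_popcount` plays the same role as the (−32)-term in the reformulated expression for a 32-bit word.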
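In the implementation, $\mu_k$, $\hat{C}_k$, $\hat{\alpha}_k$ and $\sigma_k$ can be absorbed
into a single threshold $\theta_k = \mu_k - \frac{\hat{C}_k \sigma_k}{|\hat{\alpha}_k|}$ (for $\hat{\alpha}_k > 0$) applied on the sum of
products: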
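\[
o_k = \begin{cases} -1, & \sum_{n=0}^{n_{in}-1} \hat{i}_n \ast_{\overline{\oplus}} \hat{w}_{k,n} < \theta_k \\ 1, & \text{else} \end{cases} \tag{3}
\]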
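The folding of bias, scaling, and batch normalization into one threshold can be sketched as follows. Solving $\hat{C}_k + \hat{\alpha}_k (s - \mu_k)/\sigma_k \geq 0$ for the sum of products $s$ under the assumption $\hat{\alpha}_k > 0$ and $\sigma_k > 0$ gives $s \geq \mu_k - \hat{C}_k \sigma_k / \hat{\alpha}_k$. The function names are hypothetical, and exact rational arithmetic is used only so that the comparison is free of rounding effects.

```python
from fractions import Fraction
import random


def sgn(x):
    """Signum with zero mapped to +1."""
    return 1 if x >= 0 else -1


def bn_then_sign(s, c, alpha, mu, sigma):
    """Bias, scaling, and batch normalization followed by re-binarization."""
    return sgn(c + alpha * (s - mu) / sigma)


def fold_threshold(c, alpha, mu, sigma):
    """Threshold on the raw sum, assuming alpha > 0 and sigma > 0."""
    return mu - c * sigma / alpha


def threshold_compare(s, theta):
    """Single comparison on the integer sum of products."""
    return -1 if s < theta else 1


random.seed(0)
for _ in range(200):
    c = Fraction(random.randint(-40, 40), 4)
    alpha = Fraction(random.randint(1, 20), 4)     # positive scaling factor
    mu = Fraction(random.randint(-40, 40), 4)
    sigma = Fraction(random.randint(1, 20), 4)     # positive standard deviation
    theta = fold_threshold(c, alpha, mu, sigma)
    for s in range(-30, 31):                       # integer sums of products
        assert bn_then_sign(s, c, alpha, mu, sigma) == threshold_compare(s, theta)
```

For a negative scaling factor the direction of the comparison would flip, which this sketch does not cover.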
Pooling is applied after convolution, scaling, and batch normalization,
but before the re-binarization and, therefore, in the non-binary
domain. However, because thresholding is monotonic and commutes with
the max/min selection, the pooling function can be calculated as a
Boolean operation (e.g., max/min-pooling becomes an AND/OR reduction
and average-pooling becomes Boolean majority voting):
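\[
\mathrm{Pool}(o_k(x, y)) = \begin{cases} -1, & \max_{m,n \in \{0,1\}} \left( o_k(2x+m, 2y+n) \right) < \theta_k \\ 1, & \text{else} \end{cases} \tag{4}
\]
\[
= \begin{cases} -1, & \bigwedge_{m,n \in \{0,1\}} \left( o_k(2x+m, 2y+n) < \theta_k \right) \\ 1, & \text{else} \end{cases} \tag{5}
\]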
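The equivalence of Eq. 4 and Eq. 5 can be verified exhaustively for small values. The sketch below compares max-pooling followed by thresholding with an AND reduction over the per-pixel comparisons for every 2×2 window with values 0 to 6; the value range, the threshold, and the function names are arbitrary choices for illustration.

```python
from itertools import product


def binarize(v, theta):
    """Threshold a non-binary accumulator value to {-1, +1}."""
    return -1 if v < theta else 1


def pool_then_binarize(window, theta):
    """Max-pooling in the non-binary domain, followed by thresholding."""
    return binarize(max(window), theta)


def binarize_then_pool(window, theta):
    """Compare each value first; output -1 only if all values are below theta."""
    all_below = all(v < theta for v in window)   # AND reduction over (v < theta)
    return -1 if all_below else 1


theta = 3
for window in product(range(0, 7), repeat=4):    # every 2x2 window with values 0..6
    assert pool_then_binarize(window, theta) == binarize_then_pool(window, theta)
```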
[Figure 1 blocks: Feature Map Memory, DMA, BPU Array, Row Banks 0 to n, Memory Interconnect, Parameter Buffer, NMCU, X-bar, Scheduler, I/O]
Fig. 1: Top-level schematic of ChewBaccaNN
III. ARCHITECTURE
The architecture of ChewBaccaNN is illustrated in Fig. 1
and its components are explained as follows:
[Figure 2 blocks: xnor_sum_0 to xnor_sum_6, each fed with 16-bit Img and Wgt words from controlled shift registers (CSRs), followed by a popcount adder]
Fig. 2: Architecture of the basic processing unit (BPU)
Each Basic Processing Unit (BPU) performs a 1D convolution
of an input image row with a kernel row from 16 input
channels at a time by employing xnor_sum units, each consisting
of 16 xnor-gates and a popcount adder tree.
The xnor_sum unit is replicated 7 times to produce outputs
corresponding to a window of at most 7 input feature map
pixels in a row. The outputs from all units are accumulated
by a second-stage adder tree to create a 1D inner product
(shown in Fig. 2). 7 BPUs are instantiated in order to support
kernel sizes up to 7×7. The outputs of all these instances are
pipelined to increase throughput and are then added up in a
third-stage adder tree to produce the 2D inner product. Each of the
xnor_sum instances is fed with the input activations and weight
data through a controlled shift register to enable data reuse.
The same BPU array datapath is reused to perform the binary
max-pooling operation, where the 2nd and 3rd stage adder trees are
flanked by a 1-bit comparator (AND gate) tree.
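A small functional model may help to relate the BPU organization to the peak throughput figures. The sketch below models one xnor_sum unit and one BPU and derives the peak operations per cycle from the 16 channels, 7 units per BPU, and 7 BPUs described above. Counting each binary multiply-accumulate as 2 operations is an assumption of this sketch, as are all function and constant names.

```python
CHANNELS_PER_UNIT = 16   # xnor gates per xnor_sum unit (input channels per step)
UNITS_PER_BPU = 7        # xnor_sum units per BPU (kernel columns)
NUM_BPUS = 7             # BPUs in the array (kernel rows)
OPS_PER_MAC = 2          # assumption: one xnor plus one accumulation per binary MAC


def xnor_sum(img_bits, wgt_bits):
    """One xnor_sum unit: 16 xnor gates followed by a popcount."""
    mask = (1 << CHANNELS_PER_UNIT) - 1
    return bin(~(img_bits ^ wgt_bits) & mask).count("1")


def bpu_row(img_words, wgt_words):
    """One BPU: second-stage sum over the outputs of its xnor_sum units."""
    return sum(xnor_sum(i, w) for i, w in zip(img_words, wgt_words))


def peak_gops(freq_hz, kernel=7):
    """Peak throughput when kernel x kernel xnor_sum units are active."""
    macs_per_cycle = kernel * kernel * CHANNELS_PER_UNIT
    return macs_per_cycle * OPS_PER_MAC * freq_hz / 1e9


print(peak_gops(154e6))             # about 241.5 GOPS for 7x7 kernels
print(peak_gops(154e6, kernel=5))   # about 123.2 GOPS for 5x5 kernels
print(peak_gops(154e6, kernel=3))   # about 44.4 GOPS for 3x3 kernels
```

Under these assumptions the model gives roughly 241, 123, and 44 GOPS at 154 MHz, which is in line with the throughput values reported for the three kernel sizes in Section IV-B.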
ChewBaccaNN comes with a Feature Map Memory (FMM) and several
levels of data buffering.
Feature Map Memory (FMM): The FMM stores the feature maps and
the partial sums of the convolutions. The memory is divided
into two blocks, where one serves as the data source (i.e.,
current input feature maps), and the other serves as the data sink
(i.e., partial or final output feature maps). They are swapped
after each layer. If the FMM is dimensioned to fit the largest
intermediate FMs, no energy-costly off-chip memory accesses
are needed during inference.
Parameter Buffer (PB): The PB stores the weights, the binarization
thresholds, and the configuration parameters. In the optimal
case, it stores all the weights of the network to avoid I/O
for weight loading. If the parameters are too many to fit
on-chip, the PB can be reused to buffer off-chip accesses. To
conceal the weight loading latency, the PB supports double buffering.
Row Banks: The Row Banks buffer rows of the input feature maps
for frequent accesses. They also contain the rows of filter weights
corresponding to the batch of output channels calculated in
parallel. Since the row banks need to be rotated when
shifting the convolution window down, they are connected
to the BPU array through a crossbar.
Crossbar: The crossbar connects the Row Banks to the registers
inside the BPUs, i.e., the controlled shift registers (CSRs, L1),
which hold the input feature map elements under the kernel window
and the filter weight elements. These are shifted when the
convolution window is moved forward.
DMA: The DMA moves data independently from the FMM and the PB
into (via the Row Banks) and out of the BPU array.
Scheduler: According to the given layer configuration of a CNN,
the scheduler controls how and when the crossbar routes feature
map and weight data from the Row Banks to the BPUs in order to
compute row-wise partial sums for each member of the batch.
Near-Memory Compute Unit (NMCU): The NMCU is illustrated in
Fig. 3 and is used for on-the-fly computation when writing back
to the main memory. This includes accumulating the partial sums
from the BPU array, accumulating residual paths from the FMM,
the binarization, and storing back to the FMM in a packed format
(i.e., 16 activations per word).
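The write-back path through the NMCU can be sketched as two small functions: a read-add-write accumulation of partial sums and a thresholding step that packs 16 binary activations into one word. The bit ordering, the use of one threshold per output, and the function names are assumptions of this sketch and are not specified in the text.

```python
def nmcu_accumulate(stored_partial, bpu_sum, residual=0):
    """Read-add-write: add the new BPU result and an optional residual value."""
    return stored_partial + bpu_sum + residual


def nmcu_binarize_and_pack(sums, thresholds):
    """Threshold 16 accumulated sums and pack the result into one 16-bit word.

    Bit k of the word is 1 when sums[k] >= thresholds[k] (bipolar +1), else 0.
    The bit ordering is an assumption of this sketch.
    """
    assert len(sums) == len(thresholds) == 16
    word = 0
    for k, (s, theta) in enumerate(zip(sums, thresholds)):
        if s >= theta:
            word |= 1 << k
    return word


partial = nmcu_accumulate(stored_partial=5, bpu_sum=3)
word = nmcu_binarize_and_pack(list(range(16)), [8] * 16)
print(partial, hex(word))
```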
[Figure 3 blocks: inputs from memory (partial sums), from the configuration (binarization data), and from the BPU (new non-binary data); two adders and a binarize stage; output to memory]
Fig. 3: Datapath of the near memory compute unit (NMCU)
To maximize kernel-level reuse, filter weights are retained
in the BPUs while selected image rows are streamed through them,
and partial sums are computed concurrently for a batch of
output feature map tiles to maximize row-level image reuse. The
integer value that the BPU array produces in
each cycle as a result of the horizontally sliding convolution
window is forwarded to the DMA controller via the Near-Memory
Compute Unit (NMCU). The NMCU accumulates the
partial results by means of a read-add-write operation. After
the final accumulation of partial results, the unit also performs
the thresholding/re-binarization operation (i.e., activation and
batch normalization). The binary results are packed into 16-bit
words and written back to the memory by the DMA unit. The
scheduling is determined with the objective of maximizing the
data reuse at the different levels of the memory hierarchy. The
scheduling algorithm and the mapping of operations to the BPUs
are explained in Alg. 1 based on the filter dimensions $k_w$ and
$k_h$, the spatial input dimensions $i_w$ and $i_h$, the depths (i.e.,
input channels $c_i$ and output channels $c_o$), and the channel
tile sizes $\hat{c}_i$ and $\hat{c}_o$. Parallel execution is indicated in the
brackets in lines 4 and 6-8. After one tile of $\hat{c}_o$ output channels
is computed (i.e., binary convolution, (optional) pooling, and
thresholding), the process is repeated for all output channel
tiles. In the next step, the configuration of the subsequent layer
is loaded, the FMM sink/source direction is reversed, and the
layer is calculated.

Fig. 4: Throughput [GOPS] vs. Core Energy Efficiency [TOPS/W] for various
timing constraints (t_lp,synth = 1.5 ns, 2.0 ns, 4.0 ns, 8.0 ns) and kernel
sizes (kx × ky = 7×7, 5×5, 3×3) at 0.4 V supply voltage and FMM = 4 kB.
Algorithm 1 High-level scheduling of 1 BNN layer
Require: k_w, k_h, i_w, i_h, c_i, c_o, ĉ_i, ĉ_o
 1: for n_o ← 0 to c_o/ĉ_o do
 2:   for n_i ← 0 to c_i/ĉ_i do
 3:     for n_row ← 0 to i_h do
 4:       for b_o ← 0 to ĉ_o (per BPU array in parallel) do
 5:         pass kernels of channel b_o to Bank memory
 6:         for k_row ← −(k_h/2) to (k_h/2) (parallel) do
 7:           for k_col ← −(k_w/2) to (k_w/2) (parallel) do
 8:             for b_i ← 0 to ĉ_i (HW parallel) do
 9:               pass input feature map pixel (b_i, n_row, n_col) and
                  weight (b_o, b_i, k_col, k_row) to BPU array b_o;
                  calculate xnor-popcount and accumulate
10:             end for
11:           end for
12:         end for
13:         binarize final partial sums;
            pool operation (if applicable)
14:       end for
15:     end for
16:   end for
17: end for
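The loop nest of Alg. 1 can be mirrored in a plain software model to check that the tiling over input and output channels does not change the result. The sketch below is a functional model only: it runs sequentially, uses bipolar multiplication in place of the hardware xnor-popcount, adds an explicit loop over output columns, skips pixels outside the image, assumes odd kernel sizes, and binarizes after all input-channel tiles have been accumulated. These choices and all names are assumptions of the sketch.

```python
import random


def schedule_layer(ifm, weights, thresholds, ci_tile, co_tile):
    """Sequential functional model of the loop nest in Alg. 1."""
    c_i, i_h, i_w = len(ifm), len(ifm[0]), len(ifm[0][0])
    c_o, k_h, k_w = len(weights), len(weights[0][0]), len(weights[0][0][0])
    psum = [[[0] * i_w for _ in range(i_h)] for _ in range(c_o)]
    for n_o in range(0, c_o, co_tile):                     # line 1: output tiles
        for n_i in range(0, c_i, ci_tile):                 # line 2: input tiles
            for n_row in range(i_h):                       # line 3: rows
                for b_o in range(n_o, min(n_o + co_tile, c_o)):          # line 4
                    for n_col in range(i_w):               # sliding window
                        acc = 0
                        for k_row in range(-(k_h // 2), k_h // 2 + 1):   # line 6
                            for k_col in range(-(k_w // 2), k_w // 2 + 1):  # line 7
                                y, x = n_row + k_row, n_col + k_col
                                if not (0 <= y < i_h and 0 <= x < i_w):
                                    continue               # assumption: skip border
                                for b_i in range(n_i, min(n_i + ci_tile, c_i)):  # line 8
                                    w = weights[b_o][b_i][k_row + k_h // 2][k_col + k_w // 2]
                                    acc += ifm[b_i][y][x] * w   # xnor-popcount in HW
                        psum[b_o][n_row][n_col] += acc     # NMCU read-add-write
    # line 13: binarize once all partial sums are final
    return [[[1 if psum[o][y][x] >= thresholds[o] else -1 for x in range(i_w)]
             for y in range(i_h)] for o in range(c_o)]


random.seed(1)
rnd = lambda: random.choice((-1, 1))
ifm = [[[rnd() for _ in range(5)] for _ in range(5)] for _ in range(4)]
weights = [[[[rnd() for _ in range(3)] for _ in range(3)] for _ in range(4)]
           for _ in range(2)]
full = schedule_layer(ifm, weights, [0, 0], ci_tile=4, co_tile=2)
tiled = schedule_layer(ifm, weights, [0, 0], ci_tile=2, co_tile=1)
assert full == tiled   # tiling changes the schedule, not the result
```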
[Figure 5 blocks: Feature Map Memory Block 1, Feature Map Memory Block 2, Parameter Buffer, Row Bank Memory, BPUs, Memory Interconnect]
Fig. 5: Floorplan of ChewBaccaNN core
IV. RESULTS
A. Physical Implementation
ChewBaccaNN has been implemented with a 7.5-track
standard-cell library in GlobalFoundries 22 nm FDX technology
and synthesized with Synopsys Design Compiler 2018.06.
Cadence Innovus 18.11 was used for back-end design and
power simulation, and Questa Modelsim 10.6b was used
for verification and extraction of switching activities for power
simulation. To reach the highest energy efficiency, we operate
at VDD = 0.4 V with 0.1 V forward body-biasing. To scale the
voltage down to this level and to reduce the energy cost
per memory access by 3.5×, we use standard-cell memories
(SCMs) instead of SRAMs. The SCMs are designed with
hierarchical clock gating and address/data silencing mechanisms;
thus, when a bank is not accessed, the whole latch array
consumes no dynamic power [6]. The SCMs are composed
of multiple banks of 256 words × 32 bit (1 kB). The FMM
is dimensioned to fit the two largest consecutive layers of
the network, which has to be supported without tiling. We
have selected the two FMM blocks to be either 16 and 32 SCM banks
(48 kB) for AlexNet or 2×73 banks (146 kB) for both AlexNet and
ResNet-18; the parameter buffer has 2 banks (3.5 kB), and the 7
row bank memories consist of 1 SCM bank each (i.e., 3.5 kB
in total). The final floorplan is shown in Fig. 5. It can be seen
that a large part of the chip (i.e., 97%) is occupied by memories,
whereas the compute units occupy just 1% of the total chip area of
0.7 mm². The power consumption has been evaluated with
back-annotated post-layout simulation. The stimuli vectors,
including weights, input feature maps, and configuration, are
generated from the network model in Torch by a custom
compiler written in Python and streamed to the accelerator
through the input interface. I/O energy has been estimated at
21 pJ/bit based on an LPDDR3 memory access cost evaluation
[4].
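Two quantities used in this dimensioning are easy to estimate by hand: the storage needed for a binary feature map and the I/O energy for moving a given number of bits. The sketch below uses the 21 pJ/bit estimate quoted above; the layer shape in the example is hypothetical and was chosen only for illustration, as were the function names.

```python
IO_ENERGY_PJ_PER_BIT = 21.0   # I/O cost estimate quoted in the text


def binary_fm_bytes(channels, height, width):
    """Storage for a binary feature map at one bit per activation."""
    return channels * height * width // 8


def io_energy_uj(num_bits):
    """Energy in microjoules for moving num_bits across the I/O interface."""
    return num_bits * IO_ENERGY_PJ_PER_BIT * 1e-6


# Hypothetical layer shape, not taken from the evaluated networks.
fm_bytes = binary_fm_bytes(64, 56, 56)
print(fm_bytes)                      # 25088 bytes
print(io_energy_uj(fm_bytes * 8))    # about 4.2 uJ to stream this map once
```

Under these assumptions, streaming a single feature map of this size off-chip would already cost a few microjoules, which illustrates why keeping all feature maps in the FMM matters for the energy per frame.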
B. Throughput to Energy Trade-Off
We have synthesized the design and run the back-end flow at various
timing constraints to explore the energy-efficiency to throughput
trade-off at 0.4 V. The results are shown in Fig. 4, where the same
color denotes the same timing constraint. Due to the lower density
of the SCMs compared to SRAMs, the chip shows
high leakage (the dashed line represents the energy efficiency limit
based on leakage power), which limits the core energy efficiency
to 185/100/39.0 TOPS/W with a 48 kB FMM (i.e., the same
size as in XNORBIN) at a throughput of 241/123/44.3 GOPS
and a core power consumption of 1.3/1.2/0.90 mW for the
full-utilization case with 7×7/5×5/3×3 kernels. Fortunately, most of
the FMM banks stay unused and can, therefore, be power-gated.
Thus, with 4 active banks (4 kB), the power consumption
reduces to 1.08/0.99/0.68 mW and the energy efficiency
increases to 223/124/65 TOPS/W. UNPU supports
CNN inference with 3×3 and 5×5 kernels; here, ChewBaccaNN
is 2.0× more energy-efficient than UNPU [25]. For 7×7
kernels, ChewBaccaNN outperforms UNPU by as much as 4.4×. The
highest efficiency value has been reported by BinarEye with
230 TOPS/W, but BinarEye can handle only 64 channels, is
limited to uncommon 2×2 kernels, and is 2× larger (1.4 mm²)
than ChewBaccaNN.
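The energy-efficiency figures for the power-gated configuration follow directly from throughput divided by power, since GOPS per mW equals TOPS per W. The sketch below reproduces this arithmetic with the values quoted above for 4 active FMM banks; the dictionary layout and function name are choices of this sketch.

```python
def tops_per_watt(gops, milliwatts):
    """GOPS / mW equals TOPS / W (1e9 / 1e-3 = 1e12 operations per joule)."""
    return gops / milliwatts


# Throughput [GOPS] and core power [mW] with 4 active FMM banks (4 kB).
power_gated = {"7x7": (241.0, 1.08), "5x5": (123.0, 0.99), "3x3": (44.3, 0.68)}
for kernel, (gops, mw) in power_gated.items():
    print(kernel, round(tops_per_watt(gops, mw)))      # 223, 124, 65

unpu_tops_per_watt = 51.0
print(round(tops_per_watt(241.0, 1.08) / unpu_tops_per_watt, 1))   # 4.4
```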
C. Accuracy and Energy-Efficiency for Various BNNs
In this section, we compare several published BNNs and
how efficiently they map to ChewBaccaNN in Tab. I. The first
two BNN networks have been used in embedded applications.
Rusci et al. trained and implemented a VGG-like network on
the CIFAR-10 dataset which does image recognition on 32×32
colored images with 10 classes, and achieves an accuracy
of 86.6% (4.6% less than FP32 baseline) [28]. Cerutti et al.
presented a BNN for Sound Event Detection on the Freesound
database with 28 classes [29]. The audio data is converted to a
Mel-frequency cepstral spectrogram and fed to a binary CNN
with 5 layers with 3×3 kernels, followed by 3 layers with
1×1 kernels. They achieve an accuracy of 77.9% (i.e., a drop
of 7.9% with respect to full precision). Both networks have
been implemented on the low-power Gapuino board featuring
the GAP8 multi-core processor, with 8+1 energy-optimized
RISC-V cores implementing the RISC-V RV32IMC ISA and
the Xpulp ISA extensions (i.e., supporting bit-manipulation,
popcount, hardware-loop, post-increment load/stores, ...) [30].
In Cerutti et al.’s network, we tile the input FM into 2 tiles with
an overlap of 20 columns to fit in the 146 kB FMM. Running
these networks on ChewBaccaNN consumes
3.9/296 μJ/frame, a 195/86× improvement over
the embedded GAP8 implementation.
Furthermore, we evaluate the SoA BNNs on the
challenging ImageNet image classification challenge:
DoReFaNet with the smallest reported Top-1 accuracy gap of 12.3%
[21], XNOR-Net++ with the best Top-1 accuracy among standard
BNNs [31], and the two multi-base binary networks ABC-Net
[22] and Zhuang et al. [23]. DoReFaNet was evaluated
with a 48 kB FMM, and the others with a 76 kB FMM. ABC-Net
extends ResNet-18 with 3 parallel BNN layers per original
full-precision layer (i.e., 3 weight bases) and reaches an accuracy of
61.0% (-8.3%) requiring 1.46 mJ, and Zhuang et al. reach 67.5% (-1.8%)
with 8 bases at a 3.9 mJ energy cost per frame. XNOR-Net++
can be run at an energy cost of 487 μJ at 57.1% Top-1
accuracy and a throughput of 23 GOPS.
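The energy figures for the multi-base networks are close to what a simple linear scaling of the single-base ResNet-18 result would give. The sketch below assumes that energy per frame grows and the frame rate shrinks proportionally to the number of weight bases; this linear model is an assumption of the sketch and not a model stated in the text. It starts from the 487 μJ and 7.2 FPS reported for the single-base ResNet-18.

```python
def scale_by_bases(energy_uj, fps, num_bases):
    """Assumption: energy scales up and frame rate down linearly with the bases."""
    return energy_uj * num_bases, fps / num_bases


single_base = (487.0, 7.2)   # energy per frame in uJ, frames per second
for bases in (1, 3, 8):
    energy, fps = scale_by_bases(*single_base, bases)
    print(bases, round(energy), round(fps, 1))
```

The scaled values of about 1461 μJ and 3896 μJ are close to the 1.46 mJ and 3.9 mJ reported for 3 and 8 bases, and the scaled frame rates of 2.4 and 0.9 FPS match Table I.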
TABLE I: Real network performance on SoA BNNs. Gap
shows the performance difference to the full-precision baseline network.

                        Acc.   Gap     Util.  Core Eff.  Dev. Eff.  P     En.    Throughput
Network                 [%]    [Δ%]    [%]    [TOPS/W]   [TOPS/W]   [mW]  [μJ]   [GOPS]  [FPS]
CIFAR-10, 10 classes, 3×32×32
[28] VGG                86.8   -4.6    54.3   24.5       7.5        3.2   1.4    23.9    2.3k
Freesound, Sound Event Detection with 28 classes, 1×400×64 MFCC Spectrograms
[29] 5C3×3−3C1×1        77.9   -7.9    18.9   10.6       7.7        2.2   296    16.8    7.2
ImageNet, 1000 classes, 3×224×224
[21] AlexNet            43.6   -12.3   45.3   27.8       13.8       2.1   141    28.5    14.7
[31] ResNet-18          57.1   -12.2   52.0   14.6       6.5        3.5   487    23.0    7.2
[22] 3× ResNet-18       61.0   -8.3    52.0   14.6       6.5        3.5   1462   23.0    2.4
[23] 8× ResNet-18       67.5   -1.8    52.0   14.6       6.5        3.5   3900   23.0    0.9

V. CONCLUSION
We have described a best-in-class BNN accelerator that
achieves an energy efficiency of up to 223 TOPS/W while
keeping the flexibility of running a wide range of BNNs, an
improvement of 4.4× over the closest competitor UNPU, which
reaches 51 TOPS/W. We can run state-of-the-art BNNs such
as ABC-Net or the one by Zhuang et al., reaching a Top-1
accuracy of 61.0% and 67.5% on ImageNet at merely 1.46
and 3.9 mJ/frame, respectively.
ACKNOWLEDGMENT
The authors thank A. Al Bahou for his contributions to the
design and implementation of the first generation XNORBIN.
REFERENCES
[1] K. He et al., “Deep Residual Learning for Image Recognition,” Proc.
IEEE CVPR, pp. 770–778, 2015.
[2] A. Khan et al., “A survey of the recent architectures of deep convolu-
tional neural networks,” arXiv preprint arXiv:1901.06032, 2019.
[3] V. Sze et al., “Efficient Processing of Deep Neural Networks: A Tutorial
and Survey,” Proc. IEEE, vol. 105, no. 12, 2017.
[4] R. Andri et al., “Hyperdrive: A Multi-Chip Systolically Scalable Binary-
Weight CNN Inference Engine,” IEEE JETCAS, 2019.
[5] S. Moini et al., “A resource-limited hardware accelerator for convolu-
tional neural networks in embedded vision applications,” IEEE TCAS-II,
2017.
[6] R. Andri et al., “YodaNN: An Architecture for Ultra-Low Power Binary-
Weight CNN Acceleration,” IEEE TCAD, 2017.
[7] S. Han et al., “EIE: Efficient Inference Engine on Compressed Deep
Neural Network,” in ACM/IEEE ISCA, 2016.
[8] L. Cavigelli et al., “EBPC: Extended bit-plane compression for deep
neural network inference and training accelerators,” IEEE JETCAS,
vol. 9, no. 4, pp. 723–734, 2019.
[9] F. N. Iandola et al., “SqueezeNet: AlexNet-level accuracy with 50x
fewer parameters and <0.5 MB model size,” arXiv:1602.07360, 2016.
[10] S. Han et al., “Learning Both Weights and Connections for Efficient
Neural Networks,” NIPS, 2015.
[11] ——, “Deep Compression - Compressing Deep Neural Networks with
Pruning, Trained Quantization and Huffman Coding,” in ICLR, 2016.
[12] M. Sandler et al., “MobileNetV2: Inverted Residuals and Linear Bot-
tlenecks,” in IEEE CVPR, 2018.
[13] X. Zhang et al., “ShuffleNet: An Extremely Efficient Convolutional
Neural Network for Mobile Devices,” in IEEE CVPR, 2018.
[14] R. Zhao et al., “Improving Neural Network Quantization without
Retraining using Outlier Channel Splitting,” in ICML, 2019.
[15] D. Lin et al., “Fixed point quantization of deep convolutional networks,”
in Proc. ICML, 2016.
[16] B. Jacob et al., “Quantization and training of neural networks for
efficient integer-arithmetic-only inference,” in IEEE CVPR, 2018.
[17] M. Rastegari et al., “XNOR-Net: ImageNet Classification Using Binary
Convolutional Neural Networks,” in Proc. ECCV, 2016.
[18] I. Hubara et al., “Binarized neural networks,” in Adv. NIPS, 2016.
[19] H. Qin et al., “Binary neural networks: A survey,” Pattern Recogn.,
2020.
[20] M. Spallanzani et al., “Additive noise annealing and approximation
properties of quantized neural networks,” arXiv:1905.10452, 2019.
[21] S. Zhou et al., “DoReFa-Net: Training low bitwidth convolutional neural
networks with low bitwidth gradients,” arXiv:1606.06160, 2016.
[22] X. Lin et al., “Towards accurate binary convolutional neural network,”
in Adv. NIPS, 2017.
[23] B. Zhuang et al., “Structured binary neural networks for accurate image
classification and semantic segmentation,” in IEEE CVPR, 2019.
[24] F. Conti et al., “XNOR Neural Engine: A Hardware Accelerator IP for
21.6-fJ/op Binary Neural Network Inference,” IEEE TCAS, 2018.
[25] J. Lee et al., “UNPU: A 50.6TOPS/W unified deep neural network ac-
celerator with 1b-to-16b fully-variable weight bit-precision,” in ISSCC,
2018.
[26] A. A. Bahou et al., “XNORBIN: A 95 TOp/s/W hardware accelerator
for binary convolutional neural networks,” in IEEE COOL Chips, 2018.
[27] Y. Bengio et al., “Estimating or propagating gradients through stochastic
neurons for conditional computation,” arXiv:1308.3432, 2013.
[28] M. Rusci et al., “Always-ON visual node with a hardware-software
event-based binarized neural network inference engine,” in ACM CF,
2018.
[29] G. Cerutti et al., “Sound Event Detection with Binary Neural Networks
on Tightly Power-Constrained IoT Devices,” under review. [Online].
Available: https://iis-people.ee.ethz.ch/~andrire/paper.pdf
[30] M. Gautschi et al., “Near-Threshold RISC-V core with DSP extensions
for scalable IoT endpoint devices,” IEEE TVLSI, 2017.
[31] A. Bulat et al., “XNOR-Net++: Improved binary neural networks,”
arXiv:1909.13863, 2019.
