FATNN: Fast and Accurate Ternary Neural Networks by Chen, Peng et al.
FATNN: Fast and Accurate Ternary Neural Networks
Peng Chen1∗, Bohan Zhuang2∗, Chunhua Shen1†
1The University of Adelaide 2Monash University
Abstract
Ternary Neural Networks (TNNs) have received much attention due to being
potentially orders of magnitude faster in inference, as well as more power efficient,
than full-precision counterparts. However, 2 bits are required to encode the ternary
representation with only 3 quantization levels leveraged. As a result, conventional
TNNs have similar memory consumption and speed compared with the standard
2-bit models, but have worse representational capability. Moreover, there is still a
significant gap in accuracy between TNNs and full-precision networks, hampering
their deployment to real applications. To tackle these two challenges, in this work,
we first show that, under some mild constraints, the computational complexity of
ternary inner product can be reduced by 2×. Second, to mitigate the performance
gap, we elaborately design an implementation-dependent ternary quantization
algorithm. The proposed framework is termed Fast and Accurate Ternary Neural
Networks (FATNN). Experiments on image classification demonstrate that our
FATNN surpasses the state of the art by a significant margin in accuracy. More
importantly, we evaluate the speedup against various precisions on
several platforms, which serves as a strong benchmark for further research.
1 Introduction
Equipped with high-performance computing and large-scale datasets, deep convolutional neural networks
(DCNNs) have become a cornerstone for most computer vision tasks. However, a significant
obstacle to deploying DCNN algorithms on mobile/embedded edge devices with limited computing
resources is the ever-growing computational complexity: to achieve good accuracy, the
models are becoming very heavy. To tackle this problem, much research effort has been spent on
model compression. Representative methods include model quantization [1, 2], network pruning
[3, 4] and neural architecture search for lightweight models [5, 6]. In this paper, we focus on model
quantization, which reduces the model complexity by representing a network with low-precision
weights and activations.
Network quantization aims to map the continuous input values within a quantization interval to
the corresponding quantization level, and a low-precision quantized value is assigned accordingly.
TNNs, in which both the activations and weights are quantized to ternary values, are of particular interest
because most of the calculations can be realized with bit operations, thus completely eliminating
multiplications. However, there exist two limitations for conventional TNNs. The first limitation is
the inefficient implementation of TNNs. Specifically, the ternary representation of {−1, 0, 1} needs 2
bits to encode, with one state wasted. As a result, with the conventional bitwise implementation of
quantized networks [2, 7], the complexity of the ternary inner product is the same as that of the standard 2-bit
counterpart. Another limitation is the considerable accuracy drop compared with the full-precision
counterparts due to the much more limited representational capacity.
∗First two authors contributed equally.
†Corresponding author. E-mail: chunhua.shen@adelaide.edu.au
Preprint. Work in progress.
arXiv:2008.05101v1 [cs.LG] 12 Aug 2020
To handle these drawbacks, we introduce a new framework, termed FATNN, where we co-design
the underlying ternary implementation of computation and the quantization algorithm. In terms
of implementation, we fully leverage the property of the ternary representation to design a series
of bit operations to accomplish the ternary inner product with improved efficiency. In particular,
FATNN reduces the computational complexity of TNNs by 2×, which solves the existing efficiency
bottlenecks. In contrast to previous works, nearly no arithmetic operations exist in the proposed
implementation. Also, FATNN works efficiently on almost all kinds of devices (such as CPU,
GPU, DSP and FPGA) with basic bit operation instructions available. Furthermore, we design
a compatible ternary quantization algorithm in accordance with the mild constraints derived from
the underlying implementation. Specifically, we introduce a new way to learn quantizers that
minimize the task loss. Early works with learned quantizers either propose to learn the quantized
values [7] or seek to learn the quantization interval [8, 9]. However, most of them assume a uniform
quantizer step size, which might still be suboptimal for network performance. To make
the low-precision discrete values sufficiently fit the statistics of the data distribution, we propose
to parameterize the step size of each quantization level and optimize them with the approximate
gradient. The overall approach is usable for quantizing both activations and weights, and works with
existing methods for backpropagation and stochastic gradient descent.
Our main contributions are summarized as follows:
• We propose a ternary quantization pipeline that can be applied on general platforms, in
which we co-design the underlying implementation and the quantization algorithm.
• We devise a fast ternary inner product implementation which reduces the complexity of
TNNs by 2× while keeping the bit-operation-compatible merit. We then design a highly
accurate ternary quantization algorithm in accordance with the constraints imposed by the
implementation. Specifically, we propose to improve quantizers by learning the step size of
each quantization level in conjunction with other network parameters.
• We evaluate the execution speed of FATNN and compare it with other bit configurations
on various platforms. Moreover, experiments on ImageNet classification demonstrate
the superior performance of our FATNN over a few competitive state-of-the-art methods.
1.1 Related Work
Model quantization aims to quantize the weights, activations and even backpropagation gradients into
low-precision, to yield highly compact DCNN models compared to their floating-point counterparts.
As a result, most of the multiplication operations in network inference can be replaced by more
efficient addition or bitwise operations. In particular, BNNs [10, 11, 12, 13, 14, 15, 16, 17, 18],
where both weights and activations are quantized to binary tensors, are reported to have potentially
32× memory compression ratio, and up to 58× speed-up on CPU compared with the full-precision
counterparts. However, BNNs still suffer from a sizable performance drop, which hinders them
from being widely deployed. To make a trade-off between accuracy and complexity, researchers
also study ternary [19, 20] and higher-bit quantization [2, 21, 9, 22, 23]. In general, quantization
algorithms aim at tackling two core challenges. The first challenge is to design accurate quantizers
to minimize the information loss. Early works use handcrafted heuristic quantizers [2] while later
studies propose to adjust the quantizers to the data, basically based on matching the original data
distribution [2, 24], minimizing the quantization error [7, 25] or directly optimizing the quantizer with
stochastic gradient descent [21, 8, 26]. The second challenge is to approximate the gradient of the
non-differentiable quantizer. To solve this problem, most studies focus on improving the training
via loss-aware optimization [27], regularization [28, 25, 29], knowledge distillation [1, 30], entropy
maximization [31, 32] and relaxed optimization [33, 34, 35, 36]. In addition to the quantization
algorithms design, the implementation frameworks and acceleration libraries [37, 38, 39, 40] are
indispensable to expedite the quantization technique to be deployed on energy-efficient edge devices.
For example, TBN [41] focuses on the implementation of ternary activation and binary weight
networks. daBNN [42] targets the inference optimization of BNNs on ARM CPU devices.
GXNOR-Net [43] treats TNNs as a kind of sparse BNNs and proposes an acceleration solution on
dedicated hardware platforms. However, few works target improving the inference
efficiency of TNNs, especially on general-purpose computing platforms. In this paper, we propose to
co-design the underlying implementation and the quantization algorithm to achieve ideal efficiency
and accuracy simultaneously.
2 Proposed Method
2.1 Preliminary
As the inner product is one of the fundamental operations in convolutional neural networks and
consumes most of the execution time, we mainly focus on the acceleration of the inner product in this
paper. It is worth first reviewing how the inner product between two quantized vectors is computed
in previous literature. For BNNs, in which both the weights and activations are binarized to {−1, 1},
the inner product between two length-N vectors x, y ∈ {−1, 1}^N can be derived using bit-wise
operations:
x · y = 2 · popcount(xnor(x,y))−N. (1)
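As an illustration of Eq. (1), the following minimal sketch packs two {−1, 1} vectors into Python integers (bit 1 for +1, bit 0 for −1) and evaluates the inner product with xnor and popcount. The helper names `pack_binary` and `binary_dot` are hypothetical and chosen only for this example; a real implementation would operate on fixed-width machine words.

```python
def pack_binary(values):
    """Pack +1/-1 values into an int, one bit per element (+1 -> 1, -1 -> 0)."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word


def binary_dot(x_word, y_word, n):
    """Eq. (1): x . y = 2 * popcount(xnor(x, y)) - N."""
    full = (1 << n) - 1                  # keep only the n valid bits
    xnor = ~(x_word ^ y_word) & full     # bit is 1 where the two signs agree
    return 2 * bin(xnor).count("1") - n


x = [1, -1, -1, 1, 1, -1, 1, 1]
y = [1, 1, -1, -1, 1, -1, -1, 1]
result = binary_dot(pack_binary(x), pack_binary(y), len(x))
print(result)
```

The factor 2 and the subtraction of N convert the count of agreeing positions into the sum of ±1 products.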
Furthermore, for quantization with more bits, the input vectors can be decomposed with a linear
combination of binary bases. For example, an M-bit vector x can be encoded as $x = \sum_{m=0}^{M-1} x^m \cdot 2^m$,
where $x^m \in \{-1, 1\}^N$ (in practical implementation, −1 and 1 are represented by 0 and 1
respectively). Similarly, for another K-bit vector y, we have $y = \sum_{k=0}^{K-1} y^k \cdot 2^k$, where
$y^k \in \{-1, 1\}^N$. Based on the decomposition, the binary inner product specified in Eq. (1) can be used to
compute higher bit inner product. Generally, the inner product between two quantized vectors can be
formulated as
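$$x \cdot y = \sum_{m=0}^{M-1} \sum_{k=0}^{K-1} \alpha_m \beta_k \, (x^m \odot y^k), \qquad (2)$$
where $\odot$ is specially used to denote the binary inner product in the formulation of Eq. (1), and $\alpha \in \mathbb{R}^M$
and $\beta \in \mathbb{R}^K$ are scales to encode x and y, respectively. Specifically, $\alpha_m = 2^m$ and $\beta_k = 2^k$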
are used in uniform fixed-point quantization [2, 44] while α and β become trainable scales for
non-uniform quantization [7, 26]. Considerable speedups can be achieved by Eq. (1) and Eq. (2)
because all the calculation can be realized with bit operations [11, 45, 39], thus completely eliminating
multiplications.
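The decomposition behind Eq. (2) can be sketched as follows for the uniform case, where the scales are powers of two. This is only an illustrative sketch: the helpers `to_planes` and `multibit_dot` are hypothetical names, bit lists stand in for packed words, and the values are assumed to have the form of a sum of ±2^m terms as in the text.

```python
def binary_dot(x_bits, y_bits):
    """Eq. (1) on bit lists: bit 1 encodes +1, bit 0 encodes -1."""
    n = len(x_bits)
    matches = sum(1 for a, b in zip(x_bits, y_bits) if a == b)  # popcount(xnor)
    return 2 * matches - n


def to_planes(values, num_bits):
    """Split values of the form sum_m s_m * 2^m (s_m in {-1, +1}) into bit planes."""
    offset = (1 << num_bits) - 1
    # (v + offset) / 2 is an unsigned integer whose m-th bit encodes s_m.
    unsigned = [(v + offset) // 2 for v in values]
    return [[(u >> m) & 1 for u in unsigned] for m in range(num_bits)]


def multibit_dot(x, y, m_bits, k_bits):
    """Eq. (2) with alpha_m = 2^m and beta_k = 2^k: M * K binary inner products."""
    x_planes = to_planes(x, m_bits)
    y_planes = to_planes(y, k_bits)
    total = 0
    for m in range(m_bits):
        for k in range(k_bits):
            total += (2 ** m) * (2 ** k) * binary_dot(x_planes[m], y_planes[k])
    return total


x = [3, -1, 1, -3]   # 2-bit values: each is (+/-1) * 1 + (+/-1) * 2
y = [1, 1, -3, 3]
result = multibit_dot(x, y, 2, 2)
print(result)
```

The double loop makes the M · K cost visible: each pair of bit planes requires one binary inner product.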
2.2 Motivations for Acceleration
Assuming the computational complexity of BNNs in Eq. (1) to be O(N), the computational
complexity for higher bit quantization in Eq. (2) becomes O(M · K · N). Higher bit
quantization algorithms thus acquire better task accuracy at the cost of increased computational complexity.
In particular, for the TNN case, 2 bits are required for the data representation of {−1, 0, 1}. Thus the
computational complexity of the ternary inner product is O(4N), which is the same as that of the standard 2-bit
counterpart, but with one of the quantization levels wasted (2 bits can express 4 quantization
levels at most). As a result, the implementation in Eq. (2) makes TNNs less appealing than standard
2-bit models in practice.
To fully unleash the potential of TNNs, we further observe that the binary inner product in Eq. (1) is
the core for acceleration since its multiplication and accumulation are realized by the bit operators
xnor and popcount, respectively. As the input of the inner product in BNNs is restricted to {−1, 1},
the multiplication result is also within the set {−1, 1}, which we call the “non-overflow” property.
The multiplication result can be directly obtained via xnor between the input vectors, and it
takes only two states, so popcount can be used to realize accumulation by simply
counting the number of state “1” (or state “−1”). As a result, Eq. (1) enables the same parallelism
degree³ for xnor and popcount, with the ALU register fully utilized. Interestingly, we find that the
ternary quantized values {1, 0, −1} also meet the “non-overflow” property. Thus, TNNs can
potentially be executed with the same parallelism degree for the multiplication (i.e., xnor) and
accumulation (i.e., popcount) operations. Moreover, we can utilize this property to design a novel
ternary inner product implementation with a reduced complexity of O(2N).
3 The same parallelism degree indicates the data amount processed is the same per instruction. Further
explanation is given in Section 5.1 in the appendix.
2.3 Ternary Network Acceleration
We now elaborate the design of the fast ternary inner product implementation. First, it is worth noting
that the ternary values {−1, 0, 1} will be represented by the corresponding codec in the practical
implementation. We elaborately design the mapping between the logical level ternary values and
their implementation level codec as illustrated in Table 1. Interestingly, we observe a property that
the number of “1” in each value’s codec equals the value plus one. Therefore, we can compute the
inner product between two ternary vectors x, y ∈ {−1, 0, 1}^N as follows:
x · y = popcount(TM(x,y))−N, (3)
where TM(·) indicates the ternary multiplication and N is the vector length. In Eq. (3), we first
compute the inner product in the codec space by popcount(TM(x,y)), which consists of pure bit
operations. Then we simply subtract N from the result to transform it into the inner product of the
logical level ternary vectors.
Table 1: The correspondence mapping between the quantized data space and the codec space
employed in the proposed solution. Both 2’b01 and 2’b10 are taken to represent the “0” value. The
design of the codec owns the attribute of “popcount(codec) = data + 1”.
| data  | −1    | 0     | 0     | 1     |
|-------|-------|-------|-------|-------|
| codec | 2’b00 | 2’b01 | 2’b10 | 2’b11 |
Table 2: Truth table of ternary multiplication, x(codec) · y(codec) = z(codec). Rows in bold are the
cases that do not follow the xnor rule.
| x (codec) | y (codec) | z (codec) |
|-----------|-----------|-----------|
| −1 (2’b00) | −1 (2’b00) | 1 (2’b11) |
| −1 (2’b00) | 0 (2’b01) | 0 (2’b10) |
| −1 (2’b00) | 0 (2’b10) | 0 (2’b01) |
| −1 (2’b00) | 1 (2’b11) | −1 (2’b00) |
| 0 (2’b01) | −1 (2’b00) | 0 (2’b10) |
| **0 (2’b01)** | **0 (2’b01)** | **0 (2’b01)** |
| **0 (2’b01)** | **0 (2’b10)** | **0 (2’b01)** |
| 0 (2’b01) | 1 (2’b11) | 0 (2’b01) |
| 0 (2’b10) | −1 (2’b00) | 0 (2’b01) |
| **0 (2’b10)** | **0 (2’b01)** | **0 (2’b10)** |
| **0 (2’b10)** | **0 (2’b10)** | **0 (2’b10)** |
| 0 (2’b10) | 1 (2’b11) | 0 (2’b10) |
| 1 (2’b11) | −1 (2’b00) | −1 (2’b00) |
| 1 (2’b11) | 0 (2’b01) | 0 (2’b01) |
| 1 (2’b11) | 0 (2’b10) | 0 (2’b10) |
| 1 (2’b11) | 1 (2’b11) | 1 (2’b11) |
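To make the codec property and Eq. (3) concrete, the sketch below packs ternary vectors with the Table 1 codec (2 bits per element) and checks that the popcount of the codec-space product, minus N, gives the inner product. Note the assumption: `reference_tm` here is a deliberately naive stand-in that decodes, multiplies and re-encodes each element; it is not the bitwise TM(·) of the paper, and all helper names are hypothetical.

```python
CODEC = {-1: 0b00, 0: 0b01, 1: 0b11}              # 0 may also be encoded as 0b10
DECODE = {0b00: -1, 0b01: 0, 0b10: 0, 0b11: 1}


def popcount(v):
    return bin(v).count("1")


def pack(values):
    """Pack ternary values into one int, 2 bits per element."""
    word = 0
    for i, v in enumerate(values):
        word |= CODEC[v] << (2 * i)
    return word


def reference_tm(x_word, y_word, n):
    """Element-wise product in codec space via decode / multiply / encode."""
    out = 0
    for i in range(n):
        xv = DECODE[(x_word >> (2 * i)) & 0b11]
        yv = DECODE[(y_word >> (2 * i)) & 0b11]
        out |= CODEC[xv * yv] << (2 * i)
    return out


def ternary_dot(x_word, y_word, n):
    """Eq. (3): x . y = popcount(TM(x, y)) - N."""
    return popcount(reference_tm(x_word, y_word, n)) - n


x = [1, 0, -1, -1, 0, 1]
y = [-1, 1, -1, 0, 0, 1]
result = ternary_dot(pack(x), pack(y), len(x))
print(result)
```

Because each element's codec has popcount equal to its value plus one, summing over N elements overshoots the inner product by exactly N.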
Second, we illustrate how to design TM(·) with pure bit operations⁴. For ease of understanding, the
truth table of ternary multiplication is listed in Table 2. Since two bits are required to encode the
ternary input value and the ternary multiplication result, there are 16 possibilities with respect to the
codec. From Table 2, we observe that most rows in the table, except the bold ones, still follow the
rule of xnor:
xnor = ∼ (x ∧ y), (4)
Therefore, the ternary multiplication can be realized by xnor with the exception cases fixed.
After reviewing the cases where xnor is correct and the exception cases, we find that the exceptions
only happen when both of the operands are 0. Therefore, the ternary multiplication result can be
fixed by locating the “zero operand” cases and forcing the result to be 0 (we force the result to be 0 if
a “zero operand” is detected, regardless of whether xnor gives an incorrect result or not).
Overall, our implementation of the ternary multiplication consists of three steps: (1) obtain the
intermediate result by the xnor operation; (2) identify the 0 operand; (3) fix the exceptions and
obtain the ternary multiplication result. The first step is easily achieved by Eq. (4). For the second
step, we introduce an auxiliary variable auxi, which is a pre-defined constant in codec 2′b01. In
fact, the auxiliary variable auxi represents a zero value (one codec of the 0 value is 2′b01)
and acts as a mask to fetch specific bits in operands. We then propose to identify the 0 operand
using the following bit operations:
switch = ((y >> 1) & auxi) | ((y << 1) & ∼ auxi), (5)
mask = switch ∧ y, (6)
4 We follow the C/C++ grammar in the equations. For example, & means “AND”, ∼ indicates “NOT”, ∧
represents “XOR”.
With the shift and mask operations, Eq. (5) actually results in the exchange of the two sequence bits
in the operand. After that, Eq. (6) generates the mask information by distinguishing whether the
operand is 0 (in codec 2′b01 or 2′b10) or not. Specifically, the mask variable in Eq. (6) will be 2′b11
if 0 operand is detected and 2′b00 otherwise. It should be noted that the detection of the 0 operand
can be applied on either the first operand or the second operand, which means all the y in Eq. (5) and
Eq. (6) can be replaced by x. After identifying the 0 operand, we can easily fix the exceptions and
obtain the final ternary multiplication result as:
TM(·) = (mask & auxi) | ((∼ mask) & xnor). (7)
If the mask is 2′b11 (a 0 operand is detected), then Eq. (7) reduces to auxi, which encodes the value 0. In contrast,
if the mask is 2′b00, then Eq. (7) becomes xnor, whose correctness can be checked against Table 2.
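Eqs. (4) to (7) can be put together in a short sketch, shown here in Python with arbitrary-precision integers standing in for machine words. This is an illustration under stated assumptions rather than the authors' C++/OpenCL code: the helper names are hypothetical, zero is always packed as 2'b01, and the explicit `full` mask only emulates a fixed register width.

```python
CODEC = {-1: 0b00, 0: 0b01, 1: 0b11}
DECODE = {0b00: -1, 0b01: 0, 0b10: 0, 0b11: 1}


def popcount(v):
    return bin(v).count("1")


def pack(values):
    """Pack ternary values into one int, 2 bits per element."""
    word = 0
    for i, v in enumerate(values):
        word |= CODEC[v] << (2 * i)
    return word


def ternary_multiply(x, y, n):
    """Bitwise ternary multiplication of n packed elements, following Eqs. (4)-(7)."""
    full = (1 << (2 * n)) - 1            # emulate a register holding n elements
    auxi = int("01" * n, 2)              # 0b0101...01, the codec of 0 in every slot
    xnor = ~(x ^ y) & full                                   # Eq. (4)
    switch = ((y >> 1) & auxi) | ((y << 1) & ~auxi & full)   # Eq. (5): swap bit pairs
    mask = switch ^ y                                        # Eq. (6): 0b11 where y is 0
    return (mask & auxi) | (~mask & xnor & full)             # Eq. (7)


def ternary_dot(x, y, n):
    """Eq. (3): x . y = popcount(TM(x, y)) - N."""
    return popcount(ternary_multiply(x, y, n)) - n


x = [1, 0, -1, -1, 0, 1]
y = [-1, 1, -1, 0, 0, 1]
result = ternary_dot(pack(x), pack(y), len(x))
print(result)
```

The zero check is applied to `y` only; when `y` is non-zero, xnor already yields a valid codec for the product, including the case where `x` is 0.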
Remark 1 We can derive from Eq. (3) that the computational complexity of the proposed ternary
inner product is O(2N). In other words, FATNN can reduce the computational complexity of TNNs
from O(4N) to O(2N), which significantly improves the efficiency (memory consumption is not
changed). Therefore, even though TNNs do not make full use of the 2-bit representational capacity,
they fortunately enjoy the faster implementation than the standard 2-bit models. Besides, our solution
can adapt to general purpose computing platforms, such as CPU, GPU and DSP. Note that the extra
bit operations introduced in Eqs. (5), (6) and (7) have negligible runtime overhead for deployment.
On the one hand, these extra bit operations are much faster than the accumulation operation in Eq. (3).
On the other hand, if y in Eqs. (5) and (6) represents model weights, the auxi, switch and mask
variables can all be pre-determined without additional runtime cost. We further provide extensive
benchmark results in Section 3.1 to justify our analysis.
Constraints on the algorithm. From the formulation discussed above, it can be seen that the
designed ternary implementation places certain requirements on the quantization algorithm. The
constraints are summarized as follows:
• The ternary values for the network are limited to {−1, 0, 1}. One and only one additional
high precision coefficient is allowed to adjust the scale of the quantized values. More than
one scale coefficient would break Eq. (3). This means that methods such as TTN [20], in which
two trainable variables (Wp, Wn) are learnt, cannot be applied to our method.
• A special case exists for the activation quantization when the ReLU non-linearity is applied,
which leads to a non-negative data range. In this situation, we advise modifying the ternary
values to {0, 1, 2} for activations. The revision does not conflict with the first constraint because,
in the inference procedure, it only results in an additional constant on the output (simply add a
copy of the weights, which are fixed after training, to the result).
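One way to read the parenthetical remark in the second constraint is the short derivation below. The shifted activation symbol $\hat{a}_i$ is notation introduced here for illustration and does not appear in the original text.

```latex
% Assumption: activations a_i \in \{0, 1, 2\} are written as a shifted ternary value.
a_i = \hat{a}_i + 1, \qquad \hat{a}_i \in \{-1, 0, 1\}
\\
\sum_{i=1}^{N} w_i a_i \;=\; \sum_{i=1}^{N} w_i \hat{a}_i \;+\; \sum_{i=1}^{N} w_i
```

Under this reading, the first term is a ternary inner product that fits Eq. (3), while the second term depends only on the weights and can therefore be computed once after training.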
2.4 Non-uniform Step Size Quantization
[Figure 1: two panels, (a) and (b), each showing a "Distribution" curve and a "Quantization" line.]
Figure 1: The proposed non-uniform ternary quantization for (a) a tensor in the real domain; (b) a
tensor which only contains non-negative values. The vertical axis represents the quantized domain
and the horizontal axis denotes the real domain. The green curve indicates the distribution of the
full-precision tensor and the blue line shows the quantized values by discretizing the full precision
data according to the learned thresholds. We aim to learn the optimal step size of each quantization
level.
Now we start to design the ternary quantization algorithm under the above constraints. Let us first
consider the b-bit quantization for an unsigned tensor, where the valid quantized values include
$\{0, 1, \cdots, 2^b - 1\}$. Then the quantization thresholds are assumed to be $\{0.5, 1.5, \cdots, 2^b - 1.5\}$ in
previous works [2, 21, 9, 7], which means the quantizer step size is uniform for all quantization levels.
However, this places no guarantees on accurately matching the statistics of the data distribution. To
tackle this problem, current solutions are based on either learning a transformation function that
occurs completely prior to the discretization itself [8] or relaxing the round/ceil to a combination
of sigmoid functions [34]. However, extra functions and hyperparameters are introduced in these
methods, requiring a very careful tuning during the optimization procedure. In this paper, we instead
propose a simple yet efficient solution to learn the step size of each quantization level automatically
by stochastic gradient descent.
Let us first consider the ternary weight quantization as illustrated in Figure 1 (a). In particular, we
parameterize the three step sizes by introducing two learnable parameters {α1, α2}. In this way, the
full-precision data is partitioned into three levels with the quantization thresholds {−α1/2, α2/2},
where each step size can be adjusted accordingly during training. Then we define the weight quantizer
Qw(p;α1, α2) for a tensor p parameterized by the scale factors {α1, α2}. Specifically, Qw(p;α1, α2)
performs quantization by applying three point-wise operations in order: normalization, saturate and
round.
Normalization: Since we do non-uniform quantization, the tensor elements are first normalized by
the scale factors α1 and α2, respectively. This operation aims to map the data from the floating-point
domain to the quantized domain. Saturate: Once normalized, the tensor elements that are out of
the range of the quantized domain are clipped accordingly: clip(p, β1, β2) = min(max(p, β1), β2),
where the scalar clipping limits {β1, β2} are independent of the full-precision data range. Round:
We discretize the normalized and clipped tensor elements to the nearest integers using bankers rounding,
denoted by $\lfloor \cdot \rceil$. Putting the above point-wise operations together, the weight quantization function
can be written as:
$$Q_w(p_i) = \begin{cases} \lfloor \mathrm{clip}(p_i/\alpha_1, -1, 0) \rceil & \text{if } p_i < 0, \\ \lfloor \mathrm{clip}(p_i/\alpha_2, 0, 1) \rceil & \text{otherwise}, \end{cases} \qquad (8)$$
where $p_i$ is the i-th element in the tensor p. For simplicity, Eq. (8) can be re-written into a unified
form:
$$Q_w(p_i) = \lfloor \mathrm{clip}(p_i/\alpha_1, -1, 0) \rceil + \lfloor \mathrm{clip}(p_i/\alpha_2, 0, 1) \rceil. \qquad (9)$$
Since bankers rounding is non-differentiable, we use the straight through estimator (STE) [46] to
approximate the gradient through the round function $\lfloor \cdot \rceil$ as a pass through operation, and differentiate
all other operations normally [47].
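The forward pass of the weight quantizer in its unified form, Eq. (9), can be sketched in a few lines. This is a minimal illustration only: the step sizes below are arbitrary example values rather than learned ones, the function names are hypothetical, and the STE backward pass is not shown. Python's built-in `round` uses round-half-to-even, which matches bankers rounding.

```python
def clip(p, lo, hi):
    """clip(p, lo, hi) = min(max(p, lo), hi)."""
    return min(max(p, lo), hi)


def quantize_weight(p, alpha1, alpha2):
    """Eq. (9): negative branch scaled by alpha1, non-negative branch by alpha2."""
    negative_part = round(clip(p / alpha1, -1, 0))
    positive_part = round(clip(p / alpha2, 0, 1))
    return negative_part + positive_part


alpha1, alpha2 = 0.8, 1.2          # example step sizes (assumed, not learned)
weights = [-1.0, -0.5, -0.3, 0.0, 0.5, 0.7, 2.0]
quantized = [quantize_weight(w, alpha1, alpha2) for w in weights]
print(quantized)
```

With these example values the decision thresholds sit at −α1/2 = −0.4 and α2/2 = 0.6, so the negative and positive sides use different step sizes.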
For activation quantization, there exist two cases. On the one hand, if p is in the real domain (e.g.,
when PReLU or Tanh is used as the non-linearity), the activation quantizer is the same as Qw(·). On the other
hand, if p is in the non-negative domain (applicable to ReLU activations), the quantized values should
be adapted accordingly. In this case, the activation quantizer becomes:
$$Q_a(p_i) = \lfloor \mathrm{clip}(p_i/\alpha_1, 0, 1) \rceil + \lfloor \mathrm{clip}((p_i - \alpha_1)/\alpha_2, 0, 1) \rceil. \qquad (10)$$
Note that we learn independent step sizes for weights and activations of each quantized layer.
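For the non-negative case, a forward-only sketch of Eq. (10) is given below. It reads the second term as clip((p − α1)/α2, 0, 1), which is an interpretation of the printed formula; the step sizes are arbitrary example values and the function names are hypothetical.

```python
def clip(p, lo, hi):
    """clip(p, lo, hi) = min(max(p, lo), hi)."""
    return min(max(p, lo), hi)


def quantize_activation(p, alpha1, alpha2):
    """Eq. (10) for non-negative inputs, producing values in {0, 1, 2}."""
    first_level = round(clip(p / alpha1, 0, 1))
    second_level = round(clip((p - alpha1) / alpha2, 0, 1))
    return first_level + second_level


alpha1, alpha2 = 1.0, 2.0          # example step sizes (assumed, not learned)
activations = [-0.3, 0.2, 0.6, 1.5, 1.9, 2.3, 5.0]
quantized = [quantize_activation(a, alpha1, alpha2) for a in activations]
print(quantized)
```

Under this reading the two thresholds are α1/2 and α1 + α2/2 (0.5 and 2.0 in the example), so the width of each level is controlled by its own step size.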
Remark 2 The proposed non-uniform step size quantization algorithm can be easily integrated in
the general deep learning training framework. It is worth noting that the algorithm meets all the
constraints derived in Section 2.3, which only have requirements on the quantized values while
having no requirements on the quantization thresholds. Moreover, the proposed ternary quantization
algorithm can be potentially generalized to any bit quantization. More discussions of the proposed
quantization algorithm are put in Section 5.2 in the appendix.
3 Experiments
3.1 Acceleration
In this paper, we propose a fast ternary inner product implementation with O(2N) complexity,
which is potentially 2× faster than the implementation in previous works [2, 7]. To further justify
its effectiveness in practice, we develop the acceleration code (in C++ and OpenCL) for binary,
ternary and 2-bit convolutions⁵. Measurements of the actual execution time on different devices are
conducted. We cover both the embedded-side devices (Qualcomm 821 and Qualcomm 835⁶) and
server-side devices (Nvidia 1080Ti and Nvidia 2080Ti) to demonstrate the flexibility of our solution.
We keep the experimental setting the same across all devices.
5 We put the implementation details in Section 5.1 in the appendix.
6 We conduct experiments on Google Pixel and Xiaomi 6 phones, which are equipped with the Qualcomm 821/835
chips respectively. Any platform with the same chip is expected to produce a similar result.
[Figure 2: grouped bar chart. Horizontal axis: layer shapes (w:28 h:28 c:64; w:56 h:56 c:64; w:112 h:112 c:64; w:224 h:224 c:64; w:56 h:56 c:128; w:56 h:56 c:256; w:56 h:56 c:512). Vertical axis: speedup ratio (0 to 5). Series: Q821-bin-vs-ter, Q835-bin-vs-ter, 1080Ti-bin-vs-ter, Q821-bin-vs-2bit, Q835-bin-vs-2bit, 1080Ti-bin-vs-2bit.]
Figure 2: Speedup ratios of binary inner product against the ternary and 2-bit counterparts.
Q821/Q835/1080Ti indicate the devices of Qualcomm 821, 835 and Nvidia 1080Ti, respectively.
Data is grouped according to layer shapes (c: channel, w: width, h: height). “bin” denotes binary and
“ter” means ternary.
First, we report the layer-wise speedups in Figure 2, where convolution layers (kernel size = 3× 3,
padding = 1, stride = 1, the same number of input and output channels and batch size = 1) with
seven different shape configurations are tested. For the first four cases in Figure 2, we fix the channel
number to be 64 and increase the resolution from 28 to 224. For the last three cases, we fix the
resolution to be 56 and double the channel number from 64 to 512. We report the relative acceleration
ratios between binary, ternary and 2-bit models on Qualcomm 821, 835 and Nvidia 1080Ti platforms
in Figure 2. Moreover, we put the exact per-layer execution time, the results on Nvidia 2080Ti as
well as the influence of the extra bit operations discussed in Remark 1 in Section 5.4 in the appendix.
From Figure 2, we observe that, compared with the binary quantization, the proposed ternary
quantization is less than 2× slower in most cases. Specifically, the ternary inner product is 1.4×
to 2.5×, 1.3× to 2.6× and 1.3× to 1.8× slower than the binary one on Qualcomm 821, 835 and
Nvidia 1080Ti, respectively. In contrast, the 2-bit quantization is at least 2.9× slower than the binary
counterpart. In some cases, the 2-bit quantization is even 4.4× slower than the binary one. It is
obvious that our FATNN has superior speed over the 2-bit models, with about 2× speedup ratio.
Besides the layer-wise analysis, it is also interesting to examine the overall speed of
some classical networks. We present the whole execution time of all quantized layers (first and last
layers are excluded while non-linear and skip connection layers are included) for ResNet-18 and
ResNet-34 on ImageNet in Table 3. We set the batch size to 1. From the table, we can see that both
the ternary and 2-bit models run faster than their theoretical slowdown versus the binary counterpart (2×
and 4×, respectively) would suggest. Moreover, the speedup of the ternary models compared with the 2-bit ones (last
column) again demonstrates that our proposed FATNN can run about 2× faster than the 2-bit models
or conventional TNNs.
Table 3: Exact execution time (ms) and speedup ratios for overall quantized layers. “bin” means
binary and “ter” is ternary. We run 5 times and report the results with mean and standard deviation.
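| Device | Network   | bin      | ter      | 2-bit     | bin vs. ter | bin vs. 2-bit | ter vs. 2-bit |
|--------|-----------|----------|----------|-----------|-------------|---------------|---------------|
| Q835   | ResNet-18 | 12.1±0.2 | 20.1±0.2 | 43.0±0.7  | 1.7         | 3.6           | 2.1           |
| Q835   | ResNet-34 | 25.3±0.3 | 42.1±0.5 | 87.0±1.5  | 1.7         | 3.4           | 2.1           |
| Q821   | ResNet-18 | 15.7±0.3 | 25.2±0.3 | 52.3±1.0  | 1.6         | 3.3           | 2.1           |
| Q821   | ResNet-34 | 32.3±0.6 | 51.4±0.6 | 105.2±2.1 | 1.6         | 3.3           | 2.0           |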
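As a quick consistency check, the speedup columns of Table 3 can be recomputed from the mean execution times. The sketch below assumes the ratios are taken between mean times and rounded to one decimal, which the text does not state explicitly; the variable and function names are hypothetical.

```python
# Mean execution times in ms from Table 3: (bin, ter, 2-bit).
times_ms = {
    ("Q835", "ResNet-18"): (12.1, 20.1, 43.0),
    ("Q835", "ResNet-34"): (25.3, 42.1, 87.0),
    ("Q821", "ResNet-18"): (15.7, 25.2, 52.3),
    ("Q821", "ResNet-34"): (32.3, 51.4, 105.2),
}


def speedup_ratios(bin_ms, ter_ms, two_bit_ms):
    """Return (bin vs. ter, bin vs. 2-bit, ter vs. 2-bit), rounded to one decimal."""
    return (
        round(ter_ms / bin_ms, 1),
        round(two_bit_ms / bin_ms, 1),
        round(two_bit_ms / ter_ms, 1),
    )


ratios = {key: speedup_ratios(*value) for key, value in times_ms.items()}
for key, value in ratios.items():
    print(key, value)
```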
3.2 Quantization Accuracy
3.2.1 Experiment Setup
To evaluate the performance of the proposed quantization algorithm, we compare the quantization
result with several state-of-the-art methods. We evaluate the accuracy of the algorithms on ImageNet
[48].
For ImageNet classification, all the images are re-scaled with the shorter edge to be 256. Training
images are then randomly cropped into resolution of 224×224. After that, the images are normalized
using the mean and standard deviation. No additional augmentations are performed except the random
horizontal flip. Validation images follow a similar procedure except the random crop is replaced
with the center crop and no flip is applied. We conduct experiments on the vanilla ResNet models
[49]. Similar to previous works [2, 7], we do not quantize the first and last layers. Unless otherwise
mentioned, the initial learning rate is set to 1e-2 and cosine annealing decay is employed.
Other default hyper-parameters include an SGD optimizer with momentum of 0.9, weight decay of
2e-5, and a maximum of 90 training epochs. The quantization related parameters α1 and α2 are
initialized to be 1.0 for weights and activations in all quantized layers. We initialize the quantized
network with the pretrained full-precision weights at the beginning of the quantization.
3.2.2 Evaluation on ImageNet
We list the quantization results of the proposed FATNN on a series of ResNet architectures in Table 4.
Note that there are few quantization algorithms specially tailored to TNNs. To make a fair comparison,
we implement the current best-performing quantization algorithm LSQ [9] and reproduce the accuracy
reported in the original paper. From Table 4, we observe a steady Top-1 accuracy improvement
of our FATNN over LSQ on all compared architectures in the ternary case. This result strongly
justifies the effectiveness of the proposed non-uniform step size quantization strategy. Moreover,
it also demonstrates that our FATNN can achieve the state-of-the-art accuracy while boosting the
implementation speed of conventional TNNs by 2×.
Besides, as the “2-bit activation and binary weight” networks also have a computational
complexity of O(2N), similar to our FATNN, we include the quantization results from several other
algorithms for further comparison. We can see from Table 4 that our FATNN achieves a
Top-1 accuracy gain of at least 2% on ResNet-18 over the non-uniform quantization algorithms, such
as LQ-Net [7], HWGQ [24] and QN [34]. Moreover, compared to the uniform step size quantization
algorithms, including LSQ and DoReFa-Net [2], we also achieve the best performance on various
architectures. This further shows that our FATNN generalizes well on the large-scale dataset.
Table 4: Accuracy (%) comparisons between our FATNN and other algorithms. “A/W” in the second
column indicates the bit configuration for activations and weights respectively. “ter” denotes ternary.
Results for LSQ are based on our own implementation. Results for algorithms, including TBN,
LQ-Net, HWGQ, DoReFa-Net and QN, are directly cited from the original papers.
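| Method | A/W | ResNet-18 Top-1 | ResNet-18 Top-5 | ResNet-34 Top-1 | ResNet-34 Top-5 | ResNet-50 Top-1 | ResNet-50 Top-5 | ResNet-101 Top-1 | ResNet-101 Top-5 |
|--------|-----|------|------|------|------|------|------|------|------|
|  | 32/32 | 69.8 | 89.1 | 73.3 | 91.4 | 76.1 | 92.9 | 77.4 | 93.5 |
| FATNN (Ours) | ter/ter | 65.4 | 86.2 | 69.5 | 88.9 | 71.6 | 90.3 | 74.3 | 91.8 |
| LSQ [9] | ter/ter | 64.7 | 85.6 | 69.0 | 88.8 | 71.2 | 90.1 | 73.7 | 91.5 |
| TBN [41] | ter/1 | 55.6 | 79.0 | 58.2 | 81.0 | - | - | - | - |
| LSQ [9] | 2/1 | 64.9 | 85.8 | 69.1 | 88.8 | 71.0 | 90.0 | - | - |
| LQ-Net [7] | 2/1 | 62.6 | 84.3 | 66.6 | 86.9 | 68.7 | 88.4 | - | - |
| HWGQ [24] | 2/1 | 59.6 | 82.2 | 64.3 | 85.7 | 64.6 | 85.9 | - | - |
| DoReFa-Net [2] | 2/1 | 53.4 | - | - | - | - | - | - | - |
| QN [34] | 2/1 | 63.4 | 84.9 | - | - | - | - | - | - |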
4 Conclusion
In this paper, we have proposed a fast ternary neural network, named FATNN. Specifically, we
emphasize that the underlying implementation and the quantization algorithm are highly correlated
and should be co-designed. From the implementation perspective, we exploit the “non-overflow”
property to design a novel ternary inner product that uses only bit operations. As a result, our FATNN
achieves 2× lower complexity than the conventional TNNs. Moreover, we have designed an
efficient quantization algorithm in accordance with the constraints of the implementation. Extensive
experiments have shown that FATNN improves on previous TNNs in both execution
time and quantization accuracy. Therefore, we advocate rethinking the value of TNNs and believe
that FATNN will serve as a strong benchmark for further research.
5 Appendix
5.1 Acceleration Details
As a fundamental operation in convolutional neural networks, the vector inner product is one of the
core components in acceleration. In this section, we first elaborate on the “non-overflow”
property and the matching parallelism degree, as well as their importance to the fast implementation.
Then we present the detailed implementation of the proposed fast ternary inner product and introduce
its application in the convolutional and fully-connected layers.
Non-overflow property: If the products of the quantized values in the inner product stay within the
same range as the input quantized values, we say that the quantization has the “non-overflow”
property, as mentioned in Section 2.2. Binary quantized values {1,−1} and ternary quantized values
{1, 0,−1} are examples. We attribute the fast implementation of our FATNN to the “non-overflow”
property because it enables the same parallelism degree for the multiplication (i.e., xnor) and
accumulation (i.e., popcount) operations. Specifically, in the BNN case, each quantized value takes
1 bit, so 8 values can be packed into one byte. When xnor is employed for the multiplication, the
parallelism degree is 8. Moreover, popcount also accumulates 8 values at the same time (the same
parallelism degree as the xnor operation). In the TNN case, each quantized value takes 2 bits, so
only 4 values can be packed into one byte. Thus, the parallelism degree of our TM(·) in Eq. (3) is 4,
and the popcount operation also accumulates these 4 values simultaneously. In contrast, if the
standard 2-bit quantization values {0, 1, 2, 3} are used, the multiplication built from bit-wise
operators (such as xnor) can reach a parallelism degree of 4 when packed in a byte, but 4 bits are
required to encode each multiplication result. Thus, the parallelism degree of the accumulation
procedure is at most 2 (less than 4), and the computational efficiency is halved. This analysis
shows that the “non-overflow” property is an important attribute for enabling the fast
implementation of ternary and binary networks.
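The parallelism argument above can be checked with a few lines of Python. This is only an illustrative sketch: the helper names `product_values` and `lanes_per_byte` are hypothetical, and for the overflowing case it assumes that each product is stored as a plain unsigned integer.

```python
from itertools import product

def product_values(levels):
    """All values that x * y can take when x and y are quantization levels."""
    return {x * y for x, y in product(levels, repeat=2)}

def lanes_per_byte(levels):
    """Return (multiply lanes, accumulate lanes) available in one byte."""
    in_bits = (len(levels) - 1).bit_length()   # bits per quantized input value
    prods = product_values(levels)
    if prods <= set(levels):
        # Non-overflow: every product is itself a valid level, so it reuses the input codec.
        out_bits = in_bits
    else:
        # Overflow (assumption): store the largest product as a plain unsigned integer.
        out_bits = max(prods).bit_length()
    return 8 // in_bits, 8 // out_bits

for name, levels in [("binary", [-1, 1]), ("ternary", [-1, 0, 1]), ("2-bit", [0, 1, 2, 3])]:
    print(name, sorted(product_values(levels)), lanes_per_byte(levels))
```

For the binary and ternary levels the two lane counts agree, whereas for {0, 1, 2, 3} the accumulate side has half as many lanes as the multiply side.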
Algorithm 1: Fast Ternary Inner Product
Input: (1) Full-precision weight vector w ∈ R^N and activation vector a ∈ R^N. (2) Pre-allocated
temporary buffers ŵ and â of unsigned char type with length N/4. (3) Quantization parameters
α_1^w, α_2^w and α_1^a, α_2^a, which parameterize the step sizes for weights and activations,
respectively.
Output: The ternary inner product result z.
1  // Step (1): data packing;
2  for i ← 0 to (N−1)/4 do
3      Pack 4 values of a[4i : 4i+3] into one unsigned char and store it in â[i];
4      Pack 4 values of w[4i : 4i+3] into one unsigned char and store it in ŵ[i];
5  end
6  // Step (2): ternary inner product;
7  acc = 0;
8  for i ← 0 to (N−1)/4 do
9      acc += popcount(TM(ŵ[i], â[i]));
10 end
11 z = acc − N
As explained in Section 2.3, we design a fast implementation for TNNs by exploiting the “non-
overflow” property. Algorithm 1 summarizes the inference flow of the proposed fast ternary inner
product. In Algorithm 1, lines 2 ∼ 5 quantize the full-precision input vectors into the codec of the
quantized values. As 2 bits are required to encode one ternary value, 4 quantized values can be
packed into one byte. It is worth noting that it is also possible to pack the data into other data
types. For example, 8 values can be packed into a 16-bit short type, or 16 values into a 32-bit int
type. More specifically, during the packing, each full-precision value is compared with the
corresponding quantization thresholds characterized by {α1, α2} and assigned the corresponding codec
value. After that, based on Eqs. (5), (6), (7), the accumulation is performed in lines 8 ∼ 10 of
Algorithm 1. Finally, the logic level inner product result is obtained according to Eq. (3) in
line 11.
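The two steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the 2-bit codec (−1 → 00, 0 → 01, +1 → 11), the function `tm` standing in for TM(·), the threshold arguments `t_low` and `t_high`, and the requirement that N is a multiple of 4 are all hypothetical choices made so that popcount(code) − 1 equals the ternary value.

```python
# Hypothetical 2-bit codec (an assumption): popcount(code) - 1 equals the ternary value.
CODE = {-1: 0b00, 0: 0b01, 1: 0b11}
LOW = 0x55  # mask selecting the low bit of each of the four 2-bit fields in a byte

def quantize(v, t_low, t_high):
    """Map a real value to {-1, 0, 1} with two thresholds (t_low < t_high)."""
    if v < t_low:
        return -1
    return 1 if v > t_high else 0

def pack(values, t_low, t_high):
    """Step (1): quantize the values and pack 4 ternary codes into each byte."""
    assert len(values) % 4 == 0, "this sketch assumes N is a multiple of 4"
    out = bytearray(len(values) // 4)
    for i, v in enumerate(values):
        out[i // 4] |= CODE[quantize(v, t_low, t_high)] << (2 * (i % 4))
    return out

def tm(x, y):
    """Illustrative stand-in for TM: field-wise ternary product of two packed bytes."""
    x_hi, x_lo = (x >> 1) & LOW, x & LOW
    y_hi, y_lo = (y >> 1) & LOW, y & LOW
    nonzero = ~(x_hi ^ x_lo) & ~(y_hi ^ y_lo) & LOW  # both operands are +1 or -1
    same = ~(x_hi ^ y_hi) & LOW                      # operands share the same sign bit
    r_hi = nonzero & same                            # set only when the product is +1
    r_lo = (~nonzero | same) & LOW                   # cleared only when the product is -1
    return (r_hi << 1) | r_lo

def ternary_inner_product(w_packed, a_packed):
    """Step (2): accumulate popcounts, then subtract N to recover the signed result."""
    acc = 0
    for wb, ab in zip(w_packed, a_packed):
        acc += bin(tm(wb, ab)).count("1")
    return acc - 4 * len(w_packed)  # z = acc - N

# Usage: compare against a plain inner product of the quantized values.
w = [0.3, -0.2, 0.0, 0.5, -0.7, 0.1, 0.9, -0.4]
a = [1.2, 0.1, 0.8, 0.05, 1.5, 0.9, 0.2, 1.1]
z = ternary_inner_product(pack(w, -0.25, 0.25), pack(a, 0.3, 1.0))
reference = sum(quantize(x, -0.25, 0.25) * quantize(y, 0.3, 1.0) for x, y in zip(w, a))
assert z == reference
```

Because every field of the product byte holds product + 1 in popcount form, a single popcount per byte accumulates four products at once, and subtracting N at the end removes the offset.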
Following the implementation of the fast ternary inner product, the convolutional layer can be
realized by first expanding the input activation into a matrix (im2col) and then conducting the matrix
multiplication (gemm). To enhance the efficiency, the data packing in Algorithm 1 is integrated into
the im2col operation. After that, the matrix multiplication is realized based on the inner product step
in Algorithm 1. Tricks such as Winograd [50] are commonly employed in the gemm operation;
however, we do not integrate them, for simplicity. The fully-connected layer is implemented
similarly, as it can be regarded as a special case of the convolutional layer with the
kernel size equal to the feature map resolution.
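The im2col-then-gemm structure can be illustrated with a small pure-Python sketch. It assumes stride 1 and already-quantized ternary inputs stored as nested lists; the function names `im2col` and `conv_as_gemm` are hypothetical, and the plain `sum` marks the place where a packed ternary inner product would be used in a fast implementation.

```python
def im2col(x, k, pad):
    """Expand a C x H x W input into one row of length C*k*k per output position (stride 1)."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    out_h, out_w = H + 2 * pad - k + 1, W + 2 * pad - k + 1
    rows = []
    for i in range(out_h):
        for j in range(out_w):
            patch = []
            for c in range(C):
                for di in range(k):
                    for dj in range(k):
                        r, s = i + di - pad, j + dj - pad
                        inside = 0 <= r < H and 0 <= s < W
                        patch.append(x[c][r][s] if inside else 0)  # zero padding
            rows.append(patch)
    return rows

def conv_as_gemm(x, filters, k, pad):
    """filters: one flattened list of length C*k*k per output channel."""
    cols = im2col(x, k, pad)
    # Each entry is one inner product between a filter and an im2col row.
    return [[sum(wv * cv for wv, cv in zip(f, col)) for col in cols] for f in filters]

# Usage: one 3x3 ternary feature map and one all-ones 3x3 filter.
x = [[[1, 0, -1], [0, 1, 1], [-1, 0, 1]]]
conv_out = conv_as_gemm(x, [[1] * 9], k=3, pad=1)
# A fully-connected layer as the special case kernel size == feature map resolution.
fc_out = conv_as_gemm(x, [[1] * 9], k=3, pad=0)
```

With `pad=0` and a kernel as large as the feature map, `im2col` yields a single row, which is why the fully-connected layer can reuse the same code path.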
Other operations, such as the ReLU non-linearity and skip connection layers, can be fused in
the im2col procedure. Besides, we fuse the batch normalization layers into the corresponding
convolutional or fully-connected layers.
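One common way to fuse batch normalization at inference time is to fold it into a per-channel scale and shift applied to the layer output. The sketch below shows that standard folding; it is an assumption that the fusion here follows the same pattern, and the function names are hypothetical.

```python
import math

def batchnorm(z, gamma, beta, mean, var, eps=1e-5):
    """Reference inference-time batch normalization of a scalar z."""
    return gamma * (z - mean) / math.sqrt(var + eps) + beta

def fold_batchnorm(gamma, beta, mean, var, eps=1e-5):
    """Fold the BN statistics into (scale, shift) so that BN(z) == scale * z + shift."""
    scale = gamma / math.sqrt(var + eps)
    shift = beta - mean * scale
    return scale, shift

# Usage: apply the folded parameters to an integer inner product result z.
scale, shift = fold_batchnorm(gamma=1.5, beta=0.2, mean=3.0, var=4.0)
z = -3
fused = scale * z + shift
assert abs(fused - batchnorm(z, 1.5, 0.2, 3.0, 4.0)) < 1e-9
```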
5.2 Visualization of Non-uniform Step Sizes
In this paper, we discretize the full-precision tensor into ternary quantized values with trainable
non-uniform step sizes. For each quantized layer, we learn two parameters α = {α1, α2} for weights
and activations separately. As illustrated in Figure 1, the quantization thresholds are directly related
to the learnt parameters α. When the two scale factors are identical (α1 = α2), the proposed
quantization algorithm reduces to uniform step size quantization [9, 2, 8]. Thus, it is interesting
to investigate the properties of the learnt α. We plot the distributions of the full-precision weights
and activations, together with the corresponding quantization thresholds, for four layers of
ResNet-18 in Figures 3 and 4, respectively. The max value, min value and the two quantization
thresholds of each tensor are marked in each sub-figure. On the one hand, from
Figure 3, we observe that the weight distribution varies considerably across layers.
To reduce the information loss during quantization, we therefore propose to learn the quantization
thresholds automatically to better fit the data distribution. On the other hand, Figure 4 shows
that the full-precision activations consist of dense, relatively small values and sparse, relatively
large values, so the number of elements falling into each quantization interval is unbalanced. It can
also be seen from Figure 4 that the quantization step sizes learnt by stochastic gradient descent
are non-uniform and differ across layers.
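As a concrete illustration of non-uniform step sizes, the sketch below quantizes a non-negative activation with two step sizes. It is only a sketch under assumptions: the levels {0, α1, α1 + α2} and the thresholds α1/2 and α1 + α2/2 are inferred from the legend of Figure 4 rather than taken from the quantizer definition, and the function name is hypothetical.

```python
def ternary_quantize_activation(x, alpha1, alpha2):
    """Quantize a non-negative value to {0, alpha1, alpha1 + alpha2}.

    Assumed thresholds: alpha1 / 2 and alpha1 + alpha2 / 2, i.e. the midpoints
    between neighbouring levels.
    """
    if x < alpha1 / 2:
        return 0.0
    if x < alpha1 + alpha2 / 2:
        return alpha1
    return alpha1 + alpha2

# With alpha1 == alpha2 the levels are equally spaced (uniform step size).
uniform = [ternary_quantize_activation(v, 1.0, 1.0) for v in (0.2, 0.6, 1.6)]
# With alpha1 != alpha2 the second step is wider than the first one.
non_uniform = [ternary_quantize_activation(v, 0.5, 2.0) for v in (0.2, 0.3, 1.4, 1.6)]
print(uniform, non_uniform)
```

A small α1 with a large α2 matches the situation described above, where many small activations share a fine first interval and the few large ones fall into a coarse second interval.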
5.3 Ablation Study on ImageNet
We further explore the effect of weight decay on our FATNN. We change the weight decay to 1e-5 and
evaluate the corresponding performance on various architectures in Table 5. From the table, we find
that the modified weight decay has limited influence on the performance on ResNet-18, ResNet-34
and ResNet-50.
Table 5: Impact of the weight decay on the proposed FATNN accuracy (%).
Network     Weight Decay   Top-1   Top-5
ResNet-18   1e-5           65.4    86.2
            2e-5           65.4    86.2
ResNet-34   1e-5           69.4    88.9
            2e-5           69.5    88.9
ResNet-50   1e-5           71.5    90.2
            2e-5           71.6    90.3
5.4 More Execution Time Benchmarks
We present more acceleration benchmark results in this section.
Figure 3: Weight quantization for ResNet-18. Panels: (a) layer1_1_conv2, (b) layer2_1_conv2,
(c) layer3_1_conv2, (d) layer4_1_conv2. Each panel shows the distribution of the full-precision
weights, with the two quantization thresholds and the max and min values marked.
Figure 4: Activation quantization for ResNet-18. Panels: (a) layer1_1_conv2, (b) layer2_1_conv2,
(c) layer3_1_conv2, (d) layer4_1_conv2. Each panel shows the distribution of the full-precision
activations, with the two quantization thresholds (α1/2 and α1 + α2/2) and the max value marked.
First, for convenience of comparison, we list the exact execution times for the layer-wise benchmark
described in Section 3.1 in Table 6 and Table 7. We use Q821/Q835/1080Ti/2080Ti to
indicate the Qualcomm 821, Qualcomm 835, Nvidia 1080Ti and Nvidia 2080Ti, respectively. Besides,
“bin” represents binary and “ter” means ternary in Table 6 and Table 7. As mentioned in Section 3.1,
we measure the layer-wise execution time for convolution layers (kernel size = 3×3, padding = 1,
stride = 1, the same number of input and output channels, and batch size = 1) with seven different
shape configurations. For the first four cases, we fix the channel number to 64 and increase the
resolution from 28 to 224. For the last three cases, we fix the resolution to 56 and double the
channel number from 64 to 512.
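To make the seven shape configurations concrete, the sketch below enumerates them and counts the multiply-accumulate operations of each layer. The intermediate values (resolutions 56 and 112, channel numbers 128 and 256) are an assumption obtained by doubling between the stated end points, and `conv_macs` is a hypothetical helper.

```python
# Assumed reading of case1..case7 as (channels, resolution); the intermediate
# values are inferred by doubling and are not stated explicitly in the text.
CASES = [(64, 28), (64, 56), (64, 112), (64, 224), (128, 56), (256, 56), (512, 56)]

def conv_macs(channels, resolution, kernel=3):
    """MACs of a stride-1 convolution with padding that keeps the resolution
    and equal numbers of input and output channels (batch size 1)."""
    return resolution * resolution * channels * channels * kernel * kernel

for idx, (c, r) in enumerate(CASES, start=1):
    print(f"case{idx}: channels={c}, resolution={r}x{r}, MACs={conv_macs(c, r):,}")
```

Under this reading, doubling either the resolution or the channel number multiplies the operation count by four.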
Second, in Remark 1, we discuss an alternative solution for the ternary implementation. By
taking the model weights to be the y variable in Eqs. (5) and (6), some intermediate variables in
Eqs. (5), (6) and (7) can be pre-computed to reduce the computational burden. However, compared
with the on-the-fly computing in Eqs. (5) and (6), this requires extra memory loading time for the
pre-computed variables. To clarify the impact, we measure the exact execution time of the two modes
on different devices in Table 8. We observe in Table 8 that the execution time difference between
the two modes is small and that the optimal choice is highly dependent on the running platform and
layer shape.
Table 6: Exact execution time (ms) on embedded-side platforms. We run 5 times and report the
mean results.
Device  A/W      case1  case2  case3  case4  case5  case6  case7
Q821    bin/bin  0.6    1.2    2.6    8.3    2.3    6      19.9
        ter/ter  0.9    1.7    4.4    14.8   4.2    13.5   49
        2/2      1.9    3.5    8      24.9   7.2    21.2   75.5
Q835    bin/bin  0.7    0.9    1.9    6.1    1.8    4.8    17.4
        ter/ter  0.9    1.5    3.7    13.2   3.2    12.7   45.4
        2/2      2.1    3      6.5    20.5   5.6    15.8   69.8
Table 7: Exact execution time (µs) on server-side platforms. We run 5 times and report the mean
results.
Device  A/W      case1  case2  case3  case4  case5  case6  case7
1080Ti  bin/bin  9.3    17.7   45.3   142.7  32.7   97     318
        ter/ter  12     25.7   68     188.8  54.3   177.7  574
        2/2      39.3   72.6   200.4  621.8  136.8  405.1  1290.9
2080Ti  bin/bin  11     12.5   19.5   56.5   22     55.5   136
        ter/ter  11     14.5   26     90     34     87     267.5
        2/2      38     44     73.5   274    91     217.5  551.5
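For a quick comparison, the speedup of the ternary kernels over the 2-bit kernels can be computed directly from the mean times reported in Table 6 and Table 7. The snippet below is a small convenience sketch; the dictionary layout and the `speedups` helper are hypothetical, while the numbers are copied from the two tables.

```python
# Mean execution times for case1..case7, copied from Table 6 (ms) and Table 7 (µs).
TIMES = {
    "Q821":   {"ter/ter": [0.9, 1.7, 4.4, 14.8, 4.2, 13.5, 49],
               "2/2":     [1.9, 3.5, 8, 24.9, 7.2, 21.2, 75.5]},
    "Q835":   {"ter/ter": [0.9, 1.5, 3.7, 13.2, 3.2, 12.7, 45.4],
               "2/2":     [2.1, 3, 6.5, 20.5, 5.6, 15.8, 69.8]},
    "1080Ti": {"ter/ter": [12, 25.7, 68, 188.8, 54.3, 177.7, 574],
               "2/2":     [39.3, 72.6, 200.4, 621.8, 136.8, 405.1, 1290.9]},
    "2080Ti": {"ter/ter": [11, 14.5, 26, 90, 34, 87, 267.5],
               "2/2":     [38, 44, 73.5, 274, 91, 217.5, 551.5]},
}

def speedups(device):
    """Per-case ratio of the 2/2 time to the ter/ter time on one device."""
    rows = TIMES[device]
    return [slow / fast for slow, fast in zip(rows["2/2"], rows["ter/ter"])]

for device in TIMES:
    print(device, [round(s, 2) for s in speedups(device)])
```

The ratio is unit-free, so the millisecond and microsecond tables can be handled by the same helper.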
Table 8: Exact execution time (µs) of the two variants of ternary implementations. “D” in the second
column represents the on-the-fly computing and “P” indicates pre-computing certain variables and
loading them during runtime. We run 5 times and report the results with mean and standard deviation.
The faster of the two modes in each case is marked with *.
Device  Mode  case1        case2        case3        case4           case5         case6
Q821    D     594 ± 28*    1190 ± 14*   3425 ± 8*    12373 ± 29*     3631 ± 27     13202 ± 3
        P     655 ± 7      1246 ± 17    3461 ± 10    12394 ± 24      3536 ± 32*    12526 ± 13*
Q835    D     485 ± 4*     1007 ± 3*    2847 ± 28*   11023 ± 26*     2606 ± 19*    12122 ± 46
        P     640 ± 5      1157 ± 8     3318 ± 4     11200 ± 1529    2611 ± 14     11854 ± 55*
1080Ti  D     12 ± 0.6*    25.7 ± 0.6*  68 ± 0.6*    188.3 ± 2*      54.3 ± 0.6*   177.7 ± 0.6*
        P     14.7 ± 0.6   30 ± 0.6     84.7 ± 0.6   273.7 ± 3       65.7 ± 0.6    237.3 ± 0.6
2080Ti  D     12 ± 1.4     15.5 ± 0.7   29 ± 0.7     92.5 ± 2.1      35.5 ± 2.1    88.5 ± 0.7
        P     11 ± 0*      14.5 ± 0.7*  26 ± 0.7*    90 ± 0.7*       34 ± 0.7*     87 ± 0.7*
References
[1] Bohan Zhuang, Chunhua Shen, Mingkui Tan, Lingqiao Liu, and Ian Reid, “Towards effective
low-bitwidth convolutional neural networks,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn.,
2018.
[2] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou, “Dorefa-net:
Training low bitwidth convolutional neural networks with low bitwidth gradients,” arXiv
preprint arXiv:1606.06160, 2016.
[3] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf, “Pruning filters for
efficient convnets,” arXiv preprint arXiv:1608.08710, 2017.
[4] Zhuangwei Zhuang, Mingkui Tan, Bohan Zhuang, Jing Liu, Yong Guo, Qingyao Wu, Junzhou
Huang, and Jinhui Zhu, “Discrimination-aware channel pruning for deep neural networks,” in
Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 875–886.
[5] Barret Zoph and Quoc V Le, “Neural architecture search with reinforcement learning,” in Proc.
Int. Conf. Learn. Repren., 2017.
[6] Hanxiao Liu, Karen Simonyan, and Yiming Yang, “Darts: Differentiable architecture search,”
in Proc. Int. Conf. Learn. Repren., 2019.
[7] Dongqing Zhang, Jiaolong Yang, Dongqiangzi Ye, and Gang Hua, “Lq-nets: Learned
quantization for highly accurate and compact deep neural networks,” in Proc. Eur. Conf. Comp. Vis.,
2018.
[8] Sangil Jung, Changyong Son, Seohyung Lee, Jinwoo Son, Jae-Joon Han, Youngjun Kwak,
Sung Ju Hwang, and Changkyu Choi, “Learning to quantize deep networks by optimizing
quantization intervals with task loss,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2019, pp.
4350–4359.
[9] Steven K Esser, Jeffrey L McKinstry, Deepika Bablani, Rathinakumar Appuswamy, and
Dharmendra S. Modha, “Learned step size quantization,” in Proc. Int. Conf. Learn. Repren.,
2020.
[10] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi, “Xnor-net: Imagenet
classification using binary convolutional neural networks,” in Proc. Eur. Conf. Comp. Vis., 2016,
pp. 525–542.
[11] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio,
“Binarized neural networks,” in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 4107–4115.
[12] Tianli Zhao, Xiangyu He, Jian Cheng, and Jing Hu, “Bitstream: Efficient computing architecture
for real-time low-power inference of binary neural networks on CPUs,” in Proc. ACM Int. Conf.
Multimedia, 2018, pp. 1545–1552.
[13] Joseph Bethge, Marvin Bornstein, Adrian Loy, Haojin Yang, and Christoph Meinel, “Training
competitive binary neural networks from scratch,” arXiv preprint arXiv:1812.01965, 2018.
[14] Joseph Bethge, Haojin Yang, Christian Bartz, and Christoph Meinel, “Learning to train a binary
neural network,” arXiv preprint arXiv:1809.10463, 2018.
[15] Haojin Yang, Martin Fritzsche, Christian Bartz, and Christoph Meinel, “Bmxnet: An open-
source binary neural network implementation based on mxnet,” in Proc. of the ACM Int. Conf.
on Multimedia. ACM, 2017, pp. 1209–1212.
[16] Wei Tang, Gang Hua, and Liang Wang, “How to train a compact binary neural network with
high accuracy?,” in Proc. AAAI Conf. on Arti. Intel., 2017, pp. 2625–2631.
[17] Yiwen Guo, Anbang Yao, Hao Zhao, and Yurong Chen, “Network sketching: Exploiting binary
structure in deep cnns,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2017, pp. 5955–5963.
[18] Zechun Liu, Baoyuan Wu, Wenhan Luo, Xin Yang, Wei Liu, and Kwang-Ting Cheng, “Bi-real
net: Enhancing the performance of 1-bit cnns with improved representational capability and
advanced training algorithm,” in Proceedings of the European Conference on Computer Vision
(ECCV), 2018, pp. 722–737.
[19] Fengfu Li, Bo Zhang, and Bin Liu, “Ternary weight networks,” arXiv preprint
arXiv:1605.04711, 2016.
[20] Chenzhuo Zhu, Song Han, Huizi Mao, and William J Dally, “Trained ternary quantization,” in
Proc. Int. Conf. Learn. Repren., 2017.
[21] Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang, Vijayalakshmi
Srinivasan, and Kailash Gopalakrishnan, “Pact: Parameterized clipping activation for quantized
neural networks,” arXiv preprint arXiv:1805.06085, 2018.
[22] Bohan Zhuang, Chunhua Shen, Mingkui Tan, Lingqiao Liu, and Ian Reid, “Structured binary
neural network for accurate image classification and semantic segmentation,” in Proc. IEEE
Conf. Comp. Vis. Patt. Recogn., 2019.
[23] Xiaofan Lin, Cong Zhao, and Wei Pan, “Towards accurate binary convolutional neural network,”
in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 344–352.
[24] Zhaowei Cai, Xiaodong He, Jian Sun, and Nuno Vasconcelos, “Deep learning with low precision
by half-wave gaussian quantization,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2017, pp.
5918–5926.
[25] Yoojin Choi, Mostafa El-Khamy, and Jungwon Lee, “Learning low precision deep neural
networks through regularization,” ArXiv, vol. abs/1809.00095, 2018.
[26] Chaim Baskin, Eli Schwartz, Evgenii Zheltonozhskii, Natan Liss, Raja Giryes, Alex M
Bronstein, and Avi Mendelson, “Uniq: Uniform noise injection for non-uniform quantization of
neural networks,” arXiv preprint arXiv:1804.10969, 2018.
[27] Lu Hou and James T Kwok, “Loss-aware weight quantization of deep networks,” in Proc. Int.
Conf. Learn. Repren., 2018.
[28] Ruizhou Ding, Ting-Wu Chin, Zeye Liu, and Diana Marculescu, “Regularizing activation
distribution for training binarized deep networks,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn.,
2019, pp. 11408–11417.
[29] Yu Bai, Yu-Xiang Wang, and Edo Liberty, “Proxquant: Quantized neural networks via proximal
operators,” in Proc. Int. Conf. Learn. Repren., 2019.
[30] Asit Mishra and Debbie Marr, “Apprentice: Using knowledge distillation techniques to improve
low-precision network accuracy,” in Proc. Int. Conf. Learn. Repren., 2018.
[31] Eunhyeok Park, Junwhan Ahn, and Sungjoo Yoo, “Weighted-entropy-based quantization for
deep neural networks,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2017, pp. 5456–5464.
[32] Antonio Polino, Razvan Pascanu, and Dan Alistarh, “Model compression via distillation and
quantization,” in Proc. Int. Conf. Learn. Repren., 2018.
[33] Christos Louizos, Matthias Reisser, Tijmen Blankevoort, Efstratios Gavves, and Max Welling,
“Relaxed quantization for discretized neural networks,” in Proc. Int. Conf. Learn. Repren., 2019.
[34] Jiwei Yang, Xu Shen, Jun Xing, Xinmei Tian, Houqiang Li, Bing Deng, Jianqiang Huang, and
Xian-sheng Hua, “Quantization networks,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., June
2019.
[35] Cong Leng, Hao Li, Shenghuo Zhu, and Rong Jin, “Extremely low bit neural network: Squeeze
the last bit out with admm,” AAAI, 2018.
[36] Peisong Wang, Qinghao Hu, Yifan Zhang, Chunjie Zhang, Yang Liu, and Jian Cheng, “Two-step
quantization for low-bit neural networks,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2018,
pp. 4376–4384.
[37] Andrey Ignatov, Radu Timofte, William Chou, Ke Wang, Max Wu, Tim Hartley, and Luc
Van Gool, “Ai benchmark: Running deep neural networks on android smartphones,” in Proc.
Eur. Conf. Comp. Vis., 2018.
[38] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan
Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al., “TVM: An automated end-to-end optimizing
compiler for deep learning,” in USENIX Symp. Operating Systems Design & Implementation,
2018, pp. 578–594.
[39] Yaman Umuroglu, Nicholas Fraser, Giulio Gambardella, Michaela Blott, Philip Leong, Magnus
Jahre, and Kees Vissers, “Finn: A framework for fast, scalable binarized neural network
inference,” in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays. ACM, 2017, pp.
65–74.
[40] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard,
Hartwig Adam, and Dmitry Kalenichenko, “Quantization and training of neural networks for
efficient integer-arithmetic-only inference,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2018.
[41] Diwen Wan, Fumin Shen, Li Liu, Fan Zhu, Jie Qin, Ling Shao, and Heng Tao Shen, “Tbn:
Convolutional neural network with ternary inputs and binary weights,” in Proc. Eur. Conf.
Comp. Vis., September 2018.
[42] Jianhao Zhang, Yingwei Pan, Ting Yao, He Zhao, and Tao Mei, “dabnn: A super fast inference
framework for binary neural networks on arm devices,” in Proc. ACM Int. Conf. Multimedia,
2019, pp. 2272–2275.
[43] Lei Deng, Peng Jiao, Jing Pei, Zhenzhi Wu, and Guoqi Li, “Gxnor-net: Training deep neural
networks with ternary weights and activations without full-precision memory under a unified
discretization framework,” Neural Networks, vol. 100, pp. 49–58, 2018.
[44] Darryl Lin, Sachin Talathi, and Sreekanth Annapureddy, “Fixed point quantization of deep
convolutional networks,” in Proc. Int. Conf. Mach. Learn., 2016, pp. 2849–2858.
[45] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David, “Binaryconnect: Training deep
neural networks with binary weights during propagations,” in Proc. Adv. Neural Inf. Process.
Syst., 2015, pp. 3123–3131.
[46] Yoshua Bengio, Nicholas Léonard, and Aaron Courville, “Estimating or propagating gradients
through stochastic neurons for conditional computation,” arXiv preprint arXiv:1308.3432, 2013.
[47] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito,
Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer, “Automatic differentiation in
pytorch,” in Proc. Adv. Neural Inf. Process. Syst. Workshops, 2017.
[48] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng
Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al., “Imagenet large scale visual
recognition challenge,” Int. J. Comp. Vis., vol. 115, no. 3, pp. 211–252, 2015.
[49] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image
recognition,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016, pp. 770–778.
[50] Athanasios Xygkis, Lazaros Papadopoulos, David Moloney, Dimitrios Soudris, and Sofiane
Yous, “Efficient winograd-based convolution kernel implementation on edge devices,” in
Proceedings of the 55th Annual Design Automation Conference, New York, NY, USA, 2018,
DAC ’18, pp. 136:1–136:6.
