Computation Error Analysis of Block Floating Point Arithmetic Oriented
  Convolution Neural Network Accelerator Design by Song, Zhourui et al.
Zhourui Song1, Zhenyu Liu2 and Dongsheng Wang2
1 School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing, 100876, China
2 RIIT, Tsinghua University, Beijing, 100084, China
Email: songzr@bupt.edu.cn liuzhenyu73@mail.tsinghua.edu.cn wds@tsinghua.edu.cn
Abstract
The heavy burden of computation and off-chip traffic impedes the
deployment of large-scale convolutional neural networks (CNNs) on
embedded platforms. Because CNNs tolerate computation errors well,
employing block floating point (BFP) arithmetic in CNN accelerators
can reduce hardware cost and data traffic efficiently while
maintaining classification accuracy. In this paper, we verify the
effect of the BFP word width on CNN performance without retraining.
Several typical CNN models, including VGG16, ResNet-18, ResNet-50 and
GoogLeNet, were tested. Experiments revealed that an 8-bit mantissa,
including the sign bit, in the BFP representation induced less than
0.3% accuracy loss. In addition, we investigate the computational
errors in theory and develop a noise-to-signal ratio (NSR) upper
bound, which provides guidance for BFP-based CNN engine design.
1 Introduction
Convolutional neural networks (CNNs) have achieved state-of-the-art
performance in many artificial intelligence tasks, especially image
recognition (Ciregan, Meier, and Schmidhuber 2012; Russakovsky et al.
2015b), natural language processing (Kim 2014; Goldberg 2016), and
strategic planning (Silver et al. 2016). This success is partially
facilitated by advances in computation infrastructure. With GPU
clusters, large-scale CNNs can be deployed even though they are
memory- and computation-intensive and resource-consuming (Li et al.
2016). However, when deploying CNNs in data centers, GPU clusters are
not the first preference because of the low power efficiency of GPUs.
Therefore, improving energy efficiency has become a prominent target
in CNN accelerator design. Researchers have explored deploying CNNs
on FPGAs (Ovtcharov et al. 2015) and designing ASICs (Jouppi et al.
2017), as these possess higher energy efficiency due to their
specialized architectures.
To port CNNs to FPGAs, two serious issues need to be addressed: the
off-chip traffic bottleneck and the large overhead of floating-point
arithmetic. The off-chip traffic arises because, for large-scale
networks, the feature maps and the network parameters must be stored
in off-chip SDRAM. Frequent accesses to these data induce non-trivial
bandwidth requirements. Moreover, as hardwired floating-point modules
are not always available in FPGAs, employing floating-point
operations in an FPGA-based CNN accelerator severely degrades both
throughput and energy efficiency.

Copyright © 2018, Association for the Advancement of Artificial
Intelligence (www.aaai.org). All rights reserved.

In this paper, we propose a block floating point (BFP) based
convolution implementation. BFP can be regarded as a special case of
floating point representation in which the numbers within a block
share a common exponent. Hence, BFP retains the high dynamic range of
floating point representation, since it has an exponent part. On the
other hand, the computational complexity of operations between two
BFP blocks is reduced to that of integer arithmetic. Experiments in
this paper reveal that even with a 7-bit mantissa (8 bits with the
sign bit), the accuracy loss is less than 0.3% when implementing
large-scale network models such as VGG-16 (Simonyan and Zisserman
2014), ResNet-18, ResNet-50 (He et al. 2016) and GoogLeNet (Szegedy
et al. 2015). It should be noted that no retraining is required in
our experiments; that is, the original models can be deployed on our
BFP-based accelerator directly. Finally, this paper proposes an
analytical model to derive the NSR upper bound of finite word-length
BFP arithmetic, which efficiently supports the verification of
hardwired CNN accelerator designs.
The rest of this paper is organized as follows. Section 2 presents
related work. Section 3 discusses the details of applying BFP in
CNNs. Section 4 develops the theoretical NSR model of BFP. Section 5
verifies the performance of BFP-oriented CNNs on GoogLeNet, VGG-16,
ResNet-18, ResNet-50, mnist and cifar10, and illustrates the
effectiveness of our NSR model using the VGG-16 network. Section 6
concludes the paper.
2 Related Works
Methods such as data reuse, compression and trimming have been
developed to meet the bandwidth requirements of FPGAs. (Chen et al.
2017) (Karam et al. 2017) proposed a row-stationary data flow on 168
processing elements that improves energy efficiency by reusing data
locally. (Zhang et al. 2015) developed a roofline model to analyze
the computation and memory requirements of a given CNN model and then
identify the optimal solution on the provided FPGA platform.
Sparsifying the CNN model's parameters is another popular solution.
(Han, Mao, and Dally 2015) proposed a three-stage compression method,
namely pruning, trained quantization and Huffman coding, that
significantly reduced the size of DNNs without a decrease in
accuracy. However, retraining the sparse model is time-consuming, and
the entropy decoding of the model parameters causes additional delay
when accessing them; that is, the CNN accelerator's throughput is
degraded. Trimming (Han et al. 2015) (Parashar et al. 2017) also
depends on retraining, which is required to find the important
connections and abandon the others.

arXiv:1709.07776v2 [cs.LG] 24 Nov 2017

Researchers have also worked on replacing the 32-bit floating point
number format with fixed point formats. (Page and Mohsenin 2016)
applied singular value decomposition to dense layers and
limited-precision fixed-point representations to convolutional
weights, at the cost of less than 0.2% decrease in accuracy on MNIST,
CIFAR-10 and SVHN with a fixed point format of 3 integer bits and 6
fraction bits. Rounding models have also drawn attention. (Gupta et
al. 2015) showed that deep networks can be trained in 16-bit fixed
point representation with stochastic rounding. However, the common
weakness of the above methods is that they all require retraining to
amend the parameters, and retraining is very expensive. In addition,
when applied to deep neural networks, the quick growth of the word
width requirement consumes more chip area, power, and bandwidth,
which hinders the use of integer arithmetic in complex network
models. For example, (Hill et al. 2016) showed that GoogLeNet
requires a 40-bit fixed point representation to maintain acceptable
accuracy using stochastic rounding.
(Mellempudi et al. 2017) proposed a method that divides weights and
activations into clusters, where each cluster holds a joint scaling
factor. Therefore, the numbers in the same cluster can be represented
by an integer index, and the subsequent convolution operation can be
carried out in the integer domain. They designed a system that uses
8-bit integers and incurs a 6% decrease in ResNet-101 top-1 accuracy
without retraining. This scheme only partly eliminates the floating
point operations: the scaling procedure is still carried out with
floating point arithmetic, which even includes division and root
operations.
3 Block-Floating Point Arithmetic
Oriented CNN
3.1 Definition of Block Floating Point Arithmetic
With block floating point representation, n numbers belong-
ing to a block share the common scaling factor, i.e., the block
exponent. The block exponent is determined by the largest
magnitude in the block, and smaller ones will be right shifted
to align, which is called block formatting.
First, we introduce the nomenclature. For a block of numbers, denoted
as $X$, $x_i$ is the $i$th element of $X$, and $m_i$ and $e_i$ are
the mantissa and exponent of $x_i$. When $X$ is block formatted into
$X'$, the mantissa part and the block exponent are written as $M'_X$
and $\varepsilon_X$, respectively.
For example, given a block $X$ that contains $N$ floating point
numbers, $X$ can be expressed as
$$X = (x_1, \cdots, x_i, \cdots, x_N) = (m_1 \times 2^{e_1}, \cdots, m_i \times 2^{e_i}, \cdots, m_N \times 2^{e_N})$$
With BFP representation, $X$ is transformed to $X'$, which is written
as
$$X' = (x'_1, \cdots, x'_i, \cdots, x'_N) = M'_X \times 2^{\varepsilon_X}$$
where
$$M'_X = (m'_1, \cdots, m'_i, \cdots, m'_N), \qquad \varepsilon_X = \max_i e_i, \; i \in [1, N]$$
$\varepsilon_X$ is the maximum exponent in the block $X$ and $m'_i$
is the aligned entry-wise mantissa, derived as
$$m'_i = m_i \gg (\varepsilon_X - e_i) \tag{1}$$
where $a \gg b$ means right shifting $a$ by $b$ bits.
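
The block formatting step can be illustrated with a short Python sketch. It is not the authors' implementation: the function names `block_format` and `to_floats`, the use of signed integer mantissas of the form (b.bb...)_2, the round-half-up rule and the saturation on carry-out are all assumptions made for illustration.

```python
import math


def block_format(values, mantissa_bits):
    """Return (integer mantissas, block exponent) for a block of floats.

    Entry i is approximated by mantissas[i] * 2**(block_exp - (mantissa_bits - 1)).
    """
    # e_i such that x_i = m_i * 2**e_i with 1 <= |m_i| < 2; zeros carry no exponent.
    exponents = [math.frexp(v)[1] - 1 for v in values if v != 0.0]
    if not exponents:
        return [0] * len(values), 0
    block_exp = max(exponents)  # epsilon_X: exponent of the largest magnitude
    lsb = 2.0 ** (block_exp - (mantissa_bits - 1))  # weight of one mantissa LSB
    limit = (1 << mantissa_bits) - 1  # largest magnitude that fits the mantissa
    mantissas = []
    for v in values:
        # Dividing by the LSB weight and rounding half up plays the role of the
        # right shift by (epsilon_X - e_i) in equation (1), with round-off.
        magnitude = min(math.floor(abs(v) / lsb + 0.5), limit)  # saturate on carry-out
        mantissas.append(-magnitude if v < 0 else magnitude)
    return mantissas, block_exp


def to_floats(mantissas, block_exp, mantissa_bits):
    """Convert a formatted block back to floats."""
    lsb = 2.0 ** (block_exp - (mantissa_bits - 1))
    return [m * lsb for m in mantissas]


block = [0.75, -3.0, 0.1]
mantissas, block_exp = block_format(block, mantissa_bits=4)
print(mantissas, block_exp)                # the shared exponent comes from -3.0
print(to_floats(mantissas, block_exp, 4))  # 0.1 is too small to survive alignment
```

In this example the smallest entry is shifted out entirely, which is the precision loss for small numbers in a block discussed in this section.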
For CNN accelerator design, the block floating point representation
has two advantages. First, the concise representation saves memory
and traffic bandwidth. With a BFP format of an Le-bit exponent, an
Lm-bit mantissa, and one sign bit, the average length of n numbers is
1 + Lm + Le/n bits per number, whereas floating point representation
costs 1 + Lm + Le bits per number. Second, with BFP, all
multiply-accumulate operations in the convolutional layers are
carried out in fixed-point format. Fixed-point arithmetic units in
FPGA and ASIC designs are much more efficient than floating-point
ones. For example, a 32-bit fixed-point adder on a Virtex-7 690T FPGA
consumes 1 DSP at a 300 MHz clock speed. In contrast, a 16-bit
4-stage pipelined floating-point adder consumes 2 DSPs and 117 LUTs
at a 219 MHz working frequency.
The acceleration in additions and multiplications is achieved at the
cost of lower computation accuracy than the floating-point
counterpart, because the small numbers in a block lose more valid
bits during the block-based aligning procedure shown in equation (1).
The errors introduced by the BFP transform are referred to as
quantization errors. There are two ways to handle the out-shifted
bits, namely truncation and rounding. Our experiments showed that
rounding outperforms truncation, because truncation always generates
a DC error that accumulates layer by layer and finally introduces a
large bias. In contrast, rounding introduces zero-mean Gaussian white
noise, so no bias accumulates.
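
The difference between the two ways of handling out-shifted bits can be checked with a small Python sketch. It is an illustration under stated assumptions, not the experiment reported here: mantissas are non-negative integers, every 8-bit value is equally likely, and rounding is round-half-up. The helper names are hypothetical.

```python
def shift_truncate(m, s):
    """Drop the s low bits of a non-negative integer mantissa."""
    return m >> s


def shift_round(m, s):
    """Drop the s low bits, rounding half up."""
    return (m + (1 << (s - 1))) >> s


def mean_error(shift_fn, s, mantissa_bits=8):
    """Average error over every mantissa value, in units of the retained LSB."""
    values = range(1 << mantissa_bits)
    total = sum((shift_fn(m, s) << s) - m for m in values)
    return total / len(values) / (1 << s)


trunc_bias = mean_error(shift_truncate, s=3)
round_bias = mean_error(shift_round, s=3)
print(trunc_bias, round_bias)
```

Under these assumptions truncation leaves a bias of nearly half a retained LSB, always with the same sign, whereas the residual bias of round-half-up is several times smaller and shrinks further as more bits are shifted out.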
The energy of the BFP quantization errors depends on the distribution
of the numbers within the block, the block size n and the mantissa
bit length. Specifically, when Lm is fixed, if X contains a few
numbers of large magnitude while the others are small, the overall
precision of X′ is low. For a given distribution, the more numbers a
block contains, the more likely it is to contain a large peak
together with a rather small mean value, which results in lower
overall precision. Obviously, the precision of BFP increases with Lm.
[Figure 1 (diagram not reproduced): two 2 × 2 kernels (kernel 0 and
kernel 1) slide over a 4 × 4 input feature map to produce two 3 × 3
output feature maps. Equivalently, the 2 × 4 matrix W multiplied by
the 4 × 9 matrix I, whose columns are the receptive fields, gives the
2 × 9 matrix O.]
Figure 1: Convolution operation transformed into matrix
multiplication. “W”, “I” and “O” represent the matrices transformed
from kernels, input feature maps and output feature maps,
respectively. In this figure, the padding and stride are set to 0 and
1, with 1 channel.
3.2 Matrix Representation of Convolutional
Neural Networks
When transforming convolution into a matrix operation, the kernels
and input feature maps are expanded into two matrices, W and I.
Specifically, the kernels belonging to the same output feature map
compose one row vector of W, and the receptive field of the input
feature maps for one output pixel constitutes one column vector of I.
This procedure is illustrated in Figure 1. The entry of O located at
the mth row and nth column corresponds to the output of the mth
kernel on the nth receptive field. It should be noted that
transforming a CNN into matrix operations is burdensome; therefore,
high performance CNN accelerators always apply the direct convolution
data flow (Chunsheng et al. 2017). In this paper, we adopt the matrix
representation merely to explain BFP in CNN computation.
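
The expansion described above can be sketched in plain Python for the single-channel, stride-1, zero-padding case of Figure 1. The helper names (`im2col`, `kernels_to_rows`, `matmul`, `direct_conv`) and the test data are hypothetical and chosen only for illustration.

```python
def im2col(image, kh, kw):
    """Build I: each column is one flattened kh x kw receptive field."""
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    fields = [
        [image[r + dr][c + dc] for dr in range(kh) for dc in range(kw)]
        for r in range(out_h)
        for c in range(out_w)
    ]
    return [list(row) for row in zip(*fields)]  # transpose: K rows, N columns


def kernels_to_rows(kernels):
    """Build W: each row is one flattened kernel."""
    return [[w for row in kernel for w in row] for kernel in kernels]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def direct_conv(image, kernel):
    """Reference sliding-window result, flattened row by row."""
    kh, kw = len(kernel), len(kernel[0])
    return [
        sum(kernel[dr][dc] * image[r + dr][c + dc]
            for dr in range(kh) for dc in range(kw))
        for r in range(len(image) - kh + 1)
        for c in range(len(image[0]) - kw + 1)
    ]


image = [[4 * r + c for c in range(4)] for r in range(4)]  # 4 x 4 input feature map
kernels = [[[1, 0], [0, -1]], [[1, 1], [1, 1]]]            # two 2 x 2 kernels
W = kernels_to_rows(kernels)   # M x K = 2 x 4
I = im2col(image, 2, 2)        # K x N = 4 x 9
O = matmul(W, I)               # M x N = 2 x 9
assert O == [direct_conv(image, k) for k in kernels]
```

Each row of `O` is one flattened 3 × 3 output feature map, so the matrix product reproduces the sliding-window result exactly.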
3.3 Hardwired CNN Accelerator Oriented
Matrix Partition for BFP Representation
As mentioned above, block formatting W and I brings the advantages of
BFP to hardwired CNN accelerator design. The precision of BFP is
affected by the distribution of numbers within the block, the block
size and the mantissa bit length. As the distributions of input
feature maps and weights are predetermined, we can only optimize the
other two factors, namely block size and mantissa bit length, to
improve the overall performance. Under this guideline, the prominent
issue is how to partition W and I. The matrix multiplication is
written as
$$O_{M \times N} = W_{M \times K} I_{K \times N}, \tag{2}$$
where $M$, $K$ and $N$ denote the number of output feature maps, the
size of the filters, and the size of one output feature map,
respectively. From the entry-wise perspective, matrix multiplication
is represented as
$$o_{mn} = \vec{w}_m^T \cdot \vec{i}_n \tag{3}$$
and, described row-wise or column-wise, it is recast as
$$\vec{o}_m^T = \vec{w}_m^T \cdot I \tag{4}$$
$$\vec{o}_n = W \cdot \vec{i}_n \tag{5}$$
In fact, (2), (3), (4) and (5) illustrate four different ways to
block format W and I. In (2), W and I are each block formatted as a
whole, so the storage requirement reaches its minimum at the price of
the worst accuracy loss. In (3), each $\vec{w}_m^T$ and each
$\vec{i}_n$ is block formatted as a vector; this gives the minimum
accuracy loss at an increased memory cost. Equations (4) and (5)
represent two balanced approaches that achieve a good tradeoff
between quantization accuracy and memory requirement, since each
block formats one matrix as a whole and the other by row vectors or
column vectors. The complexity and resource consumption of the above
BFP transform methods are compared in Table 1.
Method         ALW′                   ALI′                   NBE
Equation (2)   1 + LW + Le/(M × K)    1 + LI + Le/(K × N)    2
Equation (3)   1 + LW + Le/K          1 + LI + Le/K          M + N
Equation (4)   1 + LW + Le/K          1 + LI + Le/(K × N)    1 + M
Equation (5)   1 + LW + Le/(M × K)    1 + LI + Le/K          1 + N
Table 1: The cost of the four methods of block formatting W (M × K)
and I (K × N). “ALW′” and “ALI′” are the average storage lengths, in
bits per number, of W′ and I′. “NBE” is the number of block exponents
that need to be stored.
Consider the layer “conv1_1” of VGG-16. With the matrix
representation of (2), we have M = 64, K = 9 and N = 50176, where N
is much greater than M. According to Table 1, equations (3) and (5)
involve more than 50176 block formatting operations; moreover, their
cost of storing common exponents is hundreds of times (50176/64 =
784) larger than that of equations (2) and (4). The major difference
between equations (2) and (4) is the block size of W. We tested the
influence of block size on accuracy, as shown in Table 2. The
experiment revealed that the top-1 accuracy of equation (4) is 1.6
percentage points higher than that of equation (2). Therefore, we
choose equation (4) to block format W and I.
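
The entries of Table 1 can be evaluated for this layer with a few lines of Python. The layer shape M = 64, K = 9, N = 50176 is taken from the text; the word widths LW = LI = 7 and Le = 8 are assumptions chosen only to make the sketch concrete, and `bfp_costs` is a hypothetical helper.

```python
def bfp_costs(M, K, N, LW, LI, Le):
    """Return {method: (ALW', ALI', NBE)} following Table 1."""
    whole_w = 1 + LW + Le / (M * K)   # W formatted as one block
    by_row_w = 1 + LW + Le / K        # one block per row vector of W
    whole_i = 1 + LI + Le / (K * N)   # I formatted as one block
    by_col_i = 1 + LI + Le / K        # one block per column vector of I
    return {
        "Equation (2)": (whole_w, whole_i, 2),
        "Equation (3)": (by_row_w, by_col_i, M + N),
        "Equation (4)": (by_row_w, whole_i, 1 + M),
        "Equation (5)": (whole_w, by_col_i, 1 + N),
    }


# conv1_1 of VGG-16; LW, LI and Le are illustrative assumptions.
costs = bfp_costs(M=64, K=9, N=50176, LW=7, LI=7, Le=8)
for method, (alw, ali, nbe) in costs.items():
    print(f"{method}: ALW'={alw:.3f} bits, ALI'={ali:.3f} bits, NBE={nbe}")
```

The number of stored block exponents is 2 and 65 for equations (2) and (4), against 50240 and 50177 for equations (3) and (5), which is the gap the paragraph above refers to.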
Method           Top-1 Accuracy   Top-5 Accuracy
Equation (2)     0.6672           0.8768
Equation (4)     0.6832           0.884
Floating point   0.6808           0.8816
Table 2: The impact of block size on accuracy, tested with VGG-16 on
the ILSVRC12 dataset (Russakovsky et al. 2015a) with batch size set
to 50.
3.4 Data Flow of Block Formatting in CNN
For instance, suppose that
$$I = \begin{pmatrix} (1.01)_2 \times 2^0 & (1.01)_2 \times 2^0 \\ (1.01)_2 \times 2^1 & (1.01)_2 \times 2^2 \end{pmatrix}, \qquad W = \begin{pmatrix} (1.00)_2 \times 2^{-1} & (1.01)_2 \times 2^0 \end{pmatrix}$$
Let $L_W = 3$ and $L_I = 3$ denote the block mantissa bit lengths of
$W'$ and $I'$, excluding the sign bit. After scanning $I$, we find
the maximum exponent $\varepsilon_I = 2$, and the entries of $I$ are
then right shifted, with rounding, to align. Then,
$$I' = \begin{pmatrix} (0.01)_2 & (0.01)_2 \\ (0.11)_2 & (1.01)_2 \end{pmatrix} \times 2^2$$
By analogy,
$$W' = \begin{pmatrix} (0.10)_2 & (1.01)_2 \end{pmatrix} \times 2^0$$
Therefore, the multiplication of $W$ and $I$, i.e., $O = WI$, can be
approximated as
$$O \approx W'I' = 2^{\varepsilon_O} M'_O$$
where $M'_O = M'_W M'_I$ and $\varepsilon_O = \varepsilon_W + \varepsilon_I$.
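
The worked example can be replayed with integer arithmetic only, as a sketch of the data flow rather than the accelerator itself. It assumes that mantissas of the form (b.bb)_2 are stored as integers scaled by 2^2 and that alignment rounds half up; the names `align`, `I_entries` and `W_entries` are hypothetical.

```python
def align(mantissa, exponent, block_exp):
    """Right shift an integer mantissa to the block exponent, rounding half up."""
    shift = block_exp - exponent
    if shift == 0:
        return mantissa
    return (mantissa + (1 << (shift - 1))) >> shift


FRAC_BITS = 2  # (b.bb)_2 mantissas are stored as integers scaled by 2**2

# (mantissa, exponent) pairs from the example: (1.01)_2 -> 5, (1.00)_2 -> 4
I_entries = [[(5, 0), (5, 0)],
             [(5, 1), (5, 2)]]
W_entries = [(4, -1), (5, 0)]

eps_I = max(e for row in I_entries for _, e in row)
eps_W = max(e for _, e in W_entries)
M_I = [[align(m, e, eps_I) for m, e in row] for row in I_entries]
M_W = [align(m, e, eps_W) for m, e in W_entries]

# Integer-only multiply-accumulate: M'_O = M'_W M'_I
M_O = [sum(M_W[k] * M_I[k][n] for k in range(len(M_W)))
       for n in range(len(M_I[0]))]
eps_O = eps_W + eps_I

# Back to floating point; each factor carried FRAC_BITS fractional bits.
O_bfp = [m * 2.0 ** (eps_O - 2 * FRAC_BITS) for m in M_O]
O_exact = [
    sum((W_entries[k][0] / 4 * 2.0 ** W_entries[k][1])
        * (I_entries[k][n][0] / 4 * 2.0 ** I_entries[k][n][1])
        for k in range(2))
    for n in range(2)
]
print(M_I, M_W, M_O)
print(O_bfp, O_exact)
```

The aligned mantissas reproduce I′ and W′ above, and comparing `O_bfp` with `O_exact` shows the quantization error that the block formatting introduces into O.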
To avoid introducing rounding errors in $M'_O = M'_W M'_I$, the bit
width of the multiplier must be no less than $L_W + L_I + 2$,
including the sign bit, and the bit width of the accumulator must be
no less than $L_W + L_I + 2 + S$, where $S = \lfloor \log_2(K) \rfloor$,
to prevent overflow, as $K$ binary additions generate at most
$\lfloor \log_2(K) \rfloor$ carries. Details are shown in Figure 2.
4 Error Analysis of Block Floating Point
Convolution Operations
We propose a three-stage error analysis model. The first
stage is the quantization error, the second stage describes
the procedure of error accumulation in matrix multiplica-
tion, and the third one describes how the errors are trans-
ported between convolution layers.
4.1 Quantization Error Analysis Model
According to (Kalliojarvi and Astola 1996), for a block $X$, the
quantization error has zero mean and variance $\sigma^2$,
$$\sigma^2 = \frac{2^{-2L_m}}{12} \cdot \sum_{i=1}^{N_\gamma} p_{\gamma_i} 2^{2\gamma_i} \tag{6}$$
where $L_m$ is the bit length of the block mantissa and
$p_{\gamma_i}$ $(i = 1, \cdots, N_\gamma)$ is the probability mass
function (PMF) of the block exponents. $N_\gamma = 2^{L_E}$ is the
number of available block-exponent levels, where $L_E$ is the bit
length of the block exponent.
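As the values of the input feature maps and weight filters are known,
$p_{\gamma_i}$ is given by
$$p_{\gamma_i} = \begin{cases} 1 & i = \varepsilon_X \\ 0 & i \neq \varepsilon_X \end{cases} \tag{7}$$
Substituting (7) into (6), we derive
$$\sigma_\alpha^2 = \frac{2^{-2L_m}}{12} \cdot 2^{2\varepsilon_X} \tag{8}$$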
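
As a quick numerical check of this substitution, the following Python sketch evaluates equation (6) for an arbitrary PMF and equation (8) for a known block exponent. The function names and the dictionary representation of the PMF are assumptions made for illustration.

```python
def noise_variance_pmf(mantissa_bits, pmf):
    """Equation (6): pmf maps a block-exponent value gamma to its probability."""
    weighted = sum(p * 2.0 ** (2 * gamma) for gamma, p in pmf.items())
    return 2.0 ** (-2 * mantissa_bits) / 12.0 * weighted


def noise_variance_known_exponent(mantissa_bits, block_exp):
    """Equation (8): the block exponent is known, so the PMF is one-hot."""
    return 2.0 ** (-2 * mantissa_bits) / 12.0 * 2.0 ** (2 * block_exp)


Lm, eps_X = 3, 2
one_hot = {eps_X: 1.0}  # equation (7)
general = noise_variance_pmf(Lm, one_hot)
closed_form = noise_variance_known_exponent(Lm, eps_X)
print(general, closed_form)
```

With a one-hot PMF both functions return the same value, and each additional mantissa bit divides the noise variance by four.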
Based on equation (4), the input matrix is treated as a single
$K \times N$ block and the weight matrix is partitioned into $M$ row
vectors of size $1 \times K$. Thus the signal-to-noise ratio (SNR) of
the block floating point represented input matrix is
$$\mathrm{SNR}_i = 10 \cdot \log_{10} \frac{E(Y^2)}{\sigma_i^2} \tag{9}$$
where $E(Y^2)$ is the mean square of the input matrix and
$\sigma_i^2$ is the energy of the quantization error of $I'$.
Specifically,
$$\sigma_i^2 = \frac{2^{-2L_I}}{12} \cdot 2^{2\varepsilon_I} \tag{10}$$
Similarly, the SNR of the $m$th BFP represented row vector of the
weight matrix is
$$\mathrm{SNR}_{w_m} = 10 \cdot \log_{10} \frac{E(X_m^2)}{\sigma_{w_m}^2} \tag{11}$$
where $E(X_m^2)$ is the mean square of the $m$th row vector of the
weight matrix and $\sigma_{w_m}^2$ is the corresponding energy of the
quantization errors, formulated as
$$\sigma_{w_m}^2 = \frac{2^{-2L_W}}{12} \cdot 2^{2\varepsilon_{\vec{w}_m^T}} \tag{12}$$
The averaged SNR of the whole weight matrix is
$$\mathrm{SNR}_w = 10 \cdot \log_{10} \frac{\sum_{m=1}^{M} E(X_m^2)}{\sum_{m=1}^{M} \sigma_{w_m}^2} \tag{13}$$
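
The SNR definitions in equations (9) to (13) translate directly into code. The sketch below uses hypothetical per-row mean squares and block exponents; only the formulas themselves come from the text.

```python
import math


def noise_variance(mantissa_bits, block_exp):
    """Equations (10) and (12): quantization noise energy of one block."""
    return 2.0 ** (-2 * mantissa_bits) / 12.0 * 2.0 ** (2 * block_exp)


def snr_db(mean_square, noise_var):
    """Equations (9) and (11): SNR of one block in dB."""
    return 10.0 * math.log10(mean_square / noise_var)


def weight_matrix_snr_db(row_mean_squares, row_block_exps, LW):
    """Equation (13): total signal energy over total noise energy across rows."""
    noise = [noise_variance(LW, e) for e in row_block_exps]
    return 10.0 * math.log10(sum(row_mean_squares) / sum(noise))


# Hypothetical rows: the second row has the larger block exponent.
row_mean_squares = [0.02, 0.5]
row_block_exps = [-2, 0]
per_row = [snr_db(ms, noise_variance(8, e))
           for ms, e in zip(row_mean_squares, row_block_exps)]
averaged = weight_matrix_snr_db(row_mean_squares, row_block_exps, LW=8)
print(per_row, averaged)
```

Because equation (13) sums energies rather than averaging dB values, the rows with the largest block exponents dominate the averaged SNR.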
4.2 Single Layer Error Analysis Model
Matrix multiplication is composed of vector inner products.
Therefore, investigating the vector inner product helps us understand
how errors accumulate in BFP represented matrix multiplication.
Consider two vectors of length $K$, $\vec{P}$ and $\vec{Q}$, which
are block formatted into $\vec{P}_b$ and $\vec{Q}_b$. We further
define $\vec{P}_e = \vec{P}_b - \vec{P}$ and
$\vec{Q}_e = \vec{Q}_b - \vec{Q}$ as the quantization errors. Then
the mean square of the block floating point represented inner
product, $\sigma_r^2$, is
$$\sigma_r^2 = E((\vec{P}_b \cdot \vec{Q}_b)^2) = E((\vec{P} \cdot \vec{Q})^2) + E((\vec{P}_e \cdot \vec{Q})^2) + E((\vec{P} \cdot \vec{Q}_e)^2) + E((\vec{P}_e \cdot \vec{Q}_e)^2) \tag{14}$$
Assuming that $\vec{P}_e$ and $\vec{Q}_e$ are statistically
independent, and ignoring the higher-order term
$E((\vec{P}_e \cdot \vec{Q}_e)^2)$, we have
$$\sigma_r^2 = E((\vec{P} \cdot \vec{Q})^2) + E((\vec{P}_e \cdot \vec{Q})^2) + E((\vec{P} \cdot \vec{Q}_e)^2) = \frac{1}{K} \left( 1 + \frac{\|\vec{P}_e\|^2}{\|\vec{P}\|^2} + \frac{\|\vec{Q}_e\|^2}{\|\vec{Q}\|^2} \right) \cdot \|\vec{P}\|^2 \cdot \|\vec{Q}\|^2 \tag{15}$$
where
$$\|\vec{P}\|^2 = \sum_{k=1}^{K} P_k^2, \qquad \|\vec{Q}\|^2 = \sum_{k=1}^{K} Q_k^2$$
[Figure 2 (diagram not reproduced): W and I each pass through an
FP2BFP block, feed the fixed point matrix multiplication, and the
result passes through a BFP2FP block to give O. FP2BFP derives the
block exponent εP of a block P, performs block based aligning and
round off, and outputs P′; BFP2FP restores P = M′P · 2^εP. In the
fixed point matrix multiplication, operands of widths [LI − 1 : 0]
and [LW − 1 : 0] yield products of width [LW + LI − 1 : 0], which are
summed in an accumulator of width [LW + LI − 1 + S : 0].]
Figure 2: Theoretical data flow of block floating point. The weight
matrix and input matrix are block formatted individually, and then
matrix multiplication is done via fixed-point accumulators and
multipliers. In this figure, LI and LW both include the sign bit.
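The ratios $\frac{\|\vec{P}_e\|^2}{\|\vec{P}\|^2}$ and
$\frac{\|\vec{Q}_e\|^2}{\|\vec{Q}\|^2}$, denoted as $\eta_P$ and
$\eta_Q$, are the noise-to-signal ratios (NSR) of $\vec{P}_b$ and
$\vec{Q}_b$, which can be derived from the SNR, e.g.
$$\eta_P = 10^{-\mathrm{SNR}_P / 10}$$
where $\mathrm{SNR}_P$ has been discussed in equation (9). Then the
NSR of the inner product is
$$\eta_r = \frac{\sigma_r^2 - E((\vec{P} \cdot \vec{Q})^2)}{E((\vec{P} \cdot \vec{Q})^2)} = \eta_P + \eta_Q \tag{16}$$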
Since $o_{mn} = \vec{w}_m^T \cdot \vec{i}_n$, we can use equation
(15) to calculate its NSR. Further, when calculating the average NSR
of $O$, we assume that the $\vec{w}_m^T$ are independent and
identically distributed, and likewise the $\vec{i}_n$, so the NSRs of
$\vec{w}_m^T$ and $\vec{i}_n$ can be replaced with the NSRs of $W'$
and $I'$. Thus the average NSR of $O$, denoted as $\eta_O$, is
$$\eta_O = \eta_{I'} + \eta_{W'} \tag{17}$$
where $\eta_{I'}$ and $\eta_{W'}$ are the NSRs of the input matrix
and weight matrix. Substituting equation (16), the SNR of the output
matrix is
$$\mathrm{SNR}_O = -10 \cdot \log_{10} \eta_O = \mathrm{SNR}_{I'} + \mathrm{SNR}_{W'} - 10 \cdot \log_{10} \left( 10^{\mathrm{SNR}_{I'}/10} + 10^{\mathrm{SNR}_{W'}/10} \right) \tag{18}$$
where $\mathrm{SNR}_{I'}$ and $\mathrm{SNR}_{W'}$ have been discussed
in equations (9) and (13). Equation (18) is our single layer error
analysis model.
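
The single layer model is easy to evaluate numerically. The Python sketch below computes the output SNR in two ways, by adding the NSRs as in equation (17) and by the closed form of equation (18); the function names are hypothetical.

```python
import math


def nsr_from_snr_db(snr_db):
    """Convert an SNR in dB to a noise-to-signal ratio."""
    return 10.0 ** (-snr_db / 10.0)


def output_snr_db(snr_input_db, snr_weight_db):
    """Equation (17): add the two NSRs, then convert back to dB."""
    eta_o = nsr_from_snr_db(snr_input_db) + nsr_from_snr_db(snr_weight_db)
    return -10.0 * math.log10(eta_o)


def output_snr_db_closed_form(snr_input_db, snr_weight_db):
    """Equation (18), written out directly."""
    total = 10.0 ** (snr_input_db / 10.0) + 10.0 ** (snr_weight_db / 10.0)
    return snr_input_db + snr_weight_db - 10.0 * math.log10(total)


# Single layer input and weight SNRs listed for conv1_1 in Table 4.
snr_out = output_snr_db(41.8047, 44.3538)
print(snr_out, output_snr_db_closed_form(41.8047, 44.3538))
```

With the single layer input and weight SNRs listed for conv1_1 in Table 4 (41.8047 dB and 44.3538 dB), both forms give about 39.88 dB, in agreement with the single layer output SNR listed there.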
4.3 Multi-Layers Error Analysis Model
In VGG-16, every convolution layer is followed by a ReLU layer, and
the output of the ReLU is the input of the next convolution layer. To
simplify our model, we assume that the errors are uniformly
distributed over the negative and positive output feature maps, and
we then ignore the impact of the ReLU layer on the SNR. The
difference between the multi-layer model and the single layer model
is that the input feature maps of the multi-layer model already carry
errors, while those of the single layer model do not. Fortunately,
the quantization errors are uniformly distributed over the input
signals and the inherited errors. Hence, we can use the single layer
model to calculate the newly generated error, and then use the SNR of
the last layer to distinguish the carried error from the signal.
Let $\eta_1$ and $\eta_2$ denote the output NSR of the last layer and
the NSR of the block formatted input feature maps. $E(Y^2)$,
$\sigma_1^2$ and $\sigma_2^2$ are the energy of the signal, the
energy of the error inherited from the last layer and the energy of
the quantization error. Based on equations (9) and (16),
$$\eta_2 = \frac{\sigma_2^2}{E(Y^2) + \sigma_1^2} \tag{19}$$
where $\sigma_1^2 = \eta_1 \cdot E(Y^2)$ and $\sigma_2^2$ is derived
from equation (8). The overall NSR $\eta$ of this input feature map
is then
$$\eta = \frac{\eta_2 (E(Y^2) + \eta_1 E(Y^2))}{E(Y^2)} = \eta_2 + \eta_1 \eta_2 \tag{20}$$
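
Equations (19) and (20) can be traced with a few lines of Python. The sketch follows the two equations exactly as written; the numeric inputs are hypothetical and `multilayer_nsr` is not a name used by the authors.

```python
def multilayer_nsr(signal_energy, eta1, sigma2_sq):
    """Evaluate equations (19) and (20) as written.

    signal_energy: E(Y^2), energy of the error-free signal
    eta1:          output NSR of the previous layer
    sigma2_sq:     energy of the new quantization error, from equation (8)
    """
    sigma1_sq = eta1 * signal_energy                 # inherited error energy
    eta2 = sigma2_sq / (signal_energy + sigma1_sq)   # equation (19)
    eta = eta2 * (signal_energy + eta1 * signal_energy) / signal_energy  # equation (20)
    return eta2, eta


# Hypothetical energies chosen so every intermediate value is easy to follow.
eta2, eta = multilayer_nsr(signal_energy=4.0, eta1=0.25, sigma2_sq=0.5)
print(eta2, eta)
```

In this sketch `eta` equals `eta2 + eta1 * eta2`, which is the new quantization error energy expressed relative to the error-free signal energy E(Y^2).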
4.4 Deviation of Error Analysis Model
Correlation between Filters and Input Feature Maps We assumed that
weights and input feature maps are statistically independent to
simplify our single layer error analysis model. However, when weights
and input feature maps are strongly correlated, the SNR rises,
because the noise is independent of the weights while the signal is
not. In this case, the measured SNR deviates from our model. Another
indication of strong correlation is that strongly correlated layers
generate more large values than other layers, since filters extract
the targeted features from receptive fields: the higher the degree of
coincidence, the more large values tend to be generated.
Model                   LW   LI = 6    LI = 7    LI = 8    LI = 9
VGG-16 top-1            6    0.3096    0.1576    0.1246    0.12
                        7    0.185     0.0268    0.003     0.0022
                        8    0.1772    0.0168    0.0002    -0.0008
                        9    0.1764    0.0166    -0.0002   -0.0018
GoogLeNet loss1 top-1   6    0.022     0.0126    0.0122    0.0096
                        7    0.0102    0.0004    0.0014    0.0012
                        8    0.0036    -0.0012   -0.0008   -0.0004
                        9    0.0078    -0.002    -0.0004   -0.0012
GoogLeNet loss2 top-1   6    0.0198    0.0138    0.0118    0.01
                        7    0.012     0.004     0.0014    0.0008
                        8    0.0156    0.004     0.0008    0.0008
                        9    0.014     0.0002    0.0018    0.0008
GoogLeNet loss3 top-1   6    0.0272    0.0094    0.0088    0.0072
                        7    0.0172    0.0028    0.0014    -0.0004
                        8    0.017     0.0064    0.0014    0.003
                        9    0.014     0.0032    0.0004    0.0012
ResNet-18 top-1         6    0.184     0.0584    0.0518    0.0506
                        7    0.125     0.019     0.008     0.0052
                        8    0.1228    0.012     0.0026    0
                        9    0.1134    0.01      -0.0006   0
ResNet-50 top-1         6    0.1038    0.0348    0.0224    0.0186
                        7    0.0724    0.0128    0.0064    0.0024
                        8    0.0664    0.0074    0.0008    -0.0022
                        9    0.058     0.0084    0.0028    0.0004

Model                   LW   LI = 3    LI = 4    LI = 5    LI = 6
mnist                   3    0.0123    0.0068    0.0053    0.0045
                        4    0.0051    0.0010    0.0005    -0.0002
                        5    0.0054    0.0006    0.0001    -0.0002
                        6    0.0051    0.0010    0.0004    -0.0005

Model                   LW   LI = 5    LI = 6    LI = 7    LI = 8
cifar10                 5    0.0219    0.0103    0.0105    0.0087
                        6    0.0145    0.0034    0.0014    0.0015
                        7    0.0169    0.0042    0.0028    0.0014
                        8    0.0166    0.0014    0.002     -0.0009
Table 3: Drop of accuracy in VGG-16, GoogLeNet, ResNet-18, ResNet-50,
cifar10 and mnist. LW and LI represent the block mantissa bit length
(including the sign bit) of W′ and I′, respectively.
ReLU Layer ReLU (Glorot, Bordes, and Bengio 2011) is a nonlinearity
layer, which drops values smaller than zero and keeps positive values
as they are. In VGG-16, each convolution layer is followed by a ReLU
layer, whose outputs are passed to the following convolution layer or
max pooling layer. In our multi-layer model, we use the SNR of the
last convolution layer's output as the SNR of the next convolution
layer's input matrix; thus the influence of the ReLU layer is
ignored. Further, our model works for any activation function whose
derivative is decreasing, because the output NSR of such a function
is never greater than its input NSR (Liu et al. 2016); see Lemma 1 of
that paper for a detailed proof.
Pooling Layer VGG-16 uses a max pooling layer after every several
convolution layers to reduce the number of parameters and to control
overfitting. A max pooling layer extracts the largest number in each
2 × 2 receptive field with stride 2. It seems reasonable to assume
that a pooling layer always improves the overall SNR: if a larger
magnitude is the sum of products of larger multipliers, then, because
larger magnitudes have higher SNR when represented in block floating
point, the SNR of the largest number in the 2 × 2 field is higher
than the average SNR of the field. However, this is not necessarily
true, as large positive and negative products may offset each other
and yield a rather small value, while smaller products may accumulate
to a large value that is selected as the output. Because the impact
of the pooling layer on the SNR is uncertain, we take the output SNR
of the pooling layer as the input SNR of the next layer.
5 Experiments
5.1 Accuracy Verification of BFP CNN
The magnitude of the decrease in accuracy is one of the most
important criteria for measuring the performance of CNN accelerators.
We verified BFP arithmetic on several typical deep neural networks,
including VGG-16, GoogLeNet, ResNet-18 and ResNet-50; smaller
convolutional neural networks, such as the mnist and cifar10 models,
were also tested.
Experiment Setup Caffe (Jia et al. 2014) is a popular deep learning
framework that turns convolution operations into matrix
multiplications. It is convenient to apply BFP in CNNs based on
Caffe, as we only need to rewrite the convolution function in Caffe
following Figure 2. Specifically, the input feature maps and weights
are block formatted, then the matrix multiplication is performed, and
finally the output feature map is transformed back to floating point
representation, since O′ holds a different block exponent for each
row vector because the weights are block formatted row by row. It
should be pointed out that the ReLU and pooling layers remained
unchanged, but this has no impact on our test as these two layers do
not involve numeric computation.
Results The results are shown in Table 3. LW and LI denote the bit
lengths of the weight and input mantissas after block formatting,
including the sign bit. For deep neural networks, when LW and LI are
no less than 8, the drop in accuracy is less than 0.3%. In addition,
4-bit and 7-bit mantissas are sufficient for mnist and cifar10,
respectively. In the experiments, we used the original models without
any retraining and block formatted them with different mantissa
lengths. Thus the accuracy differences are introduced by the
quantization errors alone.
It is also noteworthy that the accuracy is more sensitive to LI than
to LW. This is attributed to two factors: the block size of I′ is
much larger than that of W′, and the dynamic range of the input
feature maps is much larger than that of the weights.
In conclusion, when designing FPGA-based CNN accelerators, BFP is a
well-suited numeric format, as it eliminates the complex
floating-point computations in the convolution operations while
maintaining high classification accuracy. Further, because a
BFP-oriented accelerator does not require retraining, the cost of
adopting BFP is low. Our experiments revealed that BFP can be used in
a variety of convolutional neural networks without specific
reconfiguration.
Layer     Item     ex SNR    single SNR   multi SNR
conv1_1   input    40.1236   41.8047      –
          weight   43.9925   44.3538      –
          output   37.5638   39.8845      –
          ReLU     37.5641   –            –
conv1_2   input    27.2022   26.9376      26.7227
          weight   36.5345   37.3569      37.3569
          output   35.1682   26.5601      26.3628
          ReLU     35.1707   –            –
pool1     max      36.3581   –            –
conv2_1   input    27.5767   29.3567      28.5668
          weight   34.1054   35.347       35.347
          output   30.0439   28.3815      27.7393
          ReLU     30.0446   –            –
conv2_2   input    23.7616   25.7545      23.6242
          weight   33.7565   34.9562      34.9562
          output   25.3109   25.2616      23.3158
          ReLU     25.311    –            –
pool2     max      26.2151   –            –
conv3_1   input    23.9214   27.9558      23.9885
          weight   31.3016   32.899       32.899
          output   25.2734   26.7488      23.4634
          ReLU     25.2733   –            –
conv3_2   input    21.4743   24.109       20.7639
          weight   30.7485   32.1746      32.1746
          output   23.1478   23.479       20.4609
          ReLU     23.1478   –            –
conv3_3   input    20.1885   24.2099      18.9325
          weight   29.8594   31.3544      31.3544
          output   21.0608   23.4435      18.6907
          ReLU     21.0608   –            –
pool3     max      21.7996   –            –
conv4_1   input    20.7986   25.7334      20.3252
          weight   31.0773   32.5038      32.5038
          output   22.9078   24.9042      20.0699
          ReLU     22.9077   –            –
conv4_2   input    19.3041   23.882       18.5602
          weight   31.0578   32.3566      32.3566
          output   21.9051   23.305       18.3827
          ReLU     21.9049   –            –
conv4_3   input    18.2669   24.0675      17.3443
          weight   30.2625   31.6326      31.6326
          output   22.4316   23.3665      17.1855
          ReLU     22.4312   –            –
pool4     max      18.8514   –            –
conv5_1   input    18.5113   24.4103      17.786
          weight   31.0754   32.242       32.242
          output   22.149    23.7479      17.6331
          ReLU     22.1483   –            –
conv5_2   input    18.4841   23.5987      16.6529
          weight   33.0316   33.9193      33.9193
          output   22.2687   23.2129      16.5772
          ReLU     22.2369   –            –
conv5_3   input    18.1074   24.1601      15.8788
          weight   32.4689   33.654       33.654
          output   23.6306   23.6976      15.7846
          ReLU     23.6191   –            –
pool5     max      17.7955   –            –
Table 4: Experimental and theoretical SNR, in dB. In this table, “ex
SNR”, “single SNR” and “multi SNR” respectively represent the
experimental SNR, the SNR calculated by the single layer model and
the SNR calculated by the multi-layer model.
Figure 3: Energy distribution comparison of layers “conv1_1”,
“conv1_2”, “conv2_1” and “conv2_2”. The horizontal axis represents
the normalized magnitude from 0.8 to 1, and the area shows the
comparison of each layer's normalized energy.
5.2 Error Analysis Model Verification
Experiment Setup To verify the error analysis model, we defined the
floating point represented numbers as signals, and the differences
between the floating point represented numbers and the BFP
represented numbers as errors. We then ran VGG-16 on ILSVRC2012 for
20 iterations with batch size set to 50 to gather data, such as the
output of every layer and the input feature maps and weights of each
convolution layer. These data are stored in separate files in binary
format, from which we calculate the signal energy and error energy to
derive the experimental SNR.
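
The experimental SNR defined above is the ratio of signal energy to error energy in dB. A minimal Python sketch, assuming the floating point values and their BFP counterparts are already loaded as two equally long lists (the function name and the sample data are hypothetical):

```python
import math


def experimental_snr_db(reference, approximation):
    """SNR in dB: floating point values are the signal, differences the error."""
    signal_energy = sum(x * x for x in reference)
    error_energy = sum((x - y) ** 2 for x, y in zip(reference, approximation))
    if error_energy == 0.0:
        return math.inf  # the approximation is exact
    return 10.0 * math.log10(signal_energy / error_energy)


reference = [3.0, 4.0]        # floating point represented numbers
approximation = [3.0, 4.5]    # BFP represented numbers (made-up values)
print(experimental_snr_db(reference, approximation))
```

Here the signal energy is 25 and the error energy is 0.25, a ratio of 100, so the sketch reports an SNR of 20 dB.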
Results As shown in Table 4, the theoretical analysis agrees well
with the experimental data: the largest difference between them is
less than 8.9 dB, which is close enough to guide hardware design. It
is worth mentioning that the earlier assumption about the ReLU layer
proves reasonable. Specifically, the SNR of the ReLU output is
consistent with its input SNR, which indicates that the output of the
convolution layer is evenly distributed over the positive and
negative regions. The impact of the pooling layer on the SNR is also
consistent with what we assumed.
We calculated the energy distribution of layer “conv1_2”, as it shows
the largest deviation; layers “conv1_1”, “conv2_1” and “conv2_2” were
also tested for reference. Figure 3 reveals that, compared with the
other layers, the energy of layer “conv1_2” is more concentrated at
large values, which indicates stronger correlation.
6 Conclusion
In this paper, we designed a CNN accelerator that replaces
the floating-point representation with the BFP representation.
With BFP, the burdensome floating-point arithmetic in the
convolution layers, which make up the majority of the overall
CNN architecture, is replaced by light fixed-point arithmetic.
With an 8-bit mantissa, the worst accuracy drop of the tested
deep neural networks is less than 0.3% without retraining. In
addition, we developed an analytical model of the NSR upper
bound whose largest deviation is less than 8.9dB, which
provides guidance for hardware design.
7 Acknowledgement
This work is supported by the National Key Research and
Development (2016YFB0200505).
References
Chen, Y.-H.; Krishna, T.; Emer, J. S.; and Sze, V. 2017. Eyeriss:
An energy-efficient reconfigurable accelerator for deep convo-
lutional neural networks. IEEE Journal of Solid-State Circuits
52(1):127–138.
Chunsheng, M.; Zhenyu, L.; Yue, N.; Xiangyang, J.; Wei, Z.;
and Dongsheng, W. 2017. A 200MHz 202.4GFLOPS@10.8W
VGG16 accelerator in Xilinx VX690T. In IEEE Global Conference
on Signal and Information Processing, accepted. IEEE.
Ciregan, D.; Meier, U.; and Schmidhuber, J. 2012. Multi-
column deep neural networks for image classification. In Com-
puter Vision and Pattern Recognition (CVPR), 2012 IEEE Con-
ference on, 3642–3649. IEEE.
Glorot, X.; Bordes, A.; and Bengio, Y. 2011. Deep sparse
rectifier neural networks. In Proceedings of the Fourteenth In-
ternational Conference on Artificial Intelligence and Statistics,
315–323.
Goldberg, Y. 2016. A primer on neural network models for
natural language processing. J. Artif. Intell. Res.(JAIR) 57:345–
420.
Gupta, S.; Agrawal, A.; Gopalakrishnan, K.; and Narayanan,
P. 2015. Deep learning with limited numerical precision. In
Proceedings of the 32nd International Conference on Machine
Learning (ICML-15), 1737–1746.
Han, S.; Pool, J.; Tran, J.; and Dally, W. 2015. Learning both
weights and connections for efficient neural network. In Ad-
vances in Neural Information Processing Systems, 1135–1143.
Han, S.; Mao, H.; and Dally, W. J. 2015. Deep compression:
Compressing deep neural networks with pruning, trained
quantization and Huffman coding. arXiv preprint arXiv:1510.00149.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual
learning for image recognition. In Proceedings of the IEEE
conference on computer vision and pattern recognition, 770–
778.
Hill, P.; Zamirai, B.; Lu, S.; Chao, Y.-W.; Laurenzano, M.;
Samadi, M.; Papaefthymiou, M.; Mahlke, S.; Wenisch, T.;
Deng, J.; et al. 2016. Rethinking numerical representations
for deep neural networks.
Jia, Y.; Shelhamer, E.; Donahue, J.; Karayev, S.; Long, J.; Gir-
shick, R.; Guadarrama, S.; and Darrell, T. 2014. Caffe: Convo-
lutional architecture for fast feature embedding. arXiv preprint
arXiv:1408.5093.
Jouppi, N. P.; Young, C.; Patil, N.; Patterson, D.; Agrawal, G.;
Bajwa, R.; Bates, S.; Bhatia, S.; Boden, N.; Borchers, A.; et al.
2017. In-datacenter performance analysis of a tensor processing
unit. arXiv preprint arXiv:1704.04760.
Kalliojarvi, K., and Astola, J. 1996. Roundoff errors in block-
floating-point systems. IEEE Transactions on Signal Process-
ing 44(4):783–790.
Karam, R.; Paul, S.; Puri, R.; and Bhunia, S. 2017. Memory-
centric reconfigurable accelerator for classification and ma-
chine learning applications. ACM Journal on Emerging Tech-
nologies in Computing Systems (JETC) 13(3):34.
Kim, Y. 2014. Convolutional neural networks for sentence
classification. arXiv preprint arXiv:1408.5882.
Li, H.; Fan, X.; Jiao, L.; Cao, W.; Zhou, X.; and Wang, L. 2016.
A high performance FPGA-based accelerator for large-scale
convolutional neural networks. In Field Programmable Logic and
Applications (FPL), 2016 26th International Conference on, 1–9.
IEEE.
Liu, Z.; Yu, X.; Gao, Y.; Chen, S.; Ji, X.; and Wang, D. 2016.
CU partition mode decision for HEVC hardwired intra encoder
using convolution neural network. IEEE Transactions on Image
Processing 25(11):5088–5103.
Mellempudi, N.; Kundu, A.; Das, D.; Mudigere, D.; and Kaul,
B. 2017. Mixed low-precision deep learning inference using
dynamic fixed point. arXiv preprint arXiv:1701.08978.
Ovtcharov, K.; Ruwase, O.; Kim, J.-Y.; Fowers, J.; Strauss, K.;
and Chung, E. S. 2015. Accelerating deep convolutional neu-
ral networks using specialized hardware. Microsoft Research
Whitepaper 2(11).
Page, A., and Mohsenin, T. 2016. FPGA-based reduction
techniques for efficient deep neural network deployment. In
Field-Programmable Custom Computing Machines (FCCM),
2016 IEEE 24th Annual International Symposium on, 200–200.
IEEE.
Parashar, A.; Rhu, M.; Mukkara, A.; Puglielli, A.; Venkatesan,
R.; Khailany, B.; Emer, J.; Keckler, S. W.; and Dally, W. J.
2017. SCNN: An accelerator for compressed-sparse convolutional
neural networks. In Proceedings of the 44th Annual International
Symposium on Computer Architecture, 27–40. ACM.
Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma,
S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; Berg,
A. C.; and Fei-Fei, L. 2015a. ImageNet Large Scale Visual
Recognition Challenge. International Journal of Computer Vi-
sion (IJCV) 115(3):211–252.
Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma,
S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al.
2015b. ImageNet large scale visual recognition challenge.
International Journal of Computer Vision 115(3):211–252.
Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.;
Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.;
Panneershelvam, V.; Lanctot, M.; et al. 2016. Mastering the
game of Go with deep neural networks and tree search. Nature
529(7587):484–489.
Simonyan, K., and Zisserman, A. 2014. Very deep con-
volutional networks for large-scale image recognition. arXiv
preprint arXiv:1409.1556.
Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov,
D.; Erhan, D.; Vanhoucke, V.; and Rabinovich, A. 2015. Going
deeper with convolutions. In The IEEE Conference on Com-
puter Vision and Pattern Recognition (CVPR).
Zhang, C.; Li, P.; Sun, G.; Guan, Y.; Xiao, B.; and Cong,
J. 2015. Optimizing FPGA-based accelerator design for deep
convolutional neural networks. In Proceedings of the 2015
ACM/SIGDA International Symposium on Field-Programmable
Gate Arrays, 161–170. ACM.
