High-Accuracy and Fault Tolerant Stochastic Inner Product Design by Haselmayr, Werner et al.
Werner Haselmayr, Member, IEEE, Daniel Wiesinger, Michael Lunglmayr, Member, IEEE
Abstract—In this work, we present a novel inner product
design for stochastic computing. Stochastic computing is an
emerging computing technique that encodes a number in the
probability of observing a one in a random bit stream. This
leads to reduced hardware costs and high error tolerance.
The proposed inner product design is based on a two-line
bipolar encoding format and applies sequential processing of
the input in a central accumulation unit. Sequential processing
significantly increases the computation accuracy, since it allows
for preliminary cancelation of carry bits. Moreover, the central
accumulation unit gives a much better scalability compared to
conventional adder tree approaches. We show that the proposed
inner product design outperforms a state-of-the-art design in
terms of hardware costs for high accuracy requirements and
fault tolerance.
Index Terms—Inner product, non-scaled adder, stochastic com-
puting, two-line representation
I. INTRODUCTION
STOCHASTIC computing is an emerging computing technique that encodes a real-valued number into a random
bit stream [1], representing the number as the probability of
observing the bit one. This representation allows for a low-
complexity implementation of basic arithmetic operations, us-
ing only a few logic gates. For instance, the complex multiplier
used in conventional binary computing can be replaced by
an AND gate in stochastic computing. Moreover, compared to
the binary radix representation, the stochastic representation
has a high degree of error tolerance [2].
Stochastic computing has been successfully applied in
many areas, including decoding of error detection codes,
control systems, image processing, filter design and neural
networks (e.g., [2], [3] and references therein). In many of
these applications the inner product is a main building block
and, thus, an implementation with low hardware effort and
high accuracy is desired. In particular, in neural networks inner
products are used to model the operation of the neurons [4].
Moreover, the FIR filter operation [5] and the DFT/FFT
computation [6] are based on the inner product of two vectors.
A straightforward implementation of the inner product using
an adder tree with multiplexer-based scaled adders scales
down the result, causing severe accuracy loss, especially for
large vectors. To overcome the scaling problem, stochastic-to-binary
conversion was applied in [7], [8]. However, this
comes with the drawbacks of such a conversion: additional
hardware costs and lower fault tolerance due to the intermediate
binary representation. Recently, two approaches have been
proposed, addressing the scaling issue within the stochastic
domain [5], [6]. In [5], an adder tree implementation with
multiplexer-based adders using uneven weights is presented.
This reduces the downscaling factor or even scales up the
result for certain input values. Unfortunately, the computation
of the weights is very complex, and thus they are often
pre-calculated, requiring at least one constant input vector.
Moreover, for large vectors the accuracy of the result still
degrades due to the growing scaling factor. In [6], the scaled
adders are replaced by counter-based non-scaled adders using
the two-line signed magnitude format [9]. It was shown that,
when applied to a DFT/FFT implementation, it achieves a
significantly higher accuracy than the approach in [5]. Although
the approach in [6] seems very promising, there are
still some shortcomings. The hardware effort is significantly
higher compared to [5], since the non-scaled adder requires
more logic gates than a simple scaled adder. Moreover, the
accuracy of non-scaled adders is based on the preservation of
carry bits, which can be improved by increasing the counter
length. Hence, to prevent overflow errors in an adder tree all
counter lengths must be increased, leading to a poor scalability
in terms of hardware effort.
In this paper, we present a novel stochastic inner product
design. We employ the two-line bipolar format [1], enabling a
simpler design and achieving higher accuracy compared to [6].
However, we propose simple conversion circuits between the
two-line bipolar and the signed magnitude format used in [6],
making the proposed implementation also applicable for the
signed magnitude format. Instead of an adder tree we use
sequential processing of the input in a central accumulation
unit, which is realized by a shift-register-based non-scaled
adder. The use of a central accumulation unit significantly
increases the scalability compared to [6]. Moreover, sequential
processing together with the two-line bipolar representation
allows for preliminary cancelation of carry bits. This approach
reduces the probability of an overflow in the carry register and
gives high-accuracy results.
II. STOCHASTIC COMPUTING FORMATS
In this section, we provide an overview of the single- and two-line
encoding formats used in stochastic computing. Single-line
formats encode a desired number in a single stochastic
stream, while two-line formats use two streams.
A. Single-Line Encoding Formats
Unipolar Format: In the unipolar format, the value of a
deterministic number x ∈ [0, 1] is encoded in a stochastic bit
stream X of length L, with [1]
$$x = \frac{1}{L} \sum_{l=1}^{L} X[l], \tag{1}$$
where X[l] ∈ {0, 1} denotes the lth bit of the bit stream X.
The precision (representation resolution) of the unipolar for-
mat is given by 1/L. Based on this format, basic arith-
metic operations can be implemented using simple logic
gates (e.g., AND gate for multiplication) [1].
Bipolar Format: In contrast to the unipolar format, the
bipolar format can also represent negative numbers. This is ac-
complished through a different interpretation of the stochastic
stream. In this case a number x ∈ [−1, 1] can be represented
by a bit stream X of length L by [1]
$$x = \frac{1}{L} \sum_{l=1}^{L} \left( 2X[l] - 1 \right). \tag{2}$$
arXiv:1808.06500v2 [cs.ET] 20 Nov 2018
TABLE I
TLB AND SM FORMAT CONVERSION

X^{SM/TLB}[l] | Xs[l] | Xm[l] | Xp[l] | Xn[l]
−1            | 1     | 1     | 0     | 1
 1            | 0     | 1     | 1     | 0
 0            | 0/1   | 0     | 0/1   | 0/1
The precision of the bipolar format is given by 2/L, i.e. half
the resolution of the unipolar format. Similar to the unipolar
format, the circuits for basic arithmetic operations are very
simple [1].
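As a minimal sketch of (1) and (2), the following Python snippet decodes a bit stream under the unipolar and the bipolar interpretation and multiplies two unipolar streams with a bitwise AND. The function names are illustrative and not part of the original work. The AND-based product only approximates the true product, and only for long, statistically independent streams.

```python
def decode_unipolar(bits):
    """Value of a unipolar stream, see (1): fraction of ones."""
    return sum(bits) / len(bits)


def decode_bipolar(bits):
    """Value of a bipolar stream, see (2): mean of 2*X[l] - 1."""
    return sum(2 * b - 1 for b in bits) / len(bits)


def multiply_unipolar(x_bits, y_bits):
    """Unipolar multiplication: one AND operation per bit position."""
    return [a & b for a, b in zip(x_bits, y_bits)]


x_bits = [1, 0, 1, 1]  # three ones out of four bits
y_bits = [1, 1, 0, 1]
print(decode_unipolar(x_bits))                              # 0.75
print(decode_bipolar(x_bits))                               # 0.5
print(decode_unipolar(multiply_unipolar(x_bits, y_bits)))   # 0.5
```

The same four bits represent 0.75 in the unipolar format and 0.5 in the bipolar format. With such short streams the AND product (0.5) deviates visibly from the exact product 0.75 · 0.75.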
B. Two-Line Encoding Formats
Signed Magnitude Format: In the signed magnitude (SM)
format, the sign and magnitude information of a num-
ber x ∈ [−1, 1] is carried by the bit streams Xs and Xm,
respectively. Hence, x can be represented as [9]
$$x = \frac{1}{L} \sum_{l=1}^{L} \left( 1 - 2X_s[l] \right) X_m[l], \tag{3}$$
where Xs[l] and Xm[l] denote the lth bit of the bit streams Xs
and Xm, respectively. The SM representation achieves the
same resolution as the unipolar format, i.e. 1/L, while maintaining
the same range as the bipolar format. Although
the hardware effort for basic arithmetic operations is higher
compared to the unipolar and bipolar format [9], it enables an
efficient implementation of a non-scaled adder [6]. Non-scaled
adders¹ are very important if multiple successive additions are
required (e.g. in an adder tree) since they avoid downscaling.
Thus, non-scaled adders are crucial building blocks for an
inner product design.
Two-Line Bipolar Format: The two-line bipolar (TLB) for-
mat uses a different interpretation of the bit streams compared
to the SM format. In particular, a number x ∈ [−1, 1] is
interpreted as the difference between the numbers xp ∈ [0, 1]
and xn ∈ [0, 1], which are encoded as unipolar bit streams Xp
and Xn. Hence, x can be represented as [1]
$$x = \frac{1}{L} \sum_{l=1}^{L} \left( X_p[l] - X_n[l] \right), \tag{4}$$
where Xp[l] and Xn[l] denote the lth bit of the bit streams Xp
and Xn, respectively. The resolution of the TLB format is
given by 1/L. Similar to the SM format, the circuits for
the basic arithmetic operations are slightly more complex
than for the unipolar and bipolar format [1], but the TLB
format also enables an efficient non-scaled adder implemen-
tation (see Fig. 3). As discussed above, non-scaled adders
are an important building block for an efficient inner product
implementation.
Format Conversion: For the conversion between the SM
and TLB format we define (see (3))
$$X^{\mathrm{SM}}[l] = \left( 1 - 2X_s[l] \right) X_m[l], \tag{5}$$
and (see (4))
$$X^{\mathrm{TLB}}[l] = X_p[l] - X_n[l], \tag{6}$$
¹ To the best of our knowledge, so far no non-scaled adder has been proposed
for the bipolar format.
Fig. 1. Conversion circuits: (a) TLB to SM; (b) SM to TLB.
with $x = \frac{1}{L} \sum_{l=1}^{L} X^{\mathrm{SM/TLB}}[l]$. The pairs (Xs[l], Xm[l])
and (Xp[l], Xn[l]) jointly contribute to $X^{\mathrm{SM}}[l]$ and $X^{\mathrm{TLB}}[l]$,
respectively, and the elements of $X^{\mathrm{SM}}[l]$ and $X^{\mathrm{TLB}}[l]$ can only
be within the set {−1, 0, 1}. Tab. I summarizes these relations,
which can be used to derive conversion circuits between the
TLB and the SM format as shown in Fig. 1. For instance,
if Xp[l] = Xn[l] then Xm[l] = 0, and if Xp[l] ≠ Xn[l]
then Xm[l] = 1. Hence, the conversion circuit from the TLB
format to the magnitude stream of the SM format can be
realized through an XOR gate (see Fig. 1(a)).
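The relations in Tab. I can be sketched bit by bit in Python as follows. Only the XOR for the magnitude stream is stated explicitly in the text. The remaining gate choices (sign taken from the negative line, AND gates for the way back) are one possible reading of Tab. I and are assumptions, as are all function names.

```python
def tlb_to_sm(xp, xn):
    """Map one TLB bit pair (Xp, Xn) to an SM bit pair (Xs, Xm)."""
    xm = xp ^ xn  # magnitude is one only if exactly one line is set
    xs = xn       # assumed: sign follows the negative line (don't-care if xm == 0)
    return xs, xm


def sm_to_tlb(xs, xm):
    """Map one SM bit pair (Xs, Xm) to a TLB bit pair (Xp, Xn)."""
    xp = xm & (1 - xs)  # positive line: magnitude set, sign clear
    xn = xm & xs        # negative line: magnitude set, sign set
    return xp, xn


def sm_value(xs, xm):
    """Per-bit SM symbol, see (5)."""
    return (1 - 2 * xs) * xm


def tlb_value(xp, xn):
    """Per-bit TLB symbol, see (6)."""
    return xp - xn


# Every TLB pair keeps its symbol in {-1, 0, 1} after a round trip.
for xp in (0, 1):
    for xn in (0, 1):
        xs, xm = tlb_to_sm(xp, xn)
        print((xp, xn), "->", (xs, xm), "symbol", sm_value(xs, xm))
```

Because the symbol 0 has several valid encodings in both formats, a round trip preserves the symbol but not necessarily the bit pair.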
III. TLB BUILDING BLOCKS
In this section, we present the main building blocks for the
TLB format.
A. Bit Stream Generator
As discussed in Sec. II-B, the number x ∈ [−1, 1] is
defined by the difference between the numbers xp ∈ [0, 1] and
xn ∈ [0, 1], which are encoded into unipolar format bit streams.
Since only the difference matters, xp and xn are ambiguous and
the bit stream generation can be simplified if one of the values
is set to zero. Hence, we represent x as either xp (if x ≥ 0)
or xn (if x < 0), generate the corresponding stochastic bit
stream Xp or Xn and set the other bit stream to zero. The
conversion of xp to Xp or xn to Xn can be realized using a random
number generator and a comparator [1].
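The generator described above can be sketched in software as follows. Python's pseudo-random generator stands in for the hardware random source, and the names `generate_tlb_stream` and `decode_tlb` are hypothetical.

```python
import random


def generate_tlb_stream(x, length, rng=None):
    """Encode x in [-1, 1] as TLB streams (Xp, Xn) of the given length."""
    if rng is None:
        rng = random.Random()
    magnitude = abs(x)
    # Comparator: emit a one whenever the random sample falls below |x|.
    active = [1 if rng.random() < magnitude else 0 for _ in range(length)]
    idle = [0] * length  # the unused line is held at zero
    return (active, idle) if x >= 0 else (idle, active)


def decode_tlb(xp, xn):
    """Value of a TLB stream pair, see (4)."""
    return (sum(xp) - sum(xn)) / len(xp)


xp, xn = generate_tlb_stream(-0.3, 10_000, random.Random(1))
print(decode_tlb(xp, xn))  # close to -0.3, up to random fluctuation
```

Holding one line at zero means that a freshly generated stream never contains the pair (1, 1), so every bit position directly carries a symbol in {−1, 0, 1}.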
B. Multiplier
A first circuit of a multiplier for the TLB format has been
proposed in [1]. In Fig. 2 we present an alternative multiplier
circuit. The core circuit corresponds to the multiplier for the
SM format (XOR and AND gate) [6] and the interface corre-
sponds to the conversion circuit between TLB and SM format
shown in Fig. 1. It is important to note that the presented
circuit is only used for illustration purposes and a simpler
design can be obtained through logic optimization.
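A bit-level sketch of this multiplier structure is given below: a TLB-to-SM interface, an SM core (XOR of the signs, AND of the magnitudes) and an SM-to-TLB interface. The concrete interface gates are an assumed reading of Tab. I, the function names are illustrative, and no logic optimization is attempted.

```python
def tlb_multiply_bit(xp, xn, yp, yn):
    """Multiply one TLB bit pair by another and return (Zp, Zn)."""
    # TLB -> SM interface (sign taken from the negative line: an assumption)
    xs, xm = xn, xp ^ xn
    ys, ym = yn, yp ^ yn
    # SM core: XOR of the signs, AND of the magnitudes
    zs, zm = xs ^ ys, xm & ym
    # SM -> TLB interface
    return zm & (1 - zs), zm & zs


def tlb_multiply(x, y):
    """Multiply two TLB streams given as lists of (p, n) bit pairs."""
    return [tlb_multiply_bit(xp, xn, yp, yn)
            for (xp, xn), (yp, yn) in zip(x, y)]


# (+1) * (-1) = -1 and (-1) * (-1) = +1 on the symbol level
print(tlb_multiply_bit(1, 0, 0, 1))  # (0, 1)
print(tlb_multiply_bit(0, 1, 0, 1))  # (1, 0)
```

For every input combination the output symbol Zp − Zn equals the product of the input symbols, including the redundant encoding (1, 1) of the symbol 0.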
C. Non-Scaled Adder
To the best of our knowledge, only scaled adders have been
proposed for the TLB format (see [1]). Thus, we present
a novel shift-register-based non-scaled adder² as shown in
Fig. 3. The circuit consists of an update logic and carry
shift registers pc and nc, each of size M. The update logic
must consider many different cases, including the preservation
and cancellation of carry bits in the carry shift registers. For
example, consider the numbers x, y and their sum z.
According to (6), these numbers can be represented as streams
X[l], Y[l], Z[l] ∈ {−1, 0, 1}. However, since each element of
² Although a counter-based non-scaled adder can also be used, we propose a
shift-register-based implementation because of its higher fault tolerance [10].
Fig. 2. Circuit of the stochastic multiplier for the TLB format.
the result Z[l] can only be within the set {−1, 0, 1}, Z[l] is
not simply the sum of X[l] and Y[l]; the effect of the carry must
also be considered. If X[l] and Y[l] are both 1 or both −1, Z[l]
is 1 or −1, respectively, and a carry 1 (pc shift in) or −1 (nc shift
in) should be stored in the carry shift registers for the next
calculation. However, it is also possible that the current carry
bit cancels a stored carry bit from a previous calculation, e.g.
a generated carry 1 cancels a stored carry −1 (nc[1] = 1). The
update logic algorithm given in Alg. 1 takes into account all
these different scenarios. Please note that the shift in operation
denotes that a one (carry bit) is shifted into the register on
one side, while the shift out operation denotes that a zero is
shifted into the register on the other side, i.e. a carry bit is
shifted out of the register.
Algorithm 1 Update Logic for Non-Scaled Adder
Input: X, Y
Initialization: nc = 0, pc = 0
for l = 1 to L do
    if X[l] + Y[l] = 0 then
        Z[l] ← pc[1] − nc[1]; pc and nc shift out
    else if X[l] + Y[l] = 1 then
        Z[l] ← 1 − nc[1]; nc shift out
    else if X[l] + Y[l] = −1 then
        Z[l] ← pc[1] − 1; pc shift out
    else if X[l] + Y[l] = 2 then
        Z[l] ← 1
        if nc[1] = 1 then
            nc shift out
        else
            pc shift in
    else if X[l] + Y[l] = −2 then
        Z[l] ← −1
        if pc[1] = 1 then
            pc shift out
        else
            nc shift in
return Z
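A behavioral Python sketch of Alg. 1 is given below. As a modelling shortcut that is not part of the original design, each carry shift register is represented by the number of ones it currently holds, so pc[1] = 1 corresponds to a non-zero count and a shift in on a full register of size M loses the carry. The function name and the parameter `m` are illustrative.

```python
def non_scaled_add(x, y, m):
    """Add two streams with symbols in {-1, 0, 1} following Alg. 1.

    pc and nc count the ones held in the positive and negative carry
    shift registers of size m (a simplification of the registers).
    """
    pc = nc = 0
    z = []
    for xl, yl in zip(x, y):
        s = xl + yl
        pc1, nc1 = int(pc > 0), int(nc > 0)  # pc[1] and nc[1]
        if s == 0:
            z.append(pc1 - nc1)
            pc, nc = max(pc - 1, 0), max(nc - 1, 0)  # both shift out
        elif s == 1:
            z.append(1 - nc1)
            nc = max(nc - 1, 0)                      # nc shift out
        elif s == -1:
            z.append(pc1 - 1)
            pc = max(pc - 1, 0)                      # pc shift out
        elif s == 2:
            z.append(1)
            if nc1:
                nc -= 1                              # cancel a stored -1
            else:
                pc = min(pc + 1, m)                  # store carry, lost if full
        else:  # s == -2
            z.append(-1)
            if pc1:
                pc -= 1                              # cancel a stored +1
            else:
                nc = min(nc + 1, m)
    return z


print(non_scaled_add([1, 1, 0, 0], [1, 1, 0, 0], m=4))  # [1, 1, 1, 1]
print(non_scaled_add([1, 1, 0, 0], [1, 1, 0, 0], m=1))  # [1, 1, 1, 0]
```

In the first call both carries are preserved and released later, so the output sums to 4, the sum of all input symbols. With m = 1 the second carry overflows and one unit is lost, which illustrates why the register length controls the accuracy.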
IV. STOCHASTIC INNER PRODUCT DESIGN
In this section, we present the stochastic inner product
implementation. The architecture is shown in Fig. 4, including
a multiplier stage, input shift registers with carry canceling
and an accumulation stage. For the following description we
Fig. 3. Circuit of the shift-register-based stochastic non-scaled adder for the
TLB format.
consider the computation of the inner product between the
vectors $\mathbf{x} = [x_1, \ldots, x_K]^T$ and $\mathbf{y} = [y_1, \ldots, y_K]^T$ given by
$$z = \langle \mathbf{x}, \mathbf{y} \rangle = \mathbf{x}^T \mathbf{y} = \sum_{k=1}^{K} x_k y_k, \tag{7}$$
with xk, yk ∈ [−1, 1]. The numbers xk and yk are encoded in
the stochastic bit streams (Xp,k, Xn,k) and (Yp,k, Yn,k) using
the TLB format.
1) Multiplier Stage: This stage performs the multiplication
of the individual entries of the input vectors, i.e. vk = xkyk,
using K stochastic multipliers as shown in Fig. 2. Each
multiplier has the streams (Xp,k, Xn,k) and (Yp,k, Yn,k) at its
input and generates the streams (Vp,k, Vn,k). The individual
bits of the output streams are stored for one clock cycle of the
main clock in the input hold registers ph and nh, respectively.
These registers prevent intermediate results from propagating
from the main clock domain (multiplier stage) into the higher
clock domain (input shift registers, accumulation stage).
2) Input Shift Registers with Carry Canceling: Upon a
rising edge of the main clock, the elements of the input hold
registers ph and nh are copied into the input shift registers ps
and ns, following the mapping: ph[1]→ ps[1], ph[2]→ ps[2],
etc., and nh[1]→ ns[K], nh[2]→ ns[K−1], etc. This type of
mapping increases the probability that ones are canceled by the
so-called carry canceler (CC). The aim of the CC is to reduce
the number of ones that are shifted towards the accumulation
stage, which reduces the probability of an overflow of the
carry shift registers. Hence, this improves the accuracy of the
inner product calculation. The CC circuit is shown in Fig. 4,
where the outputs are zero if both inputs are one and otherwise
the outputs follow the inputs. The diagonal elements of the
input shift registers are connected by the CC (see Fig. 4) and,
thus, the value of the kth register element after the shifting
operation is given by
$$\begin{aligned} p_s[k] &= p_s[k+1]\,\overline{n_s[K-k+1]} \\ n_s[k] &= n_s[k+1]\,\overline{p_s[K-k+1]}, \end{aligned} \tag{8}$$
where $\overline{(\cdot)}$ denotes the negation operator. Please note that (8)
corresponds to the Boolean function of the CC. In particular,
the canceling procedure is as follows: Upon a rising edge of
the higher clock, the CC output is written into the next register
element. This corresponds either to shifting the value of the
previous element to the next element (normal shift operation)
or writing zeros (carry canceling).
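One shift step with carry canceling according to (8) can be sketched as follows. The registers are modelled as Python lists in which list index 0 holds element [1]. That a zero enters at element [K] is an assumption about the boundary that (8) leaves open, and the function name is illustrative.

```python
def shift_with_carry_canceling(ps, ns):
    """One high-clock step of the input shift registers, see (8).

    ps and ns are lists of K bits; list index 0 holds element [1].
    Returns the bit pair leaving towards the accumulation stage and
    the updated registers.
    """
    k_len = len(ps)
    out = (ps[0], ns[0])
    # Assumed boundary: a zero enters at element [K].
    ps_in = ps[1:] + [0]
    ns_in = ns[1:] + [0]
    new_ps, new_ns = [], []
    for i in range(k_len):
        # 0-based i is paper index k = i + 1; its diagonal partner
        # [K - k + 1] sits at 0-based index K - 1 - i.
        partner = k_len - 1 - i
        new_ps.append(ps_in[i] & (1 - ns[partner]))
        new_ns.append(ns_in[i] & (1 - ps[partner]))
    return out, new_ps, new_ns


# ps[2] and ns[3] meet in one CC and cancel each other.
print(shift_with_carry_canceling([0, 1, 0], [0, 0, 1]))
# ((0, 0), [0, 0, 0], [0, 0, 0])
```

A canceled pair removes one positive and one negative unit at the same time, so the difference between the number of ones in ps and ns, and hence the accumulated result, is unchanged.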
The elements ps[1] and ns[1] are sequentially shifted to the
accumulation stage using a higher clock compared to the main
clock. Please note that the input shift registers ps and ns are
shifted in opposite directions (see Fig. 4), which reduces the
probability that ones are shifted to the accumulation stage
compared to shifting in the same direction. This is because in
Fig. 4. Architecture of the novel stochastic inner product design.
Fig. 5. Average probability that ones are shifted towards the accumulation
stage, Pp = Pn, versus the input shift register length K for shifting in the
opposite and in the same direction, assuming that the
probability that ones are copied from the input hold registers to the input
shift registers is 0.5.
case of shifting in the same direction the CC has only an effect
after the first shifting operation. We validated the impact of the
shifting direction through bit-true simulations. To this end, we
evaluated the average probability that ones are shifted towards
the accumulation stage during sequential processing of the
entire input shift registers ps and ns, which is given by
$$P_x = \frac{1}{K} \sum_{j=1}^{K} \Pr\left( x_s^{(j)}[1] = 1 \right), \tag{9}$$
where x ∈ {p, n} and $\Pr(x_s^{(j)}[1] = 1)$ denotes the probability
that a one is shifted towards the accumulation stage
after the jth shift operation. The results are shown in Fig. 5,
confirming that for K ≥ 2 shifting in the opposite direction
should be preferred to shifting in the same direction.
3) Accumulation Stage: The accumulation stage cor-
responds to a shift-register-based non-scaled adder (see
Sec. III-C), which accumulates the output of the input shift
registers in the carry shift registers. Similar to the non-
scaled adder, the accumulation stage considers many different
scenarios, including the preservation and cancellation of carry
bits. The corresponding algorithm is given in Alg. 2.
It is important to note that the sequential processing of the
input shift registers must be finished upon the next rising edge
of the main clock. Then, the input shift registers are loaded
with the next inputs from the input hold registers. Moreover,
Algorithm 2 Update Logic for Accumulation Stage
Input: X = ps[1] − ns[1], C = pc[1] − nc[1]
for k = 1 to K do
    if X = 0 & C = 0 then
        pc and nc shift out
    else if X = 1 then
        if C = 0 then
            pc shift in (pc[1] = nc[1] = 0) or nc shift out (pc[1] = nc[1] = 1)
        else if C = −1 then
            nc shift out
        else
            pc shift in
    else if X = −1 then
        if C = 0 then
            nc shift in (pc[1] = nc[1] = 0) or pc shift out (pc[1] = nc[1] = 1)
        else if C = 1 then
            pc shift out
        else
            nc shift in
    Shift ps and ns
    X = ps[1] − ns[1]
the entries pc[1] and nc[1] of the carry shift registers are shifted
to the output flip-flops corresponding to the lth bit in the output
stochastic streams, i.e. (Zp[l], Zn[l]).
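The accumulation stage can be sketched behaviorally in Python as follows. Several simplifications are assumptions of this sketch rather than properties of the original circuit: the carry shift registers are modelled by the number of ones they hold, the carry canceler of the input shift registers is omitted, and moving pc[1] and nc[1] to the output is read as removing them from the registers. All function names are illustrative.

```python
def accumulate_cycle(inputs, pc, nc, m):
    """Process one main-clock cycle of the accumulation stage (Alg. 2).

    inputs: the K symbols X = ps[1] - ns[1] seen on successive
            high-clock edges, each in {-1, 0, 1}.
    pc, nc: number of ones in the positive/negative carry shift
            register of size m (assumed stand-in for the registers).
    Returns (z, pc, nc) with z = Zp[l] - Zn[l].
    """
    for x in inputs:
        pc1, nc1 = int(pc > 0), int(nc > 0)
        c = pc1 - nc1
        if x == 0 and c == 0:
            pc, nc = max(pc - 1, 0), max(nc - 1, 0)
        elif x == 1:
            if nc1:      # C = -1, or C = 0 with pc[1] = nc[1] = 1
                nc -= 1
            else:        # C = 1, or C = 0 with both registers empty
                pc = min(pc + 1, m)
        elif x == -1:
            if pc1:      # C = 1, or C = 0 with pc[1] = nc[1] = 1
                pc -= 1
            else:
                nc = min(nc + 1, m)
    # Assumed reading: pc[1] and nc[1] move to the output flip-flops.
    z = int(pc > 0) - int(nc > 0)
    return z, max(pc - 1, 0), max(nc - 1, 0)


def inner_product_stream(v_streams, m):
    """v_streams[k][l] in {-1, 0, 1} is symbol l of the k-th product."""
    pc = nc = 0
    z = []
    for column in zip(*v_streams):
        zl, pc, nc = accumulate_cycle(column, pc, nc, m)
        z.append(zl)
    return z


print(inner_product_stream([[1, 1, 0, 0], [1, 1, 0, 0]], m=4))  # [1, 1, 1, 1]
print(inner_product_stream([[1, 1, 0, 0], [1, 1, 0, 0]], m=2))  # [1, 1, 1, 0]
```

In the example two product streams of value 0.5 each are accumulated. With m = 4 the output stream has value 1, whereas with m = 2 one carry overflows and the result drops to 0.75.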
V. PERFORMANCE ANALYSIS
In this section, we compare the proposed inner product
design with the state-of-the-art design presented in [6] in terms
of resource utilization and fault tolerance for different accuracy
requirements. For the comparison, we only consider the inner
product calculation and omit the costs for the stochastic stream
generation and the back conversion, since they are similar for
both approaches.
We define the computation accuracy by the root mean
square error (RMSE) given by $\mathrm{RMSE} = \sqrt{\operatorname{mean}\left( |\hat{z} - z|^2 \right)}$,
where z denotes the true inner product result (double-precision
floating point) and $\hat{z}$ corresponds to the results of the particular
stochastic implementation. The accuracy is controlled by the
Fig. 6. Number of logic elements required for the novel and state-of-the-art
implementation versus the input vector length K. Blue, red and green curves
correspond to RMSE ≤ 0.1, RMSE ≤ 0.05 and RMSE ≤ 0.02.
Fig. 7. Number of registers required for the novel and state-of-the-art
implementation versus the input vector length K. Blue, red and green curves
correspond to RMSE ≤ 0.1, RMSE ≤ 0.05 and RMSE ≤ 0.02.
carry shift register length and the counter length for the
novel and the state-of-the-art design, respectively. For all
investigations we fixed the length of the stochastic stream
to $L = 10^4$.
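For reference, the accuracy measure can be computed as in the following sketch, which uses the usual squared-error definition of the RMSE. The function name and the example values are illustrative and do not reproduce any result of this work.

```python
import math


def rmse(estimates, references):
    """Root mean square error between stochastic and reference results."""
    squared = [(z_hat - z) ** 2 for z_hat, z in zip(estimates, references)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical values: two stochastic results against their references.
print(rmse([0.5, -0.2], [0.5, 0.2]))
```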
We determined the resource utilization for both implementa-
tions through synthesis for an Altera Cyclone IV EP4CE115
FPGA. Figs. 6 and 7 show the minimum number of logic
elements (combinational logic) and registers that are required
to achieve a certain computation accuracy. We observe from
Fig. 6 that the logic element utilization of the proposed
design is much better compared to the state-of-the-art design,
especially for large input vectors and if high computation
accuracy is required. Moreover, we observe from Fig. 7 that if
low accuracy is sufficient, the state-of-the-art approach outper-
forms the novel design in terms of register utilization. This is
because the approach in [6] requires no hold circuit at the input
(input hold registers) or sequential processing storage (input
shift registers). Interestingly, for the novel design the logic
element and register utilization is almost independent of the
accuracy requirements, while it increases for the state-of-the-
art implementation. This means that for the proposed design
the additional hardware effort (larger carry shift registers) to
achieve a better accuracy is insignificant.
Fig. 8 compares the fault tolerance of the novel and state-of-the-art
inner product design. To this end, we randomly flipped a
bit in the carry shift registers or the counters in the adder tree
with probability Pflip. This approach gives a good approximation
of the fault tolerance for the entire design, since failures in
the storage can also be interpreted as bit flips coming from the
combinational logic. We used input vectors of length K = 16
and started with the computation accuracy RMSE = 0.02. This
requires a carry shift register length of 6 and a counter length
Fig. 8. Robustness against bit flips of the novel and state-of-the-art inner
product design (RMSE versus bit flip probability Pflip).
of 4. We observe that the proposed design is much more
robust against bit flips than the state-of-the-art implementation.
This is mainly because we use a shift-register-based approach,
rather than a counter-based approach.
For the novel design it is important to note that although the
high clock domain can operate nearly at maximum platform
speed (short critical path), the main clock rate is reduced by a factor equal to the
input vector length K. However, this issue can be easily solved
through parallelization of the sequential processing step, using
multiple inner product cores.
VI. CONCLUSIONS
In this work, we proposed a novel stochastic inner product
design. In contrast to state-of-the-art adder tree implementations,
we performed the addition in a central accumulation
unit by applying sequential processing of the input. The central
accumulation unit increases the scalability and sequential pro-
cessing enables preliminary carry canceling which improves
the computation accuracy. Performance analysis revealed that
the proposed design significantly reduces the hardware costs
for high accuracy requirements and provides a high fault
tolerance compared to a state-of-the-art design.
REFERENCES
[1] B. R. Gaines, Stochastic Computing Systems. Boston, MA: Springer
US, 1969, pp. 37–172.
[2] A. Alaghi, W. Qian, and J. P. Hayes, “The promise and challenge
of stochastic computing,” IEEE Trans. Comput.-Aided Design Integr.
Circuits Syst., pp. 1–1, 2017.
[3] A. Alaghi and J. P. Hayes, “Survey of stochastic computing,” ACM
Trans. Embed. Comput. Syst., vol. 12, no. 2s, pp. 92:1–92:19, May 2013.
[4] V. T. Lee, A. Alaghi, J. P. Hayes, V. Sathe, and L. Ceze, “Energy-efficient
hybrid stochastic-binary neural networks for near-sensor computing,” in
Proc. Conf. on Design, Automation & Test in Europe, ser. DATE ’17,
2017, pp. 13–18.
[5] Y. Chang and K. K. Parhi, “Architectures for digital filters using
stochastic computing,” in Proc. Int. Conf. Acoustics, Speech and Signal
Processing, May 2013, pp. 2697–2701.
[6] B. Yuan, Y. Wang, and Z. Wang, “Area-efficient scaling-free DFT/FFT
design using stochastic computing,” in Proc. Int. Symp. Circuits and
Systems, May 2016, pp. 2904–2904.
[7] P. Ting and J. P. Hayes, “Stochastic logic realization of matrix opera-
tions,” in Proc. Euromicro Conf. Digital System Design, Aug 2014, pp.
356–364.
[8] K. Kim, J. Kim, J. Yu, J. Seo, J. Lee, and K. Choi, “Dynamic energy-
accuracy trade-off using stochastic computing in deep neural networks,”
in Proc. Design Automation Conf., June 2016, pp. 124:1–124:6.
[9] S. L. Toral, J. M. Quero, and L. G. Franquelo, “Stochastic pulse coded
arithmetic,” in Proc. Int. Symp. Circuits and Systems, vol. 1, May 2000,
pp. 599–602.
[10] P. Ting and J. P. Hayes, “On the role of sequential circuits in stochastic
computing,” in Proc. Great Lakes Symposium on VLSI, 2017,
pp. 475–478.
