Energy and accuracy in multi-stage stochastic computing by Moons, Bert & Verhelst, Marian
  
 
 
 
 
 
 
 
 
 
Citation Bert Moons, Marian Verhelst 
Energy and accuracy in multi-stage stochastic computing 
IEEE New Circuits and Systems (NEWCAS), 2014, pp. 197-200
Archived version Author manuscript: the content is identical to the content of the published 
paper, but without the final typesetting by the publisher 
Published version http://dx.doi.org/10.1109/NEWCAS.2014.6934017 
Journal homepage  http://www.newcas2014.org. 
Author contact bert.moons@esat.kuleuven.be  
+32 (0) 16 321169 
  
 
Energy and Accuracy in Multi-Stage Stochastic Computing
Bert Moons, Marian Verhelst
Department of Electrical Engineering - ESAT, KU Leuven, Leuven, Belgium
Contact: bert.moons@esat.kuleuven.be, marian.verhelst@esat.kuleuven.be
Abstract—The continued scaling of feature sizes in integrated
circuit technology leads to more uncertainty and unreliability
in circuit behaviour. Maintaining the paradigm of deterministic
Boolean computing therefore becomes increasingly challenging.
Stochastic computing (SC) processes digital data in the form
of pseudo-random bit-streams denoting probabilities and is
therefore less vulnerable to uncertainty. When transient circuit
variations are present, SC greatly outperforms classical binary
implementations. Previous work has mainly considered SC circuits
with only a few stages. This paper assesses the feasibility of
multi-stage SC. First, we discuss why accuracy decreases in these
types of circuits. Second, we introduce a straightforward
method to evaluate the accuracy of general SC systems. Third,
the validity of this new approach is illustrated through the design
of a 1D-DCT stochastic circuit, as part of a JPEG compression
accelerator. Last, we couple the results of our analysis to low-level
energy considerations.
I. INTRODUCTION
Digital electronics has always relied on error-less circuit
operation. Precise Boolean functionality, defined in a deter-
ministic logical layer is translated into a physical layer that
produces voltages. These can be interpreted as the needed
exact logic values. This abstraction has been successful, but
becomes ever more costly. All forms of noise and uncertainty
in the physical layer have to be compensated for through more
complex and energy-hungry designs. Recent research
focuses on novel ways to handle device uncertainty more
efficiently. A very promising class of techniques, labeled
“Stochastic Computation”, exploits probability theory to deal
with variations. Shanbhag et al. give an overview of different
techniques [1]. Stochastic Computing (SC), a computational
technique introduced by Gaines [2], processes data in the form
of digitized probabilities. SC has three main advantages over
conventional computing approaches. First, its probabilistic
nature makes it inherently tolerant
to soft transient errors (such as bit-flips) and robust against
spatial variations. Second, it uses very low complexity building
blocks, making it suitable for massively parallel processing.
Third, it allows logic with scalable
precision: shortened bit-streams provide an early estimate
of a value. This makes it possible to trade off precision
for energy at run-time, an advantage that can be exploited
in ultra-low energy electronics. Due to its error tolerance,
SC seems a good alternative for digital designs in
technologies suffering from high uncertainty. Although SC has
been known for decades, very few physical implementations
have been made. Recently, SC has been used in LDPC decoding
[3] and in basic image processing systems [4] [5]. Alaghi
and Hayes [6] and Qian and Riedel [7] [8] have proposed
synthesis approaches for classes of combinational circuits,
thereby enabling a formal approach to generate complex and
in some cases reconfigurable arithmetic functions. Previous
research considers SC circuits with only a few (< 3) stages.
This work focuses on multi-stage SC, using a 1D-DCT
implementation as a benchmark. We first discuss the
reasons for the decreasing accuracy in these types of circuits.
Two inherent inaccuracy effects lead to a rapidly decreasing
SNR. When short stochastic numbers are used, the noise even
becomes dominant. The only way to increase SC’s accuracy is
to use longer bit-streams, which leads to higher energy
dissipation. Exploiting SC’s transient error tolerance
in multi-stage circuits therefore comes at an energy
cost, since long bit-streams are needed to minimize the effect
of inherent faults. Second, we propose and validate a novel
method for the accuracy analysis of these types of circuits and
directly couple the results to low-level energy considerations.
This paper is organized as follows. Section II gives an
overview of stochastic numbers and illustrates SC’s transient
error tolerance. Section III discusses the reasons for decreasing
SNR in multi-stage SC. Section IV proposes a methodological
design strategy for accuracy analysis in SC. Section V
discusses the effect of these considerations on SC’s energy
dissipation. Finally, Section VI concludes this work.
II. STOCHASTIC COMPUTATION
Stochastic numbers (SN) are bit-streams containing N1 1s
and N0 0s denoting the unipolar (UP) number p = N1/(N1 + N0).
Since p always lies in the real-number interval
[0, 1], it can be interpreted as the probability that the
bit-stream outputs a 1. A bipolar (BP) interpretation of the
bit-stream is possible by mapping p onto the [−1, 1] interval
(s = 2p − 1). The precision of the stochastic number is
determined by the length of the bit-stream. A bit-stream of
L = 2^8 bits has a maximal theoretical accuracy of 8 binary
bits. A typical SC system consists of a binary-to-stochastic
(BTS) conversion unit, stochastic arithmetic and a
stochastic-to-binary (STB) converter. The BTS unit can be easily
implemented using LFSR pseudo-random number generators [9].
These can be proven to generate near-exact approximations
of the wanted binary input value. For conversion from SC to
binary a simple counter suffices. The stochastic arithmetic
gate to use depends on the number interpretation. Multiplication can
be done with an AND-gate in the UP format, or an XNOR
in the BP format. Scaled addition can be implemented using a
MUX-gate in both cases [9]. The INV-gate implements (1 − p)
in the UP and (−s) in the BP format. More complex gates
such as comparators and constant multiplication functions are
non-trivial in SC (in contrast to binary logic) and can be
implemented using the synthesis approaches from [6] and [7]
or by using an FSM-based system [10].

[Figure 1, panels: (a) Binary at pt = 1e−3; (b) SC (L = 2^16) at pt = 1e−3.]
Fig. 1. JPEG compression using a binary and stochastic DCT under transient
variations with bit-flip rate pt. Compression rate (CR) and root-mean-square
error with respect to the ideal picture (RMSE) are used as formal performance measures.

[Figure 2, panels: (a) Multi-stage SC adder circuit with inputs X1 ... Xi and
stages 1 ... S; (b) SC’s binomially distributed noise: normalized standard
deviation [−] versus output value [−], with curves for simulations at stage 1,
simulations at stage 5, ideal binomial and zero variance input; (c) Decrease of
signal power: PDF [−] versus SN value [−] for a uniform input and the outputs
of stages 1 to 5; (d) Decrease of SNR for multiple L: mean SNR [dB] versus
stage number [−] for L = 2^8, L = 2^12 and L = 2^16.]
Fig. 2. SNR degradation in multi-stage SC adder circuits.
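The encodings and gates described above can be sketched in a few lines of Python. This is a minimal illustration, not the circuit used in this work: the helper names (to_stream, unipolar, bipolar, bp_to_p) are hypothetical, and Python's software random generator stands in for the LFSR-based BTS unit.

```python
import random

def to_stream(p, length, rng):
    """Encode a unipolar value p in [0, 1] as a list of random bits (BTS stand-in)."""
    return [1 if rng.random() < p else 0 for _ in range(length)]

def unipolar(bits):
    """Decode a bit-stream as p = N1 / (N1 + N0); this is what the STB counter does."""
    return sum(bits) / len(bits)

def bipolar(bits):
    """Decode the same bit-stream as s = 2p - 1 on [-1, 1]."""
    return 2.0 * unipolar(bits) - 1.0

def bp_to_p(s):
    """Map a bipolar value s to the probability of a 1."""
    return (s + 1.0) / 2.0

rng = random.Random(0)
L = 2 ** 12

# Unipolar multiplication: bitwise AND of two independent streams.
a = to_stream(0.5, L, rng)
b = to_stream(0.8, L, rng)
up_product = unipolar([x & y for x, y in zip(a, b)])      # close to 0.5 * 0.8

# Bipolar multiplication: bitwise XNOR.
c = to_stream(bp_to_p(-0.5), L, rng)
d = to_stream(bp_to_p(0.6), L, rng)
bp_product = bipolar([1 - (x ^ y) for x, y in zip(c, d)])  # close to -0.5 * 0.6

# Scaled addition: a MUX whose select line is a p = 0.5 stream.
sel = to_stream(0.5, L, rng)
bp_sum = bipolar([y if s else x for x, y, s in zip(c, d, sel)])  # close to (-0.5 + 0.6) / 2

# Inversion: a NOT gate negates the bipolar value.
bp_neg = bipolar([1 - x for x in c])                       # close to +0.5
```

The results are only approximate because every output is itself a random bit-stream; the size of that error is the subject of Section III.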
In this paper, we implement a stochastic DCT-block as
part of a JPEG-compressor to illustrate SC’s performance in
multi-stage circuits. The DCT block (figure 3) is a classical
implementation based on the work of Hou [11]. Figure 1
illustrates SC’s robustness to transient circuit variations such
as random bit-flips or supply voltage ringing. For this example,
the DCT-block in the JPEG compressor is subjected to bit-flips
at a rate pt of 1e−3. The performance degradation of the
binary implementation is striking, while SC can still achieve
almost perfect performance.
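The mechanism behind this robustness can be illustrated with a small Python sketch. It does not reproduce the JPEG experiment of Figure 1; it only compares, under an assumed independent bit-flip model, how much a value can move in each representation. The function names and the 8-bit fixed-point word are illustrative assumptions.

```python
import random

def flip_bits(bits, p_t, rng):
    """Flip every bit independently with probability p_t."""
    return [b ^ 1 if rng.random() < p_t else b for b in bits]

def binary_word(value, n):
    """n-bit fixed-point word (MSB first) for a value in [0, 1)."""
    code = int(value * 2 ** n)
    return [(code >> i) & 1 for i in reversed(range(n))]

def word_value(bits):
    """Value of an MSB-first fixed-point word."""
    return sum(b * 2.0 ** -(i + 1) for i, b in enumerate(bits))

rng = random.Random(1)
p_t = 1e-3
value = 0.3

# Stochastic number: every bit carries the same weight 1/L,
# so a handful of flips moves the value only slightly.
L = 2 ** 12
stream = [1 if rng.random() < value else 0 for _ in range(L)]
noisy = flip_bits(stream, p_t, rng)
sc_error = abs(sum(noisy) - sum(stream)) / L

# Binary word: bits are weighted, so one flip of the MSB shifts the value by 0.5.
word = binary_word(value, 8)
msb_flipped = [word[0] ^ 1] + word[1:]
worst_binary_error = abs(word_value(msb_flipped) - word_value(word))
```

In the stochastic case the error is bounded by the number of flipped bits divided by L, whereas a single unlucky flip in the binary word already costs half of the full scale.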
III. INACCURACY IN MULTI-STAGE SC CIRCUITRY
SC is known for its tolerance of soft transient circuit varia-
tions. However, due to its randomness and the computation
of correlated bit-streams, SC is also inherently inaccurate.
There are two combined effects in multi-stage SC circuits.
First, noise in stochastic outputs is binomially distributed. This
noise is high, compared to binary, where inherent inaccuracy is
caused by quantization errors. Second, stochastic signal power
tends to decrease after multiple stages. The combination of
these two effects leads to a decreasing SNR in multi-stage
circuits. We will elaborate on both effects in the following
paragraphs.
A. Inherent inaccuracy in stochastic circuitry
Practical implementations of SC circuits use LFSR random
number generators for their binary-to-stochastic (BTS)
transformation. These generate numbers that can be guaranteed to
be near-exact [9], so the variance of any SN directly after the
BTS generator is zero. However, correlation effects will still
randomize the SNs in SC circuitry. The zero-variance
LFSR-generated number is then transformed into a binomially
distributed number. To illustrate this, we explicitly simulate
variance propagation. Figure 2a shows our set-up, consisting
of S stages of SC adder circuits. Note that this test
circuit represents the same circuit as path I in the DCT
implementation (figure 3). In this path, S equals 3.
Figure 2b shows the variance at the outputs of the different
stages and compares it with the variance of a binomially
distributed process. This shows that the binomial distribution
is indeed the best fit with our simulations and suitable for
first order accuracy analysis, even when non-ideal LFSR-
generators are used. If ideal random number generators are
used, the binomial approximation will be exact. The variance
of a stochastic number is then a function of its UP value p
and the SN length L:
$$\sigma^2_{UP} = \frac{\sigma^2_{BP}}{4} = \frac{p(1-p)}{L} \qquad (1)$$
Note that σ^2 is maximal where p = 0.5 and s = 0. The
noise is thus largest where BP numbers have the lowest signal
power. The only way to reduce this variance is by using longer
stochastic numbers. If uniformly distributed input values are
assumed, the mean noise power can be computed as:
$$\sigma^2_{\mathrm{mean-UP}} = \int_0^1 \frac{p(1-p)}{L}\,dp = \frac{1}{6L} \qquad (2)$$
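Equations (1) and (2) can be checked numerically. The sketch below assumes ideal, independent random bits (the case in which the binomial model is stated to be exact) and uses hypothetical helper names; it compares a Monte Carlo variance estimate with p(1 − p)/L and integrates Eq. (1) over p with a midpoint rule.

```python
import random
from statistics import pvariance

def sn_estimate(p, length, rng):
    """Value read back from one random bit-stream of the given length."""
    return sum(1 for _ in range(length) if rng.random() < p) / length

def empirical_variance(p, length, trials, rng):
    """Variance of the read-back value over many independent streams."""
    return pvariance([sn_estimate(p, length, rng) for _ in range(trials)])

def binomial_variance(p, length):
    """Eq. (1): variance of a unipolar SN with value p and length L."""
    return p * (1.0 - p) / length

def mean_binomial_variance(length, steps=10000):
    """Midpoint-rule integral of Eq. (1) over p in [0, 1]; Eq. (2) predicts 1/(6L)."""
    total = sum(binomial_variance((i + 0.5) / steps, length) for i in range(steps))
    return total / steps

rng = random.Random(2)
L = 2 ** 8
measured = empirical_variance(0.5, L, 2000, rng)
predicted = binomial_variance(0.5, L)    # 0.25 / 256, the worst case
mean_noise = mean_binomial_variance(L)   # approximately 1 / (6 * 256)
```

The measured variance at p = 0.5 agrees with the prediction up to the statistical error of the 2000 trials.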
If stochastic multipliers (XNOR) are used, the noise remains
binomial due to correlation effects. If FSM-based
constant-multiplication blocks (with constant c > 1) are used, the
noise scales accordingly: a constant multiplication
gives σout = c · σin. Using this block thus
leads to noise that is even higher than binomial.
These results should be compared to the inherent inaccuracy
in binary systems due to quantization noise:
$$\sigma^2_{\mathrm{mean-binary}} = \frac{LSB^2}{12} = \frac{1}{12 \cdot 2^{2n}} \qquad (3)$$
where n is the binary wordlength. From this first order
estimation, it already becomes clear that very long bit-streams
are needed to achieve the same absolute noise power:
$$\sigma^2_{\mathrm{mean-UP}} = \sigma^2_{\mathrm{mean-binary}} \iff L = 2^{2n+1} \qquad (4)$$

[Figure 3: signal-flow graph of the 8-point DCT from inputs X(1) ... X(8) to
outputs Y(1), Y(5), Y(3), Y(7), Y(2), Y(6), Y(4), Y(8), with constants a, b0,
b1 and d0 ... d3. Legend: + = MUX = (pa + pb)/2; α = XNOR = α · pa;
−1 = INV = −pa; ×2 = FSM = 2 · pa. Gate counts as printed: [27], [12], [19], [8].]
Fig. 3. DCT architecture based on Hou [11]. α are constants with |α| < 1.
The number of each gate type is also indicated. Path I is blue, path II is red.
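Equation (4) follows from setting 1/(6L) equal to 1/(12 · 2^(2n)). The short sketch below verifies this with exact rational arithmetic and lists the resulting bit-stream lengths for a few wordlengths; the function names are illustrative and LSB = 2^(−n) is assumed, as in Eq. (3).

```python
from fractions import Fraction

def sc_mean_noise(length):
    """Eq. (2): mean noise power of a unipolar SN of length L."""
    return Fraction(1, 6 * length)

def binary_mean_noise(n_bits):
    """Eq. (3): quantization noise power of an n-bit word with LSB = 2**-n."""
    return Fraction(1, 12 * 2 ** (2 * n_bits))

def equivalent_length(n_bits):
    """Eq. (4): SN length with the same mean noise power as n binary bits."""
    return 2 ** (2 * n_bits + 1)

for n in (2, 4, 8):
    L = equivalent_length(n)
    # Both noise powers are compared exactly, without rounding.
    assert sc_mean_noise(L) == binary_mean_noise(n)
    print(f"n = {n} bits -> L = 2^{2 * n + 1} = {L}")
```

For n = 8 this gives L = 2^17 = 131072 bits, which shows how quickly the required stream length grows with the target precision.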
B. Decreasing signal power in SC
A second effect concerning accuracy in SC is the decrease
of mean signal power after several stages. This follows because
stochastic numbers never have an amplitude larger than 1.
Multiplication (sa · sb) and scaled addition ((sa + sb)/2) can
therefore never increase the amplitude and, on average, decrease
signal power. Due to this effect, the previous assumption of
uniformly distributed output values cannot be made. Figure 2c
shows how the probability density function (PDF) of the output
values changes in the multi-stage SC circuit of figure 2a. At the
first stage, all gates receive uniformly distributed input values
on the [−1, 1] interval. As these signals pass more stages, their
PDF becomes narrower: numbers with larger amplitudes cease
to appear and the mean signal power drops. If multipliers are
used, the signal power will drop even faster.
This analysis is in contrast to binary systems, where there
is no reduction in signal power after multiple stages.
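The loss of signal power in a chain of scaled adders can be sketched numerically. The code below works on ideal values rather than bit-streams and assumes that the two operands of every adder are independent and identically distributed, which is a simplification; under that assumption the mean power E[s^2] halves at every stage. All names are hypothetical.

```python
import random

def mean_power(samples):
    """Mean signal power E[s^2] of a set of bipolar values."""
    return sum(s * s for s in samples) / len(samples)

def scaled_add_stage(samples, rng):
    """Pair every value with a randomly drawn partner and form (sa + sb) / 2."""
    return [(s + rng.choice(samples)) / 2.0 for s in samples]

rng = random.Random(3)
values = [rng.uniform(-1.0, 1.0) for _ in range(20000)]

powers = [mean_power(values)]   # close to 1/3 for a uniform input on [-1, 1]
for _ in range(5):
    values = scaled_add_stage(values, rng)
    powers.append(mean_power(values))
```

Each stage costs roughly 3 dB of signal power in this model, and the value distribution narrows around zero as described for Figure 2c.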
C. SNR in multi-stage circuits
The combination of effects III-A (high inherent noise at
low amplitudes) and III-B (decreasing signal power after
multiple stages) leads to low signal-to-noise ratios (SNR)
in complex systems. This is illustrated in figure 2d, where the
SNR in the multi-stage adder system of figure 2a is plotted
at the output of every stage. Note that the mean SNR clearly
scales with L and drops after several stages S.
IV. SC ACCURACY EVALUATION
A. Methodological design strategy
Using the results of the previous section, a general method
to evaluate SC’s accuracy can be summarized as follows:
1) Evaluate the output distribution starting from a uniform
input distribution. This can be done numerically.
2) Compute output noise power by modelling inherent
noise in SC as a binomial process. When more complex
blocks, such as ×c, are used, both the signal and the
standard deviation of the noise at this stage are multiplied
by c.
3) Calculate the mean output SNR from the known
output and noise distributions.
4) Compare the achieved SNR/precision with the required
one and choose the SC length L. The required
precision is application dependent.

[Figure 4: mean SNR [dB] versus stage number [−] for L = 2^8, L = 2^12 and
L = 2^16; annotation: lower limit on SNR after 13 stages.]
Fig. 4. SNR after every stage in path II of figure 3.
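The four steps can be sketched in Python for the simplest case, a chain of scaled adders. This is one possible reading of the method under stated assumptions: operands of every adder are independent, the mean SNR is taken as the ratio of mean signal power to mean noise power, and the bipolar form of Eq. (1), σ^2_BP = (1 − s^2)/L, is used for the noise. The absolute values therefore need not match Figure 2d or Figure 4, and all function names and the 20 dB target are hypothetical.

```python
import math
import random

def propagate_adder_stage(samples, rng):
    """Step 1: push value samples through one scaled-adder stage."""
    return [(s + rng.choice(samples)) / 2.0 for s in samples]

def mean_snr_db(samples, length):
    """Steps 2 and 3: binomial BP noise (1 - s^2) / L averaged over the outputs."""
    signal = sum(s * s for s in samples) / len(samples)
    noise = sum(1.0 - s * s for s in samples) / (len(samples) * length)
    return 10.0 * math.log10(signal / noise)

def snr_per_stage(stages, length, n_samples=20000, seed=0):
    """Mean SNR at the input (index 0) and after every adder stage."""
    rng = random.Random(seed)
    samples = [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]
    result = [mean_snr_db(samples, length)]
    for _ in range(stages):
        samples = propagate_adder_stage(samples, rng)
        result.append(mean_snr_db(samples, length))
    return result

def shortest_length(stages, required_db, exponents=(8, 12, 16)):
    """Step 4: smallest L = 2^e whose last-stage SNR meets the requirement."""
    for e in exponents:
        if snr_per_stage(stages, 2 ** e)[-1] >= required_db:
            return 2 ** e
    return None

snr_short = snr_per_stage(3, 2 ** 12)
snr_long = snr_per_stage(3, 2 ** 16)
chosen = shortest_length(3, 20.0)   # 20 dB is an arbitrary example target
```

In this model the SNR drops at every stage and rises by about 12 dB when L grows from 2^12 to 2^16, which mirrors the qualitative behaviour described in Section III-C. Blocks such as ×c would need an extra scaling of signal and noise at the corresponding stage, as stated in step 2.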
B. Accuracy evaluation of 1D-DCT stochastic block
As a practical example we perform an accuracy analysis
on the complex DCT of figure 3. This DCT block contains
several paths with different numbers of stages. We will discuss
two data-paths in the DCT block. First, the shortest path
X(1) to Y (1) (path I), indicated on figure 3 as a blue line.
Second, the longest path X(8) to Y (8) (path II), indicated as
a red line on the figure. The required output precision depends
on the implemented algorithm. If a full precision DCT-block is
wanted, the accuracy requirements will be high. However, in
the JPEG compression algorithm, the outputs of the 2D-DCT
blocks are quantized. Due to this quantization, the required
output precision is not equal to 8 binary bits. The outputs of
path I and II require 4-bit (6.02 · n ≈ 24 dB SNR)
and 2-bit (12 dB SNR) binary precision, respectively.
Path I consists of three stages of scaled stochastic adders.
This is the same circuit as the one from section III and
figure 2a, so the results of that analysis can be used directly. It
is clear from figure 2d that both the L = 2^12 and the L = 2^16
implementations achieve more than 24 dB SNR at S = 3. An
L = 2^12 implementation thus suffices for path I.
Path II is more complex and contains fourteen stages of
different SC circuitry, including ×2 blocks. Its precision
requirements, however, are somewhat relaxed (12 dB).
Figure 4 shows the results of our general method on this path.
The mean SNR is computed after every stage for L = 2^8,
L = 2^12 and L = 2^16 implementations. The figure shows
that only the L = 2^16 implementation achieves better than
12 dB mean SNR at the last stage. The L = 2^8 and L = 2^12
implementations do not suffice. Observe that the mean SNR
is still relatively high when only a few stages are used. These
results are further illustrated in Figure 5, where the full results
of the JPEG compression with stochastic DCT-blocks are
shown and compared. The performance of the implementations
can be verified visually, or more formally by comparing the
achieved compression ratios (CR) and RMSE deviations from
the uncompressed picture. Since the channel Y(8) represents
high spatial frequencies, high noise levels on this channel will
introduce non-existent high frequency terms that cannot be
compensated for in the JPEG quantization step. This explains
the visually noisy images. Only the L = 2^16 performance is
reasonable and well in range of the ideal implementation, as
was predicted by our accuracy evaluation method.

[Figure 5, panels: (a) Ideal binary JPEG compression; (b) SC L = 2^8 JPEG;
(c) SC L = 2^12 JPEG; (d) SC L = 2^16 JPEG.]
Fig. 5. JPEG compression results. Only the L = 2^16 circuit achieves high
performance, both visually and in terms of compression ratio (CR) and RMSE.
V. ENERGY CONSIDERATIONS
Section III has shown that SC requires long numbers if it is
to achieve high accuracy. Many applications require not only
high precision, but also maximal energy-efficiency. SC should
thus not only be as accurate, but also as energy-efficient as
binary arithmetic if it aspires to be a valid alternative. The
energy dissipation in any SC-gate varies linearly with the
bit-stream length L and can be modelled as:
$$E_{SC} = k(f_{SC}, V, \alpha, \text{gate type}) \cdot L \qquad (5)$$
where k is the gate-energy per bit-operation, a function of
the required SC operating frequency fSC, the supply voltage
V, the circuit activity α and the gate type. Since L should
be very high in multi-stage SC circuits (L = 2^16 = 65536
in the JPEG-DCT), the required energy for computation will
also be high. To illustrate this, we choose an example
implementation operating near the minimum energy point of
the SC circuitry in a 40nm technology. If a 1D-DCT delay
of 1000 ns is needed, the SC system should operate at 150
MHz at a parallelization degree of P = 512 (P bit-streams
of length LP = L/P are computed at the same time). Table
I shows the estimated energy dissipation and circuit area for
the different implementations. These estimations only include
the energy usage of the combinational logic, not of any
flip-flops that are needed for data-path synchronization. Operating
near the minimum energy point, the SC arithmetic blocks
consume kMUX = 0.18, kXNOR = 0.13, k×2 = 1.41 and
kINV = 0.0625 fJ/bit-operation. It is clear that even for
the L = 2^8 version, the energy consumption of SC is higher
than that of the binary implementation.
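Equation (5) can be turned into a first-order estimate by summing k · L over all gates. In the sketch below the k values are the ones quoted in the text, but the gate counts are an assumption: the four counts printed in Figure 3 are assigned to the gate types in legend order, which the extracted text does not confirm. The result is therefore of the same order as Table I but is not expected to reproduce it.

```python
# Gate energies in fJ per bit-operation, as quoted in the text.
K_FJ_PER_BIT = {"MUX": 0.18, "XNOR": 0.13, "FSM_x2": 1.41, "INV": 0.0625}

# ASSUMPTION: counts from Figure 3 assigned to gate types in legend order.
GATE_COUNTS = {"MUX": 27, "XNOR": 12, "INV": 19, "FSM_x2": 8}

def energy_per_bit_fj(counts, k=K_FJ_PER_BIT):
    """Energy of one bit-operation through every gate of the circuit."""
    return sum(k[gate] * n for gate, n in counts.items())

def total_energy_fj(length, counts, k=K_FJ_PER_BIT):
    """Eq. (5) summed over all gates: E = L * sum(k_gate * count_gate)."""
    return energy_per_bit_fj(counts, k) * length

def bits_per_parallel_stream(length, parallelism):
    """LP = L / P: bits each of the P parallel streams has to process."""
    return length // parallelism

for exponent, parallelism in ((8, 2), (12, 32), (16, 512)):
    L = 2 ** exponent
    e_total = total_energy_fj(L, GATE_COUNTS)
    lp = bits_per_parallel_stream(L, parallelism)
    print(f"L = 2^{exponent}: E = {e_total:.1f} fJ, LP = {lp} bits")
```

The linear dependence on L is the point of the sketch: going from L = 2^8 to L = 2^16 multiplies the energy by 256, while the parallelism only keeps the per-stream length, and hence the delay, constant.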
TABLE I
ENERGY DISSIPATION IN DIFFERENT 40NM DCT IMPLEMENTATIONS
Implementation   L = 2^8   L = 2^12   L = 2^16   Binary
Parallelism      P = 2     P = 32     P = 512    -
f [MHz]          150       150        150        17
Etotal [fJ]      4.5e3     72.3e3     1156.7e3   2.7e3

VI. CONCLUSION
Stochastic computing is a promising circuit technology, as
it is robust against transient soft variations. However, it should not
be considered a low energy alternative for binary arithmetic.
Multi-stage SC circuits consume a lot of energy because they
need long bit-streams to compensate for their inherent
inaccuracy. This inaccuracy is caused by two effects. First, SC’s
randomness causes noise, which is binomially distributed. In
the BP format, numbers with small amplitudes suffer from the
highest noise power. Second, stochastic signal power tends to
decrease after several stages. The combination of these two
effects leads to a strong decrease of mean SNR. The only way
to increase this SNR is by using longer bit-streams, leading
to higher energy dissipation. We propose a methodological
design strategy to determine stochastic accuracy and length L
and do first order energy estimations in multi-stage SC circuits.
When transient circuit variations are present, SC’s perfor-
mance will not degrade, as opposed to binary logic. Although
SC-logic’s energy cost is high, it might be the only alternative
when transient variations are present, or in emerging tech-
nologies with inherent transient uncertainty and low energy
per bit-operation.
REFERENCES
[1] N. Shanbhag et al., “Stochastic computation,” Design Automation
Conference (DAC), 2010.
[2] R. Gaines, “Stochastic computing systems,” Advances in Information
Systems Science, 1969.
[3] A. Naderi, S. Mannor, M. Sawan, and W. Gross, “Delayed stochastic
decoding of LDPC codes,” IEEE Trans. Signal Proc., 2011.
[4] A. Alaghi, C. Li, and J. Hayes, “Stochastic circuits for real-time image
processing applications,” Design Automation Conference (DAC), 2013.
[5] P. Li and D. Lilja, “Using stochastic computing to implement digital
image processing algorithms,” ICCD, 2011.
[6] A. Alaghi and J. Hayes, “A spectral transform approach to stochastic
circuits,” ICCD, 2012.
[7] W. Qian and M. Riedel, “The synthesis of robust polynomial arithmetic
with stochastic logic,” Design Automation Conference (DAC), 2008.
[8] W. Qian, M. Riedel et al., “An architecture for fault-tolerant computation
with stochastic logic,” IEEE Transactions on Computers, 2011.
[9] A. Alaghi et al., “Survey of stochastic computing,” ACM TECS, 2013.
[10] B. Brown and H. Card, “Stochastic neural computation I: computational
elements,” IEEE Transactions on Computers, 2001.
[11] H. Hou, “A fast recursive algorithm for computing the DCT,” ASSP, 1987.
