Energy-efficiency and accuracy of stochastic computing circuits in emerging technologies by Moons, Bert & Verhelst, Marian
  
 
 
 
 
 
 
 
 
 
Citation: Bert Moons, Marian Verhelst (2014). Energy-Efficiency and Accuracy of Stochastic Computing Circuits in Emerging Technologies. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 4, no. 4, pp. 475-486, 2014.
Archived version: Author manuscript: the content is identical to the content of the published paper, but without the final typesetting by the publisher
Published version: http://dx.doi.org/10.1109/JETCAS.2014.2361070
Journal homepage: http://jetcas.polito.it
Author contact: bert.moons@esat.kuleuven.be, +32 (0) 16 325789
  
 
IEEE JOURNAL ON EMERGING AND SELECTED TOPICS IN CIRCUITS AND SYSTEMS 1
Energy-efficiency and Accuracy of Stochastic
Computing Circuits in Emerging Technologies
Bert Moons, Student Member, IEEE, and Marian Verhelst, Member, IEEE
Abstract—The continued scaling of feature sizes in integrated
circuit technology leads to more uncertainty and unreliability
in circuit behaviour. Maintaining the paradigm of deterministic
Boolean computing therefore becomes increasingly challenging.
Stochastic computing (SC) processes digital data in the form
of long pseudo-random bit-streams denoting probabilities and is
therefore less vulnerable to uncertainty. When transient circuit
variations are present, SC greatly outperforms classical binary
implementations. Under these circumstances, it is impossible for
binary systems to achieve arbitrarily low error rates, while SC
can still trade off precision for energy by using longer
bit-streams. This makes the technique a valuable alternative to
binary logic in emerging technologies with high inherent transient
uncertainty. This paper assesses the feasibility of multi-stage SC
and discusses energy and accuracy considerations in SC design.
First, the basics of SC-circuit design are discussed. Second, we
investigate three different sources of noise or uncertainty and
assess their impact on SC accuracy. Third, we propose a method-
ological design strategy to evaluate the accuracy of general,
multi-stage SC systems. The validity of this new approach is
illustrated through the design of a 1D-DCT stochastic circuit,
as part of a JPEG compression accelerator. Our analysis shows
multi-stage stochastic computing requires very long word lengths
to achieve high accuracy, resulting in low energy efficiency.
Exploiting stochastic computing’s transient error tolerance in
emerging technologies will thus have a high energy cost.
Index Terms—Stochastic Computing, accuracy, energy, mod-
elling, multi-stage
I. INTRODUCTION
Digital electronics has always relied on error-less circuit
operation. Precise Boolean functionality, defined in a deter-
ministic logical layer, is translated into a physical layer that
produces voltages. These can be interpreted as the needed
exact logic values. This abstraction has been successful, but
becomes ever more costly in emerging technologies. All forms
of noise and uncertainty in the physical layer have to be
compensated for through more complex and energy-hungry
designs with large design margins. Recent research focusses on
novel ways to handle device uncertainty more efficiently. A very
promising class of techniques, labeled "Stochastic Computation",
exploits probability theory to deal with variations. Shanbhag et al.
give an overview of different techniques [1]. Stochastic Computing
(SC), a computational technique introduced by Gaines [2] [3],
processes data in the form of digitized probabilities. Von Neumann [4]
also looked into probabilistic logic for unreliable components. SC has
three main advantages over conventional computing approaches.
Bert Moons is with the Microelectronics and Sensors Division (MICAS),
Department of Electrical Engineering (ESAT), KU Leuven, 3001 Leuven,
Belgium (e-mail: bert.moons@esat.kuleuven.be)
Marian Verhelst is with the MICAS division, Department of Electrical
Engineering (ESAT), KU Leuven, 3001 Leuven, Belgium.
Manuscript received ..., 2014; revised ..., 2014.
First, SC's probabilistic nature makes it inherently tolerant to
soft transient errors (such as bit-flips and supply voltage ringing)
and robust against spatial variations. Due to this error tolerance,
SC seems a good alternative to binary computing in emerging
technologies suffering from high uncertainty. Figure 1 illustrates
this robustness to transient circuit variations. For this example, we
have implemented a DCT block as part of a JPEG compressor, both
in a binary and in a stochastic way. Both systems are subjected
to bit-flips at a rate pt = 1e-3. The accuracy degradation
of the binary implementation is striking, while SC still
achieves almost perfect results. However, exploiting SC's
extraordinary transient error tolerance in multi-stage circuits
comes at an energy cost, since long bit-streams
are needed to minimize the effect of inherent faults.
Second, SC uses very low complexity building blocks,
making it suitable for massively parallel processing.
Third, SC makes it possible to create logic with scalable,
progressive precision. A shortened computation already provides
an early estimate of a target value. This allows trading off
precision for energy at run-time, an advantage that can be well
exploited in emerging ultra-low-energy applications [5].
Although SC has been known for decades, very few physical
implementations have been made. Recently SC has been used
in LDPC decoding [6], in basic image processing systems [7]
[8], in fault-tree analyses [9] [10] and in filters [11] [12].
Alaghi and Hayes [13] and Qian and Riedel [14] [15] have
proposed synthesis approaches for classes of combinational
circuits, thereby enabling a formal approach to generate complex
and in some cases reconfigurable arithmetic functions.
Previous research, however, considers only SC circuits with
a few (< 3) stages.
Fig. 1. JPEG compression using a (a) binary and (b) stochastic (L = 2^16)
DCT under transient variations with bit-flip rate pt = 1e-3. Compression
rate (CR) and root-mean-square error relative to the ideal picture (RMSE)
are used as formal performance measures.
© 2014 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be
obtained from the IEEE by sending an email to pubs-permissions@ieee.org.
This work focusses on multi-stage SC. The paper’s main
novelties are the following. We link the advantages of SC to
emerging technologies. We present a systematic breakdown
of the three different types of errors in SC and analyse the
inherent signal loss in multiple stages. From this approach we
derive a formal design methodology for multi-stage circuits.
We use this methodology to implement the first complex 1D-
DCT circuit and use it as a basic block in a JPEG encoder. This
circuit is used to assess SC’s accuracy in multi-stage circuits.
Finally, we show that multi-stage stochastic computing circuits
require very long data streams to achieve high accuracy,
resulting in high energy dissipation. We quantify the system-level
energy consumption, both in a 40nm CMOS and a 26nm
TFET technology.
This paper is organized as follows. Section II gives an
overview of stochastic numbers, arithmetic blocks and low
level design of stochastic circuits. We link the usage of SC to
circuit design in emerging technologies. Section III discusses
different sources of variation and noise in digital systems:
inherent, spatial and transient uncertainty. This section also
analyses the performance in terms of accuracy of a single
stage SC multiplier and compares it to an equivalent binary im-
plementation, both under influence of different noise sources.
Section IV discusses the reasons for decreasing signal power
in multi-stage SC. This meticulous analysis allows proposing
a methodological strategy in section V, which can be used to
design new SC systems and evaluate the accuracy of existing
circuits. Section VI concludes this work.
II. STOCHASTIC COMPUTING
A. Basic theory
Stochastic numbers (SN) are bit-streams containing N1
1s and N0 0s denoting the unipolar (UP) number p =
N1/(N1 + N0). Since p will always lie in the real-number
interval [0,1], it can be interpreted as the probability that
the bit-stream X outputs a 1 at Xi: p = P (Xi = 1). A
bipolar (BP) interpretation of the bit-stream is possible by
transforming p onto the [−1, 1] interval (s = 2p − 1). The
precision of the stochastic number is determined by the length
of the bit-stream. A bit-stream of L = 256 = 2^8 bits has a
maximal theoretical precision of 8 binary bits. The most basic
example of stochastic computing is given in figure 2, which
illustrates stochastic UP multiplication. Multiplication
in the UP format can be implemented with an AND-gate.
The AND-gate of figure 2 multiplies the sequential bit-stream
x = 0, 1, 1, 0, 1, 0 with stream y = 0, 0, 0, 1, 1, 0. Stream x
represents the real number 3/6, since three out of six bits are
a logical 1; stream y represents 2/6. The result of this computation
is 1/6, which is the correct product 3/6 × 2/6. If streams x and y
are correlated, the result could deviate.
Fig. 2. Example of UP stochastic multiplication with an AND-gate:
(3/6) 0,1,1,0,1,0 AND (2/6) 0,0,0,1,1,0 = (1/6) 0,0,0,0,1,0.
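The AND-gate multiplication above can be reproduced in a few lines of Python. This is only a minimal sketch: the helper names (`to_stream`, `up_value`, `and_multiply`) are illustrative and do not come from the paper, and the random streams are drawn from an ideal software Bernoulli source rather than from the LFSR generators discussed later.

```python
import random

def to_stream(p, length, rng):
    """Encode unipolar value p as a bit-stream (ideal Bernoulli source, an assumption)."""
    return [1 if rng.random() < p else 0 for _ in range(length)]

def up_value(stream):
    """Unipolar value of a stream: fraction of ones, N1 / (N1 + N0)."""
    return sum(stream) / len(stream)

def and_multiply(x, y):
    """Bitwise AND of two streams, i.e. unipolar stochastic multiplication."""
    return [a & b for a, b in zip(x, y)]

# The two six-bit streams of figure 2.
x = [0, 1, 1, 0, 1, 0]  # 3/6
y = [0, 0, 0, 1, 1, 0]  # 2/6
z = and_multiply(x, y)
print(z, up_value(z))

# Longer, independently generated streams: the product is only approximate.
rng = random.Random(0)
L = 4096
a = to_stream(0.5, L, rng)
b = to_stream(0.25, L, rng)
independent = up_value(and_multiply(a, b))  # close to 0.5 * 0.25
correlated = up_value(and_multiply(a, a))   # equals up_value(a), not 0.25
print(independent, correlated)
```

The last line illustrates the correlation problem mentioned in the text: ANDing a stream with itself returns the stream unchanged, so the result is p rather than p².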
A more typical SC system consists of a binary-to-stochastic
(BTS) conversion unit, stochastic arithmetic and a
stochastic-to-binary (STB) converter (figure 3f and 3g). The
BTS unit can be easily implemented using LFSR pseudo-random
number generators [16] [17] [18]. These can be proven
to generate near-exact approximations of the wanted binary
input value. For conversion from SC to binary, a simple counter
suffices. The stochastic arithmetic gate to use depends on the
number interpretation. Multiplication can be done with an
AND-gate in the UP format (figure 2 and 3a), or an XNOR-gate in
the BP format (figure 3b). Scaled addition can be implemented
using a MUX-gate in both cases [16] by driving the selector
input with a stochastic number p (figure 3e). The resulting
output will be px × p + py × (1 − p). Using p = 1/2 thus
outputs (px + py)/2. The INV-gate implements (1 − p) in the
UP and (−s) in the BP format (figure 3c). More complex gates
such as comparators and linear gain functions are nontrivial
in SC (in contrast to binary logic) and can be implemented
using the synthesis approaches from [13] and [14] or by
using an FSM-based system (figure 3d) [19]. Basic blocks
for stochastic division and stochastic square roots have also
been presented [20].
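The basic gates described above can be sketched in software as follows. The function names are hypothetical and the streams come from an ideal Bernoulli source (an assumption; hardware would use LFSR-based BTS units), so the printed values are only close to the ideal results.

```python
import random

def bernoulli_stream(p, length, rng):
    """Ideal random bit-stream with P(bit = 1) = p (assumed source)."""
    return [1 if rng.random() < p else 0 for _ in range(length)]

def up(stream):
    """Unipolar interpretation p."""
    return sum(stream) / len(stream)

def bp(stream):
    """Bipolar interpretation s = 2p - 1."""
    return 2.0 * up(stream) - 1.0

def xnor_multiply(x, y):
    """XNOR gate: multiplication in the bipolar format."""
    return [1 - (a ^ b) for a, b in zip(x, y)]

def mux_add(x, y, select):
    """MUX gate: passes x when the select bit is 1, else y (scaled addition)."""
    return [a if s else b for a, b, s in zip(x, y, select)]

def invert(x):
    """INV gate: 1 - p in the UP format, -s in the BP format."""
    return [1 - a for a in x]

rng = random.Random(1)
L = 1 << 14
x = bernoulli_stream(0.8, L, rng)     # UP 0.8, BP 0.6
y = bernoulli_stream(0.25, L, rng)    # UP 0.25, BP -0.5
half = bernoulli_stream(0.5, L, rng)  # select stream with p = 1/2

product_bp = bp(xnor_multiply(x, y))     # ideal value 0.6 * (-0.5)
scaled_sum_up = up(mux_add(x, y, half))  # ideal value (0.8 + 0.25) / 2
print(product_bp, scaled_sum_up, up(invert(x)))
```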
B. Circuit level aspects and comparison with binary
Classical binary systems can be pipelined to increase
throughput. This is not possible in SC systems, since all op-
erations are single stage and the computing style is inherently
sequential. This is a big disadvantage of SC which makes it
difficult to achieve high throughput. The energy and accuracy
of SC systems can be compared to standard binary for the
same overall delay.
Fig. 3. Examples of basic SC arithmetic gates. (a) Unipolar multiplier (AND): Z = x y (UP). (b) Bipolar multiplier (XNOR): Z = x y (BP). (c) Inverter (INV): Z = 1 − x (UP) = −x (BP). (d) FSM-based linear gain function: Z = k x (BP). (e) Scaled adder (MUX): Z = (x + y)/2 (UP, BP). (f) Binary-to-stochastic converter. (g) Stochastic-to-binary converter.
The total delay D of a multi-stage binary and SC system can be summarized as:
$$D_{bin} = \frac{S}{f_{bin}} \tag{1}$$
$$D_{SC} = \frac{L}{f_{SC} P} + \frac{S - 1}{f_{SC}} \tag{2}$$
where fSC is the SC clock frequency, fbin is the binary clock
frequency, S is the number of stages, L is the number length
and P is the degree of parallelization. To reduce energy and
total delay, SC arithmetic functions can be parallelized by a
certain factor P. This implies multiplying the number of gates
by P. Each gate will then compute a shorter bit-stream of
length L/P, thereby reducing the total computation time.
This can be done with little overhead due to the typically low
gate area.
The overall delay increases linearly with S, both in the
binary and the SC case. However, since L/P will dominate
over S−1 in most cases, the total SC delay will not be a strong
function of S. Due to the short data paths of SC gates, fSC can
be much higher than fbin. Therefore it is possible to achieve a
low overall delay DSC , even if very long L are used. It is clear
from previous equations, that equal delay can be reached after
fewer stages, if the parallelization degree P increases or if the
frequency ratio fSC/fbin increases. However, P is limited by
area constraints, since higher P means an increased number of
used gates. fSC is limited by energy considerations. If fSC is
required to be very high, the system’s minimal supply voltage
will have to increase. This results in higher energy dissipation.
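Equations (1) and (2) are easy to explore numerically. The sketch below is illustrative only: the function names are hypothetical, and the clock frequencies are round numbers corresponding to clock periods of 32 ns and 2 ns, chosen for the example rather than taken as design results.

```python
def binary_delay(stages, f_bin):
    """Equation (1): D_bin = S / f_bin."""
    return stages / f_bin

def sc_delay(length, stages, f_sc, parallel):
    """Equation (2): D_SC = L / (f_SC * P) + (S - 1) / f_SC."""
    return length / (f_sc * parallel) + (stages - 1) / f_sc

def equal_delay_length(stages, f_bin, f_sc, parallel):
    """Stream length L for which D_SC equals D_bin (solves equation (2) for L)."""
    budget = binary_delay(stages, f_bin) - (stages - 1) / f_sc
    return budget * f_sc * parallel

# Assumed example values: 32 ns binary period, 2 ns SC period, P = 16.
F_BIN = 31.25e6
F_SC = 500e6
P = 16

for S in (1, 3, 10):
    L_eq = equal_delay_length(S, F_BIN, F_SC, P)
    print(S, binary_delay(S, F_BIN), L_eq, sc_delay(L_eq, S, F_SC, P))
```

For a fixed delay budget, the admissible stream length grows with both P and the ratio fSC/fbin, which is the trade-off discussed in the paragraph above.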
The energy dissipation in an SC-gate can always be modelled
as:
$$E_{SC} = k(f_{SC}, \alpha, V, C) \cdot L \tag{3}$$
where k is the energy per bit-operation, a function of the
required SC operation frequency fSC, supply voltage V,
circuit activity α and the switching capacitance C. L is the
stochastic number length. An SC system uses long bit-streams
to achieve high accuracy (section III-A) and thus requires
high clock speeds to reach a certain computing delay. Despite
this need for high fSC, the SC supply voltage V can still be
well below the nominal voltage, owing to the use of
extremely short data paths. The parallelization degree P has
no direct influence on the energy dissipation; it only influences
delay and area.
For a real design using SC-logic we assume the overall
system delay to be specified by the application. The number of
stages is also fixed and determined by the circuit architecture.
The stream length L will be determined by the required
accuracy as explained below. P is chosen to minimize the SC delay
within the area of the equivalent binary system. fSC is then chosen
to minimize energy. If it is too high, the energy per
bit-operation k will increase, leading to a low global
energy-efficiency.
C. Stochastic computing in emerging technologies
SC’s tolerance to transient errors makes it an interesting
logic type for emerging technologies. Current research indi-
cates future integrated circuits will suffer from reduced noise
margins or cycle-to-cycle variations, making digital systems
more sensitive to random telegraph noise (RTN) or radiation.
All these effects can result in transient errors. We give three
examples from literature.
First, sub-22nm CMOS suffers from increased voltage scal-
ing. Using lower supply voltages reduces noise margins and
increases the relative effect of resistive and inductive supply
drops [21]. In general, the soft error rate increases when supply
voltage is lowered [22].
Second, resistive RAM technologies such as HfO2 RRAM
show intrinsic cycle-to-cycle switching variability [23]. Since
RRAM allows creating non-volatile flip-flops [24], these
switching variations might not only cause transient errors in
memory, but also in arithmetic circuits.
Third, new narrow band-gap devices such as tunnel-FETs
(TFET) or graphene nanoribbon FETs [25] also suffer from
reduced noise margins. They are attractive for channel replace-
ment due to their mobility enhancement compared to silicon,
but are more sensitive to RTN because of their narrow
band-gaps [26] [27] [28]. Simulations on a ring oscillator show
graphene nanoribbon FETs have a 25× to 144× advantage
in energy-delay product (EDP) compared to silicon implementations [25].
Modelling from [29] and [30] shows the delay in CMOS increases about
two orders of magnitude more than the delay in TFETs for
the same voltage scaling (0.6 V to 0.2 V). TFET thus has high
performance at very low power. This allows better exploitation
of SC’s ultra-short data paths; voltage can be scaled further,
while maintaining performance. Figure 4 illustrates that the
reviewed emerging technologies can be a good match with
SC.
Fig. 4. Integrated circuit technologies in the Energy-Transient Error Rate
space (horizontal axis: transient error rate; vertical axis: energy per
bit-operation k; plotted technologies: current CMOS, advanced CMOS, TFET,
graphene based FET and RRAM; regions: "No SC potential" and "SC potential").
Current CMOS is not suitable for SC due to its high energy per bit-
operation k (section III, V-C). Emerging technologies, with low k and high
transient error rate, can benefit from stochastic computing.
D. Multi-stage Stochastic Circuits
SC is known for its tolerance of soft transient circuit
variations. Previous work has proven circuits with a few stages
can achieve very good accuracy and energy-efficiency. The
implementation of a stochastic edge detector in [7] is an
example of SC outperforming a binary implementation. But,
there has been no previous research on more general multi-
stage circuits. Due to SC’s randomness and the computation of
correlated bit-streams, SC is inherently inaccurate, a problem
that becomes more severe in systems with more stages.
There are two combined effects in multi-stage SC circuits.
First, inherent noise in stochastic outputs is binomially dis-
tributed. This noise is high, compared to binary, where inherent
inaccuracy is caused by quantization errors. Noise due to
spatial and transient circuit variations also exists, but has little
impact on SC’s accuracy. Second, stochastic signal power
tends to decrease after multiple stages. The combination of
these two effects (high noise and decreasing signal power)
leads to a decreasing SNR in multi-stage circuits, and thus to
a decreasing accuracy. We will elaborate on these effects in
sections III, IV and V.
III. NOISE IN STOCHASTIC COMPUTING
There are three major sources of errors in advanced technol-
ogy digital computations: errors inherent to the used logic type
(type I), errors due to static spatial circuit variations (type II)
and errors due to dynamic transient circuit variations (type III).
The three types of errors discussed are independent random
processes; their effects are additive and should be combined
to evaluate global performance.
Examples of type I errors are quantization faults in the
binary logic type, and faults due to stochastic correlation in
the SC logic type. Interestingly, Alaghi and Hayes [31] try
to exploit this correlation to create new ad-hoc stochastic
functions. Type II errors stem from spatial circuit variations.
These are variations that are random in space, but fixed in time,
such as random doping fluctuations or any kind of inter- or
intra-die variations. These are already omnipresent in current
transistor technologies and will become more important in
future technologies. Since the critical path of a SC multiplier
is fixed and very short (in contrast to the critical path of a
binary multiplier), it is expected that the influence of spatial
variations on SC output accuracy is limited (see III-D2). Type
III errors stem from fast transient circuit variations. These are
variations that are random in time and space, such as random
bit-flips, radiation effects, or supply-voltage ringing. In current
technologies, spatial variations are still the dominant source
of uncertainty, but transient variations are becoming more
important in more advanced CMOS or in emerging post-Si
technologies as dopant levels and voltage headroom decrease
further. As the following paragraphs will indicate, SC's
probabilistic nature makes it less vulnerable to this type of
variation than binary systems.
A. Type I: inherent noise
Practical implementations of SC circuits use LFSR random
number generators for their binary-to-stochastic (BTS)
transformation. These generate numbers that can be guaranteed
to be near-exact [16]. The variance of any SN after the
BTS-generator will be zero. However, uncontrollable
correlation between two computed stochastic numbers will still
randomize the SNs in SC circuitry. After several stages, the
zero-variance LFSR-generated number will converge to a
binomially distributed number. To illustrate this, we explicitly
simulate variance propagation. Figure 5a shows our set-up,
consisting of S stages of SC adder circuits. Note that this
test circuit represents the same circuit as path I in the DCT
implementation (figure 9), in which S equals three.
Fig. 5. SC output values are binomially distributed after several stages.
(a) Multi-stage SC test setup. (b) Computed standard deviation after several
stages (horizontal axis: output value; vertical axis: normalized standard
deviation; curves: simulations stage 1, simulations stage 5, ideal binomial,
zero variance input).
Figure 5b shows the variance at the outputs of the different
stages and compares it with the variance of a binomial
distributed process. This shows that the binomial distribution is
indeed a good approximation and suitable for first order accu-
racy analysis, even when pseudo-random LFSR-generators are
used. If ideal random number generators are used, the binomial
approximation will be exact. Ma [32] discusses the modelling
of inherent noise in SC as a hypergeometric process. This is
however not needed for our first order analysis.
Using the binomial model, the variance of a stochastic
number is then a function of its UP value p and the SN length
L:
$$\sigma^2_{UP} = \frac{\sigma^2_{BP}}{4} = \frac{p(1-p)}{L} \tag{4}$$
Note that σ² is maximal where p = 0.5 and s = 0. The
noise is thus largest where BP numbers have the lowest signal
power. The only way to reduce this variance is by using longer
stochastic numbers. If uniformly distributed input values are
assumed, the mean noise power across all possible input values
can be computed as:
$$\sigma^2_{\mathrm{mean\text{-}UP}} = \int_0^1 \frac{p(1-p)}{L}\,dp = \frac{1}{6L} \tag{5}$$
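The binomial model of equations (4) and (5) can be checked with a small Monte-Carlo experiment. This is a sketch under the assumption of ideal, independent random bits (the case in which the text states the binomial model is exact); the function names are illustrative.

```python
import random
import statistics

def sn_variance_model(p, length):
    """Equation (4): binomial variance of a unipolar SN of given length."""
    return p * (1.0 - p) / length

def simulate_sn_variance(p, length, trials, rng):
    """Empirical variance of the UP value over many independently drawn streams."""
    values = []
    for _ in range(trials):
        ones = sum(1 for _ in range(length) if rng.random() < p)
        values.append(ones / length)
    return statistics.pvariance(values)

def mean_noise_power(length, steps=10000):
    """Midpoint-rule approximation of the integral in equation (5)."""
    h = 1.0 / steps
    return sum(sn_variance_model((i + 0.5) * h, length) for i in range(steps)) * h

rng = random.Random(5)
L = 256
for p in (0.1, 0.5, 0.9):
    print(p, sn_variance_model(p, L), simulate_sn_variance(p, L, 2000, rng))
print(mean_noise_power(L), 1.0 / (6 * L))
```

The simulated variance peaks at p = 0.5, and the numerical integral agrees with the closed form 1/(6L).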
If basic stochastic blocks (AND, MUX, XOR, XNOR, INV)
are used, the noise remains binomial due to correlation effects.
If FSM-based constant-multiplicand [19] (with multiplicand
c > 1) blocks are used, the variance scales accordingly.
For example, a constant multiplication gives σout = c · σin
if c > 1. Using this block thus leads to noise that is even
higher than binomial.
These results should be compared to the inherent inaccuracy
in binary systems:
$$\sigma^2_{\mathrm{mean\text{-}binary}} = \frac{LSB^2}{12} = \frac{1}{12 \cdot 2^{2n}} \tag{6}$$
where n is the binary wordlength. From this first-order
estimation, it already becomes clear that very long bit-streams
are needed to achieve the same absolute noise power.
$$\sigma^2_{\mathrm{mean\text{-}UP}} = \sigma^2_{\mathrm{mean\text{-}binary}} \iff L = 2^{2n+1} \tag{7}$$
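To give a feeling for equation (7), the following sketch tabulates the stream length needed to match the quantization noise of an n-bit binary word. The function names are illustrative; the formulas are equations (5), (6) and (7) as stated above.

```python
import math

def binary_noise_power(n_bits):
    """Equation (6): mean quantization noise power of an n-bit binary word."""
    return 1.0 / (12 * 2 ** (2 * n_bits))

def sc_mean_noise_power(length):
    """Equation (5): mean inherent noise power of a unipolar SN of length L."""
    return 1.0 / (6 * length)

def equivalent_stream_length(n_bits):
    """Equation (7): stream length with the same mean noise power as n bits."""
    return 2 ** (2 * n_bits + 1)

for n in (4, 8, 11):
    L = equivalent_stream_length(n)
    same = math.isclose(sc_mean_noise_power(L), binary_noise_power(n))
    print(n, L, same)
```

Each additional binary bit multiplies the required stream length by four, which is why matching even moderate binary precision calls for very long bit-streams.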
B. Type II: noise due to spatial circuit variations
Spatial circuit variations such as random doping fluctuations
may also cause inaccuracy in digital systems. Circuit designers
cope with these uncertainties by introducing extra static design
margins in the form of higher supply voltages or conservative
lay-outs. Designs in technologies with high spatial variations
therefore typically have a low energy-efficiency. In binary
systems, faults due to spatial variations should always be
prevented, since these will typically lead to timing errors on
the MSB critical path. Faults on these paths are large in
magnitude and therefore result in a high root-mean-square-
error (RMSE).
In SC, there is a possibility to trade off energy, area and
precision. Because of SC's sequential nature, errors due to
spatial variations will be small and on the order of an LSB. A
limited introduction of errors due to these variations may be
tolerable if the associated energy gain, due to smaller design
margins such as a lower supply voltage, is sufficient. Consider
a single SC multiplier, computing an SN of length L. The
output accuracy of this single AND-gate is determined by
variations which are randomly distributed, but which after
production are static and fixed in time and space. The resulting
output value is a sample of the distribution N1(µL, σL), where µL is
the expected mean deviation from the ideal value p and σL is its
standard deviation. By parallelizing this computation by a
factor P = 2, a second AND-gate is introduced. Both gates
now compute SNs of length L2 = L/2. The resulting output is
determined by two samples of the random distribution:
$$N_2 = N(\mu_{L_2}, \sigma_{L_2}) + N(\mu_{L_2}, \sigma_{L_2}) \tag{8}$$
$$N_2 = N(\mu_L, \sigma_L/\sqrt{2}) \tag{9}$$
If P gates are used, the distribution of the output value
becomes:
$$N_P = N(\mu_L, \sigma_L/\sqrt{P}) \tag{10}$$
Parallelizing can thus effectively reduce the RMSE due to
spatial variations. However, the RMSE cannot be reduced below a
limit determined by µL. In these equations, µL and σL
are determined by the amount of spatial variations (fixed by
the supply voltage and circuit technology) and by the SN-
value. Figure 6 shows the errors due to spatial variations in an
SC multiplier (computing input²) under different circumstances and for
different P as a function of the input value. These plots are
generated using Monte-Carlo simulations in Spice. Figure 6a
and figure 6b show the errors when the supply voltage V is
dropped 33% and 16.5% from the error-less voltage. This is
equivalent to a drop in energy dissipation of respectively 60%
and 30%. Observe the zero mean at a 0.7 input, resulting in
0.7² ≈ 0.5. At this input value, the AND-gate ideally outputs
as many zeros as ones. The number of too-slow pull-ups will
then equal the number of too-slow pull-downs, resulting in a
zero mean deviation. Lower input values lead to a negative
mean deviation, since the number of zeros in these output
streams is larger than the number of ones. Therefore, there
will be more failed zero-to-one transitions due to too-slow
pull-ups, resulting in a negative error. Lower supply voltages
decrease the energy dissipation, but increase RMSE.
Fig. 6. Mean errors and standard deviations as a function of the UP input
value of an SC multiplier (horizontal axis: input value (UP); vertical axis:
error; curves: mean, ±σ at P = 1, ±σ at P = 16). (a) 60% energy gain and
(b) 30% energy gain due to voltage overscaling. Only the standard deviation
is a function of P, the mean is not. The mean RMSE in (a) is 7.5% when
P = 1 and 5.3% when P = ∞. Implementation (b) has respectively
RMSE = 1.8% and RMSE = 0.65%.
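The effect of equation (10) on the RMSE can be illustrated with a toy model. The sketch below makes a simplifying assumption that is not in the paper: each of the P gates contributes an independent, static Gaussian error with the same mean µ and standard deviation σ, and the output error is their average. The numerical values of µ and σ are arbitrary illustration values, not simulation results.

```python
import math
import random
import statistics

def rmse_model(mu, sigma, parallel):
    """RMSE of an error distributed as N(mu, sigma / sqrt(P)), cf. equation (10)."""
    return math.sqrt(mu ** 2 + sigma ** 2 / parallel)

def simulate_parallel_error(mu, sigma, parallel, dies, rng):
    """Output error per fabricated die, averaging P independent static gate errors."""
    errors = []
    for _ in range(dies):
        gate_errors = [rng.gauss(mu, sigma) for _ in range(parallel)]
        errors.append(sum(gate_errors) / parallel)
    return errors

rng = random.Random(2)
MU, SIGMA = 0.05, 0.05  # assumed example values
for P in (1, 4, 16):
    errors = simulate_parallel_error(MU, SIGMA, P, 5000, rng)
    print(P, statistics.mean(errors), statistics.pstdev(errors),
          rmse_model(MU, SIGMA, P))
```

In this model the spread shrinks as 1/√P while the mean stays put, so the RMSE approaches |µ| for large P, in line with the observation that only the standard deviation depends on P.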
Since spatial variations are fixed in time, a single AND-gate
will always make the same errors after production. Faults thus
become repetitive and deterministic. It is therefore better to
tune out spatial variations completely by using higher supply-
voltages.
C. Type III: noise due to transient circuit variations
Transient circuit variations such as random bit-flips,
cycle-to-cycle variations, radiation effects or supply-voltage ringing
may cause severe faults in digital circuits. In current
technologies, spatial variations are still the dominant source of
uncertainty, but transient variations will become more important in
the future (section II-C). System simulations allow assessing
this type of error. In SC, such errors can be efficiently modelled
by placing an XOR-gate on every logical output node. Input
stream pa will be distorted at rate pt, where pt will be very
low. The resulting pout equals xor(pa, pt). If a bit of
bit-stream pt equals 1, the corresponding bit of stream pa will
invert. SC does not suffer greatly from transient circuit
variations. This can be seen in the bipolar format, where the XOR
computes a multiplication and inversion (see section II).
Transforming the unipolar numbers to their equivalent bipolar
representation, the outcome of the BP number sa = 2pa − 1
distorted at a rate st = 2pt − 1, with pt = 1e-3, can be easily
computed (equation 11).
$$s_{out} = \mathrm{xor}(s_a, s_t) = -s_a \cdot s_t = 0.998 \cdot s_a \tag{11}$$
This is an extra decrease in signal power (see section IV),
but in this case only a small distortion.
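The XOR-based bit-flip model can be tried out directly. In this minimal sketch the helper names are hypothetical, and both the data stream and the flip stream are assumed to be independent ideal Bernoulli streams; pa = 0.8 is an arbitrary example value.

```python
import random

def bernoulli_stream(p, length, rng):
    """Ideal random bit-stream with P(bit = 1) = p (assumed source)."""
    return [1 if rng.random() < p else 0 for _ in range(length)]

def xor_distort(stream, flips):
    """Invert every data bit whose corresponding flip bit equals 1."""
    return [a ^ f for a, f in zip(stream, flips)]

def distorted_up(pa, pt):
    """Expected unipolar value after independent flips at rate pt."""
    return pa * (1.0 - pt) + (1.0 - pa) * pt

def to_bp(p):
    """Unipolar to bipolar: s = 2p - 1."""
    return 2.0 * p - 1.0

PT = 1e-3
pa = 0.8
s_a = to_bp(pa)
s_out_model = to_bp(distorted_up(pa, PT))  # equals (1 - 2 * PT) * s_a

rng = random.Random(4)
L = 1 << 16
data = bernoulli_stream(pa, L, rng)
flips = bernoulli_stream(PT, L, rng)
s_out_sim = to_bp(sum(xor_distort(data, flips)) / L)
print(s_a, s_out_model, s_out_sim)
```

The model value reproduces the factor 0.998 of equation (11); the simulated value additionally carries the inherent (type I) noise of a finite stream.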
D. Quantitative comparison of noise levels in single stage SC
and binary multipliers
In order to quantitatively compare the noise in SC and
binary digital electronics due to circuit variations, we first
simulate a single stage SC multiplier as well as a standard
array multiplier (without pipelining). Both systems are equally
exposed to the previously mentioned sources of uncertainty.
Type III (transient) errors are simulated on the system level.
Type II (spatial) variations require transistor level Spice sim-
ulations.
1) Simulation set-up: Circuit simulations for SC are set up
as follows: two random bit-streams pa and pb are multiplied
using a 40nm AND-gate. At a given clock frequency, supply
voltage is swept. For every voltage step the accuracy impact
due to spatial variations is recorded. This dependency is only
a function of the used circuit, spatial variations and frequency.
The minimal supply voltage at which no type II errors occur is
used to further assess the impact of type I and III errors. The
binary multiplier works at a low frequency fbin of 31MHz
(period = 32ns) to operate near the minimum energy point.
The SC-multiplier hence requires a much higher fSC of 496
MHz (period = 2ns) at a parallelization degree of P = 16.
All SC and binary circuit simulations are done using Spice
with 15 Monte-Carlo runs, which offer sufficient resolution
for the targeted first order analysis. To mimic more advanced
technologies with more uncertainty, extra Vt- and β-mismatch
is added using a Verilog-A behavioural model. Transient circuit
variations such as bit-flips or cycle-to-cycle variations are
modelled with an XOR-gate on every logical output node.
Input stream pa will be inverted at a distortion rate pt.
[Figure 7: (a) RMSE versus number of bits (binary) respectively log2[L] (stochastic), for SC and binary without temporal variations and with added RMSE at pt = 1e−5 and pt = 1e−3. (b), (c) Energy [fJ] versus RMSE for SC (L = 2 to L = 4096) and binary (n = 1 to n = 11).]
Fig. 7. A comparison of the effect of three different noise sources - inherent,
spatial and transient - on the output RMSE of a binary and stochastic single
stage multiplier. (a) Transient variations + inherent RMSE (b) Transient
variations + inherent RMSE (c) Spatial variations + inherent RMSE, Avt and
Aβ are Pelgrom's constants.
2) Simulation results: Figure 7 shows the results of our
simulations. Figure 7a shows the inherent RMSE and the
influence of transient circuit variations, both for the binary
and the SC implementation. The full lines plot inherent RMSE
versus the bitwidth n for binary systems and versus the
stream length L for SC systems. The markers show the added
RMSE due to transient circuit variations at different rates pt
as a function of n or L. The marked lines show the total
combined RMSE. By introducing transient errors, the achieved
RMSE will be higher for the same n or L. This figure clearly
shows that SC multipliers outperform binary as they can reach
much lower RMSE under the same circumstances, by using
longer bit-streams. This insensitivity was already illustrated
visually in figure 1.
Figure 7b further illustrates this feature by plotting the
energy consumption as a function of RMSE for binary and
stochastic implementations. Even at the relatively low transient
error rate of 1e−5, it is impossible to achieve an RMSE
lower than 3e−8 using a binary system. This corresponds
to a binary accuracy of 5 bits. At a pt of 1e−3, only 2-bit
binary accuracy can be reached. The accuracy degrades further
when using larger bit-widths. This degradation is due to two
reasons. First, the number of logical/flippable nodes increases
quadratically in a binary carry-save multiplier. Second, MSB-
nodes flip at the same rate as LSB-nodes, but contribute much
more to the global RMSE. SC clearly has an advantage over
binary computing in the case of transient circuit variations. The
contribution of type III variations to the global RMSE at a flip
rate of 1e−5 is negligible. SC's reduced hardware complexity
leads to fewer logical nodes. Furthermore, flipped bits always
lead to an LSB error. Low mean errors can be achieved by
using longer bit-streams, or equivalently, by investing more
energy (see section II-B and V-C).
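The different weight of a flipped bit in both number formats can be illustrated with a small closed-form sketch. It assumes a strongly simplified model that is not the simulation set-up of this paper: only the output bits flip (independently, at rate pt), the binary word is an n-bit fraction with random data bits, and the SC stream is a unipolar stream of L random bits with probability p. All function names are hypothetical.

```python
import math

def binary_flip_rmse(n_bits, pt):
    """RMS error of an n-bit fraction in [0, 1) under independent output bit-flips.

    Flipping bit i (i = 1 is the MSB) changes the value by 2**-i, so the MSB
    dominates the error. Cross terms vanish for random data bits.
    """
    mse = pt * sum(4.0 ** -i for i in range(1, n_bits + 1))
    return math.sqrt(mse)

def sc_flip_rmse(length, pt, p=0.5):
    """RMS error of a unipolar stream estimate under independent bit-flips.

    Every flipped bit moves the estimate by exactly 1/length, i.e. one LSB.
    """
    bias = pt * (1.0 - 2.0 * p)        # mean shift of the estimate
    var = (pt - bias ** 2) / length    # variance of the shift
    return math.sqrt(bias ** 2 + var)

pt = 1e-3
print("binary, n = 8   :", binary_flip_rmse(8, pt))
print("SC,     L = 4096:", sc_flip_rmse(4096, pt))
```

Under these assumptions the binary error is set by the MSB weight and does not improve with n, while the SC error shrinks with longer streams.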
Figure 7c plots the energy of a multiplication operation
using both the SC and the binary logic type, as a function
of achieved RMSE for different amounts of type II varia-
tions. These simulation results also take type I variations into
account. From equations 3 and 5 the inverse quadratic relationship
between energy dissipation and noise power $\mathrm{RMSE}^2 = \sigma^2$
could be predicted:
$$E_{SC} = \frac{k}{6 \cdot \mathrm{RMSE}^2} \tag{12}$$
For SC in the 40nm case, k equals 0.13 fJ/bit-operation.
In the case with Avt = 5.0e−9 V·m and Aβ = 5.0e−9 m, k
equals 0.24 fJ/bit-operation. For n ≥ 2, a best fit for the
40nm binary case can be given by:
$$E_{bin} = c \cdot (2^n)^{1/2} \tag{13}$$
$$E_{bin} = c \cdot \left(\frac{1}{\sqrt{12} \cdot \mathrm{RMSE}}\right)^{1/2} \tag{14}$$
where c equals 3.3. In the case with highest Avt and Aβ, c
equals 15.5. SC has no advantage over binary for any RMSE
in the 40nm case for low to moderate spatial variations. It
is clear that SC only outperforms binary multiplication in
terms of energy usage when very high RMSE are tolerated
and high spatial variations are present. High RMSE can be
allowed in some image processing applications, such as edge
detection [7]. At high RMSE, the energy usage is generally
similar between the two systems, but it rises more quickly in the
binary multiplier than in SC with increasing spatial variations.
However, due to the quadratic dependence of energy on RMSE in SC,
binary logic still performs better. Furthermore, the energy usage
in binary can be reduced by pipelining the multiplier; this
effectively reduces the impact of type II variations on delay
and energy. Pipelining is not possible in the SC multiplier,
since its arithmetic blocks only have a single stage and are
inherently sequential.
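The fitted energy models of equations 12 to 14 can be evaluated with a few lines of Python. This is only a sketch: the function names are hypothetical, the relation E_SC = k · L is assumed from the statement that energy scales linearly with L, and the unit of c is assumed to be fJ as on the energy axis of figure 7.

```python
import math

K_SC_40NM = 0.13   # fJ per bit-operation, 40nm SC case (from the text)
C_BIN_40NM = 3.3   # fitted constant, 40nm binary case (unit assumed to be fJ)

def sc_rmse(length):
    """Inherent SC RMSE implied by equation 12 when E_SC = k * L."""
    return 1.0 / math.sqrt(6.0 * length)

def binary_rmse(n_bits):
    """Binary quantization RMSE implied by equations 13 and 14."""
    return 1.0 / (math.sqrt(12.0) * 2.0 ** n_bits)

def energy_sc(rmse, k=K_SC_40NM):
    """Equation 12: E_SC = k / (6 * RMSE^2)."""
    return k / (6.0 * rmse ** 2)

def energy_bin(rmse, c=C_BIN_40NM):
    """Equation 14: E_bin = c * (1 / (sqrt(12) * RMSE))^(1/2)."""
    return c * math.sqrt(1.0 / (math.sqrt(12.0) * rmse))

for length in (2, 256, 4096):
    r = sc_rmse(length)
    print(f"SC  L = {length:5d}: RMSE = {r:.2e}, E = {energy_sc(r):8.2f}")
for n_bits in (2, 6, 11):
    r = binary_rmse(n_bits)
    print(f"bin n = {n_bits:5d}: RMSE = {r:.2e}, E = {energy_bin(r):8.2f}")
```

Passing k = 0.24 and c = 15.5 instead reproduces the fits for the case with the highest Avt and Aβ.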
The usage of this proposed framework allows a quick
evaluation of the single-stage energy-efficiency of SC in a
new technology if parameters k, c and pt are known. In
technologies with sufficiently low k and high pt, SC will be
preferable to binary computation. We will quantify this by
comparing actual implementations of a binary and an SC
DCT block in 40nm CMOS and TFET in section V.
The previous analysis is, however, not sufficient for
multi-stage systems, since it does not incorporate the decrease
of signal power after several gates. The following sections
elaborate on this effect.
[Figure 8: (a) PDF [−] versus SN value [−] for a uniform input and for the outputs of stages 1 to 5. (b) Mean SNR [dB] versus stage number [−] for L = 2^8, L = 2^12 and L = 2^16.]
Fig. 8. SNR degradation in multi-stage SC systems. (a) Probability density
function (PDF) of stochastic numbers after several stages. (b) Signal-to-noise
power ratio (SNR) after several stages.
[Figure 9: signal-flow graph from inputs X(1) to X(8) to outputs Y(1), Y(5), Y(3), Y(7), Y(2), Y(6), Y(4), Y(8). Legend: + = MUX = (pa+pb)/2; α = XNOR = α·pa; −1 = INV = −pa; ×2 = FSM = 2·pa. Legend annotations: [27], [12], [19], [8].]
Fig. 9. DCT architecture based on Hou [33]. α are constants with |α| < 1. The number of each gate type is also indicated. Path I is blue, path II is red.
IV. DECREASING SIGNAL POWER IN MULTI-STAGE
STOCHASTIC COMPUTING
A second effect concerning accuracy in SC is the decrease
of mean signal power after several stages (figure 8). This is
evident, since stochastic numbers always have an amplitude
smaller than 1. Scaled addition (sc = (sa + sb)/2), for example,
will have sc < max(sa, sb): it can only make numbers smaller
in amplitude and therefore decreases mean signal power
Psig = s^2. Due to this effect, the previous assumption of
uniformly distributed output values cannot be made. Figure 8a
shows how the probability density function (PDF) of the output
values changes in the multi-stage SC circuit of figure 5a. At the
first stage, all gates receive uniformly distributed input values
on the [−1, 1] interval. As these signals pass more stages, their
PDF becomes narrower around s = 0. Numbers with larger
amplitudes cease to appear and the mean signal power drops.
If multipliers are used, the signal power will drop even faster.
This analysis is in contrast to binary systems, where there is
no reduction in signal power after multiple stages.
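The narrowing of the output distribution can be reproduced numerically. The sketch below is a Monte-Carlo approximation under stated assumptions: every stage consists only of scaled adders whose two inputs are drawn independently from the previous stage, which is a simplification of the circuit of figure 5a. Function names, sample count and seed are illustrative.

```python
import random

def mean_power(samples):
    """Mean signal power, the average of s^2 over all samples."""
    return sum(s * s for s in samples) / len(samples)

def scaled_adder_stage(samples, rng):
    """One stage of scaled additions (sa + sb) / 2 on independently drawn inputs."""
    n = len(samples)
    return [(samples[rng.randrange(n)] + samples[rng.randrange(n)]) / 2.0
            for _ in range(n)]

def power_per_stage(stages=5, n_samples=50_000, seed=0):
    """Mean signal power at the input and after every stage."""
    rng = random.Random(seed)
    samples = [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]
    powers = [mean_power(samples)]
    for _ in range(stages):
        samples = scaled_adder_stage(samples, rng)
        powers.append(mean_power(samples))
    return powers

powers = power_per_stage()
for stage, power in enumerate(powers):
    print(f"stage {stage}: mean signal power = {power:.4f}")
```

With independent, zero-mean inputs each scaled addition roughly halves the mean signal power, starting from about 1/3 for the uniform input distribution.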
The combination of the effects of section III (high inherent noise at low
amplitudes) and section IV (decreasing signal power after multiple
stages) will lead to low signal-to-noise power ratios (SNR) in
multi-stage systems. Here, the noise power Pnoise is defined
by the variance σ²_sig due to the different noise sources of
section III. This SNR decrease is illustrated in figure 8b, where
the SNR in the multi-stage adder system of figure 5a is plotted
at the output of every stage. Note that the mean SNR clearly
scales with L and drops after several stages S.
V. ACCURACY EVALUATION OF MULTI-STAGE SC
CIRCUITS
Using the results of the previous sections, a general method
to evaluate SC’s accuracy in circuits suffering from type I and
type III variations can be summarized in a methodological
design strategy.
A. Methodological design strategy
To validate the accuracy of the SC system, the following
four-step methodology is proposed:
1) Evaluate the probability density function of the
output signal starting from a uniform input distribution.
This can be done numerically.
2) Compute output noise power by modelling inherent
noise in SC as a binomial process, and simulating
transient errors with an expected flip rate pt. When more
complex blocks, such as ×c (with c > 1) are used, both
the signal and the standard deviation of the noise at this
stage are multiplied by c.
3) Calculate the mean output SNR from the known
output and noise distributions.
4) Compare the achieved SNR/precision with the required
SNR/precision and choose the SC stream length L. This
precision will be application dependent.
[Figure 10: mean SNR [dB] versus stage number [−], stages 0 to 15. (a) Curves for L = 2^8, L = 2^12 and L = 2^16, with the lower limit on SNR after 14 stages. (b) Curves for pt = 0, pt = 1e−3, pt = 1e−2 and pt = 1e−1, with the SNR lower limit.]
Fig. 10. SNR after every stage in path II of figure 9. (a) SNR for different
L, no transient variations. (b) SNR for L = 2^16 for different pt.
[Figure 11: output images of the JPEG compression, panels (a) to (h).]
Fig. 11. JPEG compression results. Only the L = 2^16 circuit achieves high accuracy, both visually and in terms of compression ratio (CR) and RMSE.
Figures a-d show the results without transient variations. (a) Ideal binary JPEG compression. (b) SC L = 2^8. (c) SC L = 2^12. (d) SC L = 2^16. Figures e-h
show the results for different flip rates pt. (e) Binary compression at pt = 1e−3. (f) SC L = 2^16 at pt = 1e−3. (g) SC L = 2^16 at pt = 1e−2. (h)
SC L = 2^16 at pt = 1e−1. At pt = 1e−3, the binary accuracy is already unacceptable, while the SC implementation can withstand a pt = 1e−2. This
was predicted by our method.
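The four steps of the methodology can be sketched in Python as follows. This is not the evaluation code used for figures 8 and 10, so its numbers are not expected to match them. The assumptions are: a chain of scaled adders only, an inherent noise variance of (1 − s²)/L for a bipolar stream of length L (binomial model), transient flips modelled as an attenuation of (1 − 2·pt) per stage following equation 11, and the mean SNR taken as the ratio of mean signal power to mean noise power. All names and the example requirement of 20 dB after two stages are hypothetical.

```python
import math
import random

def propagate(samples, stages, pt, rng):
    """Step 1: push samples through `stages` scaled adders (sa + sb) / 2.

    A transient flip rate pt attenuates every stage output by (1 - 2 * pt).
    """
    n = len(samples)
    gain = 1.0 - 2.0 * pt
    for _ in range(stages):
        samples = [gain * (samples[rng.randrange(n)] + samples[rng.randrange(n)]) / 2.0
                   for _ in range(n)]
    return samples

def mean_snr_db(samples, length):
    """Steps 2 and 3: binomial noise variance (1 - s^2) / L, then mean SNR in dB."""
    signal = sum(s * s for s in samples) / len(samples)
    noise = sum((1.0 - s * s) / length for s in samples) / len(samples)
    return 10.0 * math.log10(signal / noise)

def choose_length(required_db, stages, candidates, pt=0.0, n_samples=20_000, seed=0):
    """Step 4: smallest candidate L whose mean output SNR meets the requirement."""
    rng = random.Random(seed)
    inputs = [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]
    outputs = propagate(inputs, stages, pt, rng)
    for length in sorted(candidates):
        if mean_snr_db(outputs, length) >= required_db:
            return length
    return None  # no candidate length is sufficient

chosen = choose_length(required_db=20.0, stages=2, candidates=[2**8, 2**12, 2**16])
print("chosen stream length L =", chosen)
```

Blocks such as ×c would need an extra rule that scales both the signal and the noise standard deviation by c, as described in step 2; this is left out of the sketch.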
B. Accuracy evaluation of 1D-DCT stochastic block
As a practical example we perform the proposed accuracy
analysis on the complex DCT of figure 9. This block is a
classical DCT implementation based on the work of Hou [33].
This DCT block contains several paths with different numbers
of stages and is a part of a JPEG encoder. In the algorithm,
quantization and inverse decoding are performed in an ideal
way. We will discuss two data-paths in the DCT block. First,
the shortest path X(1) to Y(1) (path I), indicated on figure 9
as a blue line, consisting of three stages (S = 3). Second,
the longest path X(8) to Y(8) (path II), indicated as a red
line on the figure, consisting of fourteen stages (S = 14).
The required output precision depends on the implemented al-
gorithm. If a full precision DCT-block is wanted, the accuracy
requirements will be high. However, in the JPEG compression
algorithm, the outputs of the 2D-DCT blocks are quantized.
Due to this quantization, the required output precision of paths
I and II is 4 bit (6.02 · n ≈ 24dB SNR) and 2 bit
binary precision (12dB SNR), respectively.
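The conversion between binary precision and SNR used above follows the rule of roughly 6.02 dB per bit, which is 20 · log10(2). A small helper makes the arithmetic explicit; the function names are hypothetical.

```python
import math

DB_PER_BIT = 20.0 * math.log10(2.0)  # about 6.02 dB per bit of precision

def bits_to_snr_db(n_bits):
    """SNR in dB that corresponds to n bits of binary precision."""
    return DB_PER_BIT * n_bits

def snr_db_to_bits(snr_db):
    """Smallest whole number of bits whose SNR is at least snr_db."""
    return math.ceil(snr_db / DB_PER_BIT)

print(bits_to_snr_db(4), bits_to_snr_db(2))
```

For n = 4 and n = 2 this gives approximately the 24dB and 12dB targets of paths I and II.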
Path I consists of three stages of scaled stochastic adders.
This is the same circuit as the one from section III and
figure 5a, so the results of this analysis can be used directly. It
is clear from figure 8b that both the L = 2^12 and the L = 2^16
implementations achieve more than 24dB SNR at S = 3.
An L = 2^12 implementation thus suffices for path I. Path
II is more complex and contains fourteen stages of different
SC circuitry, including ×2 blocks. Its precision requirements,
however, are somewhat weaker (12dB). Figure 10a shows
the results of our accuracy assessment design methodology on
this path when no transient circuit variations are present. The
mean SNR is computed after every stage for L = 2^8, L = 2^12
and L = 2^16 implementations.
The figure shows that only the L = 2^16 implementation
achieves better than 12dB mean SNR at the last stage. The
L = 2^8 and L = 2^12 implementations do not suffice. Observe
that the mean SNR is still relatively high when only a few
stages are used. The fact that only the L = 2^16 implementation
is sufficiently accurate is further illustrated in figure 11a-d,
where the full results of the JPEG compression with stochastic
DCT-blocks are shown and compared. The accuracy of the
implementations can be verified visually or more formally by
comparing the achieved compression ratios (CR) and RMSE
deviations from the uncompressed picture. Since the channel
Y(8) represents high spatial frequencies, high noise levels on
this channel will introduce non-existent high frequency terms
that cannot be compensated for in the JPEG quantization step.
This explains the visually noisy images. Only the accuracy of
the L = 2^16 implementation is reasonable and well in range of
the ideal version, as was predicted by our accuracy evaluation
method.
We can repeat the same analysis for a system suffering
from severe transient circuit variations. The deviation due to
transient errors leads to an extra decrease in signal power
and associated SNR. As long as this added effect is small
enough, the stochastic implementation will suffice. This is
again illustrated for path II in figure 10b, where SNR is
plotted as a function of the stage S and the transient bit-flip
rate pt. Even at the very high pt = 1e−2, the SC
implementation achieves high accuracy if L = 2^16. SNR
only drops slightly faster than in the case without transient
variations. At pt = 1e−3 the difference is only 0.24dB after
14 stages.
TABLE I
RMSE IN JPEG FOR DIFFERENT ERROR RATES AND IMPLEMENTATIONS
Implementation   L = 2^8   L = 2^12   L = 2^16   Binary
pt = 1e-1        47.9%     47.1%      14.5%      41.7%
pt = 1e-2        38.6%     12.9%      3.4%       31.1%
pt = 1e-3        37.9%     12.8%      2.3%       12.1%
pt = 1e-5        37.8%     12.7%      2.3%       2.5%
pt = 0           37.7%     12.7%      2.3%       2%
The effects of these transient variations are further
illustrated in figure 11e-h, where the output accuracy of a
binary and an SC implementation is compared. For example,
in figure 11e and 11f, all logical nodes are flipped at a rate of
pt = 1e−3. In the binary implementation, this leads to severe
accuracy degradation, both visually and formally in terms of
CR and RMSE. No significant accuracy degradation is seen
in the SC implementation. An L = 2^16 is still needed to cope
with SC's inherent inaccuracy, but the added noise due to the
transient variations is negligible and not visible. An SC of
length L = 2^16 does not suffice any more when a pt = 1e−1
is applied (figure 11h), as was predicted by our method in
figure 10b. Table I gives an overview of the achieved RMSE
for every JPEG implementation. Observe that the accuracy of
SC systems is ultimately limited by their number length L. In
the case of a transient error rate pt = 1e−3, SC is the better
choice in terms of accuracy. The binary implementation has
12.1% RMSE, while SC can still achieve high accuracy (2.3%
RMSE) at this error rate.
C. Energy evaluation of 1D-DCT stochastic block
Previous system level evaluation shows that a minimal
L = 2^16 is needed for JPEG compression. This high L will
lead to a high energy dissipation in the JPEG implementation
since E scales linearly with L (equation 3). To illustrate
this, we choose an example implementation operating near
the minimum energy point of the SC circuitry. If a 1D-DCT
delay of 1000ns is needed, the SC system should operate at
150 MHz at a P = 512 (equation 2). Tables II and III show
the estimated energy dissipation and circuit area for different
implementations of the DCT block, in 40nm CMOS and
26nm TFET [30]. These estimations only include the energy
usage of the combinational arithmetic, not of any flip-flops that
are needed for data-path synchronization. Added flip-flops will
come with a larger increase in energy usage in SC due to the
high number of needed switches.
TABLE II
ENERGY DISSIPATION IN DIFFERENT 40NM DCT IMPLEMENTATIONS
Implementation     L = 2^8   L = 2^12   L = 2^16   Binary
Parallelism        P = 2     P = 32     P = 512    -
area [-]           0.95e3    15.3e3     244e3      15e3
f [MHz]            150       150        150        17
Etotal [fJ]        4.5e3     72.3e3     1156.7e3   2.7e3
Erelative [-]      1.67      26.72      427.52     1
RMSE @ pt < 1e-6   37.7%     12.7%      2.3%       2%
TABLE III
ENERGY DISSIPATION IN DIFFERENT TFET [30] DCT IMPLEMENTATIONS
Implementation     L = 2^8   L = 2^12   L = 2^16   Binary
Parallelism        P = 2     P = 32     P = 512    -
area [-]           0.95e3    15.3e3     244e3      15e3
frelative [-]      150       150        150        17
Erelative [-]      1.01      16.16      258.6      1
RMSE @ pt = 1e-2   38.6%     12.9%      3.4%       31.1%
RMSE @ pt = 1e-3   37.9%     12.8%      2.3%       12.1%
RMSE @ pt = 1e-5   37.8%     12.7%      2.3%       2.5%
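The parallelization degrees used in this example can be checked with a short calculation. Equation 2 is not repeated in this section, so the sketch assumes that the latency of one SC operation equals L / (P · f), and that P is restricted to powers of two as in the examples above. The function name is hypothetical.

```python
import math

def required_parallelism(length, clock_hz, max_delay_s):
    """Smallest power-of-two P with length / (P * clock_hz) <= max_delay_s."""
    minimum = length / (clock_hz * max_delay_s)
    if minimum <= 1.0:
        return 1
    return 2 ** math.ceil(math.log2(minimum))

for exponent in (8, 12, 16):
    p = required_parallelism(2 ** exponent, 150e6, 1000e-9)
    print(f"L = 2^{exponent}: P = {p}")
```

At 150 MHz and a delay budget of 1000ns this yields P = 2, 32 and 512 for L = 2^8, 2^12 and 2^16, in line with the example above.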
We discuss energy consumption both in the simulated 40nm
technology and in emerging technologies.
1) 40 nm technology: Using Spice, we can simulate the
energy per bit-operation for every SC arithmetic block. Operating
near the minimum energy point in a 40nm technology
they consume kMUX = 0.18, kXNOR = 0.13, k×2 = 1.41
and kINV = 0.0625 fJ/bit-operation. Observe the high
energy cost of the ×2 block [19]. This is in contrast to a binary
implementation, where this block is essentially free. Table II
shows the derived energy dissipation for the full DCT-circuit. It
is clear that even for the L = 2^8 version, the energy-efficiency
of SC is much lower than that of the binary DCT. An implementation
using words of L = 2^12 (at P = 32) consumes the same area
as the conventional implementation. Note that the L = 2^16
energy consumption is worst-case. Several paths in the DCT
can be implemented using L = 2^12 or L = 2^10, making the
energy gap smaller in a real implementation. For the realistic
pt ≈ 0 in 40nm CMOS, the achieved RMSE and energy
dissipation are however lowest in the binary implementation.
There is no incentive to opt for a stochastic system in this
case.
2) Emerging technologies: Relying on TFET models found
in literature [28] [30], we can estimate the corresponding
relative energy dissipation in a TFET technology (table III).
In TFET, a realistic pt lies between 1e−2 and 1e−5. This
estimation is based on the work of [34], which compares
delay variations under influence of RTN in advanced CMOS
technologies and on [27] which gives numbers for TFET.
According to [34], 45nm CMOS technology has less than 5%
delay variation due to RTN in large data paths. [27] shows
TFET transistors can have up to 300% ∆ID/ID due to RTN.
For error rates pt = 1e−3 and higher the binary implemen-
tation is no longer a solution, since it cannot achieve sufficient
accuracy. For these error rates only SC is accurate, albeit at a
high energy cost.
Note that the energy gap between SC and binary slightly
decreases in TFET compared to CMOS. In emerging technolo-
gies, the energy coefficients k and c are predicted to be much
lower than in current CMOS (see section II-C). The binary
minimum energy coefficient c (section III-D) will however
increase relative to k. This relative increase of c has
two causes. First, c increases relative to k due to the
difference in slope of the energy-delay curves between CMOS
and TFET technologies [30]. Second, there is a relative in-
crease of c due to the higher spatial variations in TFET than in
CMOS technology [28]. More static variations lead to a higher
c, as our simulations of figure 7c show. The energy
consumption of the binary multiplier increases drastically with
increasing spatial variations (c increases), while the energy
consumption of the SC implementation stays close to the
nominal value (k remains similar). Both the absolute energy
penalty and the relative energy penalty compared to binary of
SC systems will thus decrease in emerging technologies.
It is hence clear that only applications with a limited number
of circuit stages (low L) can benefit from SC technology at
high energy efficiency. SC’s main advantage in emerging tech-
nologies therefore remains its inherent robustness to transient
errors.
VI. CONCLUSION
Stochastic computing is a promising circuit technology, as it
is robust against soft errors. However, it should not be considered
a low-energy alternative to binary arithmetic. This is due
to SC’s inherent accuracy loss. To analyse this accuracy, two
effects should be considered: the occurrence of noise and the
decrease of signal power in multi-stage SC systems. This paper
carefully analysed and formalized these effects, resulting in a
methodological design flow and energy-efficiency estimation
methodology for multi-staged stochastic circuits.
This paper categorizes and discusses three types of noise.
First, SC is inherently inaccurate due to randomization effects,
even if constant, near-exact number generators are used. We
demonstrate that after several stages, the inherent noise in
SC can be modelled as a binomial process, which has noise
levels that are much higher than the inherent quantization
noise in binary systems. Second, spatial circuit variations can
lead to errors. They can be tuned out by carefully balancing
the system's supply voltage. Third, there are transient circuit
variations. This type of noise only leads to very limited
distortion in SC, while it strongly affects traditional binary
computation. When transient circuit variations are present, SC
will greatly outperform binary implementations. The paper
further explained how multi-stage SC circuits decrease mean
signal power and that variance is highest at low bipolar
amplitudes. This combination leads to low SNR at SC outputs.
This can be compensated by using longer bit-streams, leading
to higher energy dissipation.
To correctly assess these combined effects, we formalized
this noise and signal assessment towards a multi-stage SC
design methodology. The methodology has been validated and
tested on a multi-stage DCT block as part of a JPEG encoder.
This analysis shows stochastic computing can be an alterna-
tive to binary in emerging technologies suffering from severe
transient circuit variations. However, this can be achieved at a
limited energy penalty only in applications with a limited
number of stages or relaxed RMSE requirements.
REFERENCES
[1] N. Shanbhag, “Stochastic computation,” Design Automation Conference
(DAC), 2010.
[2] R. Gaines, “Stochastic computing systems,” Advances in information
systems science, 1969.
[3] ——, “Stochastic computing,” Proc. AFIPS Spring Joint Computer
Conf., 1967.
[4] J. Von Neumann, “Probabilistic logics and the synthesis of reliable
organisms from unreliable components,” Automata studies, 1956.
[5] A. Alaghi and J. Hayes, “Fast and accurate computation using stochastic
circuits,” Design, automation and test in europe (DATE), 2014.
[6] A. Naderi, S. Mannor, M. Sawan, and W. Gross, “Delayed stochastic
decoding of ldpc codes,” IEEE transactions on Signal Processing, 2011.
[7] A. Alaghi, C. Li, and J. Hayes, “Stochastic circuits for real-time image
processing applications,” Design automation conference (DAC), 2013.
[8] P. Li and D. Lilja, “Using stochastic computing to implement digital im-
age processing,” International Conference on Computer Design (ICCD),
2011.
[9] H. Aliee and H. Zarandi, “Fault tree analysis using stochastic logic:
a reliable and high speed computing,” Reliability and maintainability
Symposium, 2011.
[10] ——, “A fast and accurate fault tree analysis based on stochastic logic
implemented on field-programmable gate array,” IEEE transactions on
reliability, 2013.
[11] Y.-N. Chang, “Architectures for digital filters using stochastic com-
puting,” International Conference on Acoustics, Speech and Signal
Processing (ICASSP), 2013.
[12] N. Saraf, K. Bazargan, D. J. Lilja, and M. D. Riedel, “Iir filters using
stochastic arithmetic,” Design, automation and test in europe (DATE),
2014.
[13] A. Alaghi and J. Hayes, “A spectral transform approach to stochastic
circuits,” International conference on computer design (ICCD), 2012.
[14] W. Qian and M. Riedel, “The synthesis of robust polynomial arithmetic
with stochastic logic,” Design Automation Conference (DAC), 2008.
[15] ——, “An architecture for fault-tolerant computation with stochastic
logic,” IEEE transactions on computers, 2011.
[16] A. Alaghi, “Survey of stochastic computing,” ACM transactions on
embedded computing, 2012.
[17] P. Jeavons et al., “Generating binary sequences for stochastic comput-
ing,” IEEE transactions on Information Theory., 1994.
[18] B. Zelkin, “Arithmetic unit using stochastic data processing,” Patent US
6,745,219 B1.
[19] B. Brown and H. Card, “Stochastic neural computation i : computational
elements,” IEEE transactions on computers, 2001.
[20] S. L. Toral et al., “Stochastic pulse coded arithmetic,” International
Symposium on Circuits and Systems (ISCAS), 2000.
[21] S. Borkar, T. Karnik, S. Narendra, J. Tschanz, A. Keshavarzi, and V. De,
“Parameter variations and impact on circuit and microarchitecture,”
Design Automation Conference (DAC), 2003.
[22] A. Dixit and A. Wood, “The impact of new technology on soft error
rates,” International Reliability Physics Symposium (IRPS), 2011.
[23] A. Fantini, L. Goux, R. Degraeve, D. J. Wouter, N. Raghavan, G. Kar,
A. Belmonte, Y.-Y. Chen, B. Govoreanu, and M. Jurczak, “Intrinsic
switching variability in hfo2 rram,” IEEE international memory work-
shop (IMW), 2013.
[24] I. Kazi, P. Meinerzhagen, P.-E. Gaillardon, D. Sacchetto, A. Burg,
and G. De Micheli, “A reram-based non-volatile flip-flop with sub-vt
read and cmos voltage-compatible write,” New Circuits and Systems
Conference (NEWCAS), 2013.
[25] M. R. Choudhury, Y. Yoon, J. Guo, and K. Mohanram, “Graphene
nanoribbon fets: Technology exploration for performance and reliabil-
ity,” IEEE transactions on nanotechnology, vol. 10, no. 4, 2011.
[26] S. Datta, H. Liu, and V. Narayanan, “Tunnel fet technology: A reliability
perspective,” Microelectronics Reliability, vol. 54, 2014.
[27] M.-L. Fan, S.-Y. Yang, V. Pi-Ho, Y.-N. Chen, P. Su, and C.-T. Chuang,
“Single-trap-induced random telegraph noise for finfet, si/ge nanowire
fet, tunnel fet, sram and logic circuits,” Microelectronics Reliability,
vol. 54, 2014.
[28] U. Avci, D. H. Morris, S. Hasan, and R. Kotlyar, “Energy efficiency
comparison of nanowire heterojunction tfet and si mosfet at lg=13nm,
including p-tfet and variation considerations,” IEDM, 2013.
[29] V. Saripalli, K. A. Mishra, S. Datta, and V. Narayanan, “An energy-
efficient heterogenous cmp based on hybrid tfet cmos-cores,” Design
Automation Conference (DAC), 2011.
[30] S. Datta, R. Bijesh, H. Liu, D. Mohata, and V. Narayanan, “Tunnel tran-
sistors for energy efficient computing,” Reliability Physics Symposium
(IRPS), 2013.
[31] A. Alaghi and J. Hayes, “Exploiting correlation in stochastic circuit
design,” International Conference on Computer Design (ICCD), 2013.
[32] C. Ma et al., “Understanding variance propagation in stochastic com-
puting systems,” International Conference on Computer Design (ICCD),
2012.
[33] H. Hou, “A fast recursive algorithm for computing the dct,” IEEE
transactions on Acoustics, speech and signal processing (ICASSP), 1987.
[34] H. Luo et al., “Temporal performance degradation under rtn: evaluation
and mitigation for nanoscale circuits,” IEEE Computer Society Annual
Symposium on VLSI, 2012.
Bert Moons was born in Antwerp, Belgium, in
1991. He received the B.S. degree in electrical engi-
neering from the KU Leuven, Leuven, Belgium, in
2011, and the M.S. degree from the same university
in 2013. He joined the ESAT-MICAS laboratories
in 2013 as a research assistant after he received a
grant from the Flemish agency for innovation by
science and technology (IWT). In 2014, he received
the Resmiq student paper award at the New Circuits
and Systems conference (NEWCAS). He is currently
working towards a PhD degree on context-aware and
run-time adaptable digital circuits for error-tolerant processing in low power
applications.
Marian Verhelst received the PhD degree in elec-
trical engineering from the KU Leuven, Leuven,
Belgium in 2008. In 2005 she resided for 3 months
at the Berkeley Wireless Research Centre (BWRC) at
UC Berkeley. From 2008-2011 she worked for Intel
Labs in Portland, OR, USA. In the Wireless Com-
munications Research Lab she worked on digitally-
enhanced analog and RF circuits for performance
enhancement, self-test and self-calibration. In 2012
she returned to Belgium and became a professor
at the ESAT-MICAS group of KU Leuven. Her
research group focusses on smart, self-adaptive system architectures and
circuits for ubiquitous sensing and computing.
