VLSI Implementation of Deep Neural Network Using Integral Stochastic
  Computing by Ardakani, Arash et al.
VLSI Implementation of Deep Neural Network
Using Integral Stochastic Computing
Arash Ardakani, Student Member, IEEE, François Leduc-Primeau, Naoya Onizawa, Member, IEEE,
Takahiro Hanyu, Senior Member, IEEE and Warren J. Gross, Senior Member, IEEE
Abstract— The hardware implementation of deep neural net-
works (DNNs) has recently received tremendous attention: many
applications in fact require high-speed operations that suit a
hardware implementation. However, numerous elements and
complex interconnections are usually required, leading to a large
area occupation and copious power consumption. Stochastic com-
puting has shown promising results for low-power area-efficient
hardware implementations, even though existing stochastic al-
gorithms require long streams that cause long latencies. In this
paper, we propose an integer form of stochastic computation and
introduce some elementary circuits. We then propose an efficient
implementation of a DNN based on integral stochastic computing.
The proposed architecture has been implemented on a Virtex7
FPGA, resulting in 45% and 62% average reductions in area and
latency compared to the best reported architecture in literature.
We also synthesize the circuits in a 65 nm CMOS technology
and we show that the proposed integral stochastic architecture
results in up to 21% reduction in energy consumption compared
to the binary radix implementation at the same misclassification
rate. Due to the fault-tolerant nature of stochastic architectures, we
also consider a quasi-synchronous implementation which yields
33% reduction in energy consumption w.r.t. the binary radix
implementation without any compromise on performance.
Index Terms— Deep neural network, machine learning, hard-
ware implementation, integral stochastic computation, pattern
recognition, Very Large Scale Integration (VLSI).
I. INTRODUCTION
Recently, the implementation of biologically-inspired ar-
tificial neural networks such as the Restricted Boltzmann
Machine (RBM) has aroused great interest due to their high
performance in approximating complicated functions. A va-
riety of applications can benefit from them, in particular
machine learning algorithms. They can be split into two phases,
which are referred to as learning and inference phases [2]. The
learning engine finds a proper configuration to map learning
input data into their desired outputs, while the inference engine
uses the extracted configuration to compute outputs for new
data.
Deep neural networks, especially Deep Belief Networks
(DBN), have shown state-of-the-art results on various com-
puter vision and recognition tasks [3]–[8]. DBN can be formed
by stacking RBMs on top of each other to construct a deep
network, as shown in Fig. 1 [4]. RBMs used in DBN are pre-
trained using Gradient-based Contrastive Divergence (GCD)
algorithms, followed by gradient descent and backpropagation
algorithms for classification and fine-tuning the results [4], [5].
A preliminary version of this paper was published in [1].
[Fig. 1 diagram: an input layer, layers 1 to N and an output layer,
connected by weights W1 to WN+1.]
Fig. 1. An N-layer DBN, where W and N denote the weights of each layer
and the number of layers, respectively.
In the past few years, general purpose processors have been
mainly used for software realization of both training and infer-
ence engines of DBN. However, large power consumption and
high resource utilization have pushed researchers to explore
ASIC and FPGA implementations of neural networks. Rapid
expansion of devices and sensors connected to the internet of
things (IoT) makes it possible to perform the training procedure once on
cloud servers equipped with Graphics Processing Units (GPUs),
and to extract the weights for use by the inference engine through the
IoT platforms. The inference engine can then be implemented
using ASIC or FPGA platforms.
DBNs are constructed of multiple layers of RBMs and a
classification layer at the end. The main computation kernel
consists of hundreds of vector-matrix multiplications followed
by non-linear functions in each layer. Since multiplications
are costly to implement in hardware, existing parallel or semi-
parallel VLSI implementations of such a network suffer from
high silicon area and power consumption [9]. The nonlinearity
function is also implemented using Look-Up Tables (LUTs),
requiring large memories. Moreover, hardware implementation
of this network results in large silicon area: this is caused by
the connections between layers, that lead to severe routing
congestion. Therefore, an efficient VLSI implementation of
DBN is still an open problem.
Recently, Stochastic Computing (SC) has shown promising
results for ultra low-cost and fault-tolerant hardware imple-
mentation of various systems [10]–[19]. Using SC, many
computational units have simple implementations. For instance,
using unipolar SC, the multiplication and addition are implemented
using an AND gate and a multiplexer, respectively
[20], [21]. However, the multiplexer-based adder introduces
a scaling factor that can cause a precision loss [22], resulting
in the failure of SC for deep neural networks, which require
many additions. An OR gate can provide a good approximation
to addition if its input values are small [21]. However, using
OR gates to perform addition in DBNs results in a huge
misclassification error compared to its fixed-point hardware
implementation. Therefore, an efficient stochastic implemen-
tation that maintains the performance of DBN is still missing.
In this paper, integral stochastic computation is introduced
to solve the precision loss issue of the conventional scaled
adder, while also reducing the latency compared to
conventional binary stochastic computation. A
novel Finite State Machine (FSM)-based tanh function is then
proposed as the nonlinearity function used in DBN. Finally,
an efficient stochastic implementation of DBN based on the
aforementioned techniques with an acceptable misclassifica-
tion error is proposed, resulting in 45% smaller area on average
compared to the state-of-the-art stochastic architecture.
A nanoscale memory-resistor (memristor) device is a non-
volatile digital memory, which consumes substantially less
energy compared to CMOS and can be scaled to sizes below
10 nm [23]. A challenging problem with memristor devices
is the presence of significant random variations. A promising
approach for dealing with the non-determinism of memristors
is to design SC systems that are fault-tolerant [23]. In this
paper, we show that the proposed architectures can tolerate a
fault rate of up to 16% when timing violations are allowed to
occur, making them suitable for memristor devices.
The manuscript can be divided into two major parts: the proposed
algorithms and their hardware implementation results.
In the first part, we analyze elementary computational units.
Also, some simulation results and examples are provided to
shed light on the proposed algorithm in comparison with the
existing methods. In the second part, design aspects of a deep
neural network based on the proposed method are studied
and some implementation results under different conditions
are provided.
The rest of this paper is organized as follows. Section II pro-
vides a review of SC and its computational elements. Section
III introduces the proposed integral stochastic computation and
operations in this domain. Section IV describes the integral
stochastic implementation of DBN. Implementation results of
the proposed architecture are provided in Section V. In this
section, the performance of the stochastic implementation is
also studied when the circuit is affected by timing violations. Note
that accepting occasional timing violations allows the supply
voltage to be reduced, which can improve the energy efficiency of
the system. In Section VI, we conclude the paper and discuss
future research.
(a) AND gate, unipolar format:
A: 1,0,1,0,0,0,0,0 (2/8)
B: 1,0,0,1,0,1,0,1 (4/8)
Y: 1,0,0,0,0,0,0,0 (1/8)
(b) XNOR gate, bipolar format:
A: 1,0,1,0,0,0,0,0 (-4/8)
B: 1,0,0,1,0,1,0,1 (0)
Y: 1,1,0,0,1,0,1,0 (0)
Fig. 2. Stochastic multiplications using (a) AND gate in unipolar format and
(b) XNOR gate in bipolar format
II. STOCHASTIC COMPUTING AND ITS COMPUTATIONAL
ELEMENTS
In stochastic computation, numbers are represented as se-
quences of random bits. The information content of the
sequence does not depend on the particular value of each bit,
but rather on their statistics. Let us denote by X ∈ {0, 1} a bit
in the random sequence. To represent a real number x ∈ [0, 1],
we simply generate the sequence such that:
E[X] = x, (1)
where E[X] denotes the expected value of the random variable
X. This is known as the unipolar format. The bipolar format
is another commonly used format where x ∈ [−1, 1] is
represented by setting:
E[X] = (x+ 1)/2. (2)
Note that any real number can be represented in one of these
two formats by scaling it down to fit within the appropriate
interval. In this paper, we use upper case letters to represent
elements of a stochastic stream, while lower case letters
represent the real value associated with that stream. It is also
worth mentioning that a stochastic stream of a real value x is
usually generated by a linear feedback shift register (LFSR)
and a comparator. This unit is hereafter referred to as a binary
to stochastic converter (B2S) [24].
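As a minimal sketch of the B2S unit, the Python snippet below compares the target probability with a fresh pseudo-random number in every cycle, following (1) and (2). Python's `random` module stands in for the LFSR here, and the names `b2s` and `stream_value` are illustrative and do not come from the paper.

```python
import random

def b2s(x, length, rng, bipolar=False):
    """Binary-to-stochastic conversion: one comparison per clock cycle."""
    # Probability of a 1: Eq. (2) for bipolar, Eq. (1) for unipolar.
    p = (x + 1) / 2 if bipolar else x
    return [1 if rng.random() < p else 0 for _ in range(length)]

def stream_value(bits, bipolar=False):
    """Real value represented by a stream (inverse of Eq. (1) or Eq. (2))."""
    mean = sum(bits) / len(bits)
    return 2 * mean - 1 if bipolar else mean

rng = random.Random(0)  # assumption: a software PRNG replaces the LFSR
uni = b2s(0.25, 4096, rng)
bip = b2s(-0.5, 4096, rng, bipolar=True)
print(stream_value(uni), stream_value(bip, bipolar=True))
```

The printed estimates fluctuate around 0.25 and -0.5; the fluctuation shrinks as the stream gets longer, which is the latency cost discussed in the abstract.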
A. Multiplication In SC
Multiplication of two stochastic streams is performed us-
ing AND and XNOR gates in unipolar and bipolar encoding
formats, respectively, as illustrated in Fig. 2(a) and 2(b). In
unipolar format, the multiplication of two input stochastic
streams of A and B is computed as:
Y = AND (A,B) = A ·B, (3)
where "·" denotes bit-wise AND. If the input sequences
are independent, we have:
y = E[Y ] = a× b. (4)
Multiplications in bipolar format can be performed as:
Y = XNOR (A,B) = OR (A ·B, (1−A) · (1−B)) , (5)
E[Y ] = E[A ·B] + E[(1−A) · (1−B)]. (6)
(a) MUX with select input S:
A: 1,0,1,0,1,1,1,1 (6/8)
B: 1,0,0,0,0,0,1,0 (2/8)
S: 1,0,0,1,0,1,0,1 (4/8)
Y: 1,0,1,0,1,0,1,0 (4/8)
(b) OR gate:
A: 1,0,1,0,0,0,0,0 (2/8)
B: 0,1,0,0,0,1,0,1 (3/8)
Y: 1,1,1,0,0,1,0,1 (5/8)
Fig. 3. Stochastic additions using (a) MUX and (b) OR gate
If the input streams are independent,
E[Y ] = E[A]× E[B] + E[1−A]× E[1−B]. (7)
By simplifying the above equation, we have:
y = 2E[Y ]− 1 = (2E[A]− 1)× (2E[B]− 1) . (8)
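The two multipliers can be checked on the example streams A = 1,0,1,0,0,0,0,0 and B = 1,0,0,1,0,1,0,1 of Fig. 2. The following Python sketch is only an illustration; the function names are ours.

```python
def and_multiply(a, b):
    """Unipolar multiplication, Eq. (3): bit-wise AND."""
    return [x & y for x, y in zip(a, b)]

def xnor_multiply(a, b):
    """Bipolar multiplication, Eq. (5): bit-wise XNOR."""
    return [1 - (x ^ y) for x, y in zip(a, b)]

def unipolar_value(bits):
    return sum(bits) / len(bits)

def bipolar_value(bits):
    return 2 * sum(bits) / len(bits) - 1

A = [1, 0, 1, 0, 0, 0, 0, 0]
B = [1, 0, 0, 1, 0, 1, 0, 1]

y_uni = and_multiply(A, B)
y_bip = xnor_multiply(A, B)
print(y_uni, unipolar_value(y_uni))   # compare with a * b in Eq. (4)
print(y_bip, bipolar_value(y_bip))    # compare with Eq. (8)
```

For these particular streams the output values match the products of the input values exactly. In general the match only holds in expectation and requires independent input streams.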
B. Addition In SC
Additions in SC are usually performed by using either
scaled adders or OR gates [20], [21]. The scaled adder uses a
multiplexer (MUX) to perform addition. The output of a MUX
Y is given by
Y = A · S +B · (1− S) . (9)
As a result, the expected value of Y would be (E[A]+E[B])/2
when the select signal S is a stochastic stream with probability
of 0.5, as illustrated in Fig. 3(a). This 2-input scaled adder
ensures that its output is in the legitimate range of each
encoding format by scaling it down by a factor of 2. Therefore,
L-input addition can be performed by using a tree of multiple
2-input MUXs. In general, the result of an L-input scaled adder
is scaled down L times, which can decrease the precision of
the stream. To achieve the desired accuracy, longer bit-streams
must be used, resulting in larger latency.
OR gates can also be used as approximate adders as shown
in Fig. 3(b). The output Y of an OR gate with inputs A, B
can be expressed as
Y = A+B −A ·B. (10)
OR gates function as adders only if E[AB] is close to 0.
Therefore, the inputs should first be scaled down to ensure that
the aforementioned condition is met. This type of adder still
requires long bit-streams to overcome a precision loss incurred
by the scaling factor.
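As a small illustration of (9) and (10), the sketch below applies a MUX-based scaled adder and an OR-based adder to the example streams of Fig. 3. The helper names are ours, and the MUX follows the input assignment written in (9).

```python
def mux_add(a, b, s):
    """Scaled adder, Eq. (9): Y = A*S + B*(1 - S)."""
    return [x if sel else y for x, y, sel in zip(a, b, s)]

def or_add(a, b):
    """Approximate adder, Eq. (10): Y = A + B - A*B."""
    return [x | y for x, y in zip(a, b)]

def value(bits):
    return sum(bits) / len(bits)

# Streams of Fig. 3(a)
A = [1, 0, 1, 0, 1, 1, 1, 1]   # 6/8
B = [1, 0, 0, 0, 0, 0, 1, 0]   # 2/8
S = [1, 0, 0, 1, 0, 1, 0, 1]   # 4/8, select stream
y_mux = mux_add(A, B, S)

# Streams of Fig. 3(b)
A2 = [1, 0, 1, 0, 0, 0, 0, 0]  # 2/8
B2 = [0, 1, 0, 0, 0, 1, 0, 1]  # 3/8
y_or = or_add(A2, B2)

print(value(y_mux), (value(A) + value(B)) / 2)
print(value(y_or), value(A2) + value(B2))
```

The bit pattern of `y_mux` differs from the Y stream printed in Fig. 3(a), which appears to route the inputs the other way round, but the represented value is the same. The OR result is exact here only because the two inputs never carry a 1 in the same cycle.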
To overcome this precision loss, which could potentially
lead to inaccurate results, the Accumulative Parallel Counter
(APC) is proposed in [22]. The APC takes N parallel bits as
inputs and adds them to a counter in each clock cycle of the
system. Therefore, this adder results in lower latency due to
its small variance of the sum. It is also worth mentioning that
this adder converts the stochastic stream to binary form [22].
Therefore, this adder is restricted to cases where the additions
produce the final result, or where an intermediate result is
required in binary format.
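A behavioural sketch of the APC idea is given below: in each cycle the number of ones across the parallel inputs is added to a counter. This is our simplified reading of the description above, not the circuit of [22].

```python
def apc(streams):
    """Accumulate, cycle by cycle, the number of ones across all inputs."""
    counter = 0
    for bits in zip(*streams):
        counter += sum(bits)   # parallel-counter output for this clock cycle
    return counter             # plain binary (integer) result

A = [1, 0, 1, 0, 1, 1, 1, 1]   # 6/8
B = [1, 0, 0, 0, 0, 0, 1, 0]   # 2/8
length = len(A)

total = apc([A, B])
print(total, total / length)   # binary count and the sum it represents
```

The result is an ordinary integer rather than a stream, which is why the APC cannot directly feed another stochastic unit.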
[Fig. 4 diagrams: linear chains of states S0, S1, ..., Sn-1 with
transitions labeled X and X'. (a) Output Y = 0 for states S0 to
Sn/2-1 and Y = 1 for states Sn/2 to Sn-1. (b) Output Y = 1 for
states S0 to Sn-G-1 and Y = 0 for states Sn-G to Sn-1.]
Fig. 4. State transition diagram of the FSM implementing (a) tanh and (b)
exponentiation functions
(a) Stochastic Stream X1: 1,0,1,0,1,1,1,1 (0.75)
    Stochastic Stream X2: 1,1,1,0,1,0,1,1 (0.75)
(b) Integer stochastic stream S = X1 + X2: 2,1,2,0,2,1,2,2
Fig. 5. (a) Stochastic representations of 0.75 and (b) Integer stochastic
representation of 1.5
C. FSM-Based Functions In SC
Hyperbolic tangent and exponentiation functions are com-
putations required by many applications. These functions are
implemented in the stochastic domain by using an FSM [25].
Fig. 4(a) and 4(b) show the state transition diagrams of the
FSMs implementing the tanh and exponentiation functions. The
FSM is constructed such that
$\tanh(nx/2) \approx E[\mathrm{Stanh}(n, X)]$, (11)
$\exp(-2Gx) \approx E[\mathrm{Sexp}(n, G, X)]$ for $x > 0$, (12)
where n denotes the number of states in the FSM, G the linear
gain of the exponentiation function and Y the stochastic output
sequence of the FSM. Stanh and Sexp denote the approximations
of tanh and exp in the stochastic domain. It is worth
mentioning that both input and output of the Stanh function
are in bipolar format, while the input and output of the Sexp
function are in bipolar and unipolar formats respectively.
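The behaviour of the tanh FSM of Fig. 4(a) can be sketched as a saturating up/down counter whose upper half of states outputs 1. In the Python sketch below, the initial state (the middle of the chain), the stream length and the use of Python's `random` module are our assumptions.

```python
import math
import random

def stanh(bits, n):
    """Saturating up/down counter over n states; the upper half outputs 1."""
    state = n // 2                          # assumed initial state
    out = []
    for x in bits:
        state += 1 if x else -1             # move right on 1, left on 0
        state = min(max(state, 0), n - 1)   # saturate at S0 and S(n-1)
        out.append(1 if state >= n // 2 else 0)
    return out

def bipolar_value(bits):
    return 2 * sum(bits) / len(bits) - 1

rng = random.Random(0)
x, n, length = 0.5, 8, 50000
# Bipolar input stream for x, Eq. (2)
X = [1 if rng.random() < (x + 1) / 2 else 0 for _ in range(length)]

estimate = bipolar_value(stanh(X, n))
print(estimate, math.tanh(n * x / 2))       # the two sides of Eq. (11)
```

The two printed numbers are close but not identical, since (11) is an approximation and the estimate is taken from a finite stream.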
III. PROPOSED INTEGRAL STOCHASTIC COMPUTING
A. Generation of Integer Stochastic Stream
An integer stochastic stream is a sequence of integer num-
bers which are represented by either 2’s complement or sign
and magnitude. The average value of this stream is a real
number s ∈ [0,m] for unipolar format and s ∈ [−m,m] for
bipolar format, where m ∈ {1, 2, . . . }. In other words, the
real value s is the summation of two or more binary stochastic
stream probabilities. For instance, 1.5 can be expressed as 0.75
+ 0.75. Each of these probabilities can be represented by a
conventional binary stochastic stream as shown in Fig. 5(a).
Therefore, the integer stochastic representation of 1.5 can be
readily achieved as a summation of generated binary stochastic
streams as illustrated in Fig. 5(b). In general, the integer
stochastic stream S representing the real value s is a sequence
with elements Si, i ∈ {1, 2, . . . , N}:
$S_i = \sum_{j=1}^{m} X_i^j$, (13)
where $X_i^j$ denotes each element of a binary stochastic
sequence representing a real value $x^j$. The expected value of
the integer stochastic stream is then given by
$s = E[S_i] = \sum_{j=1}^{m} x^j$. (14)
We can also generate integer stochastic streams in the
bipolar format. In that case, the elements Si of the stream
are given by:
$S_i = 2 \times \sum_{j=1}^{m} X_i^j - m$, (15)
and the value represented by the stream is
$s = E[S_i] = 2 \times \sum_{j=1}^{m} E[X_i^j] - m = 2 \times \sum_{j=1}^{m} x^j - m$. (16)
(a) Integer stochastic computational element, stream length 8:
X: 1,1,1,0,1,1,1,1 (0.875)
Stochastic streams of 0.5625: 1,0,1,0,1,0,1,1 and 0,1,0,0,1,0,1,1
S: 1,1,1,0,2,0,2,2 (9/16)
(b) Two stochastic computational elements in parallel:
Stochastic Stream X of 0.875: 1,1,1,0,1,1,1,1,1,1,1,1,1,0,1,1
Stochastic Stream Y of 0.5625: 1,0,1,0,1,0,1,1,0,1,0,0,1,0,1,1
Element 1: X(1:8): 1,1,1,0,1,1,1,1 and S(1:8): 1,0,1,0,1,0,1,1, output Y(1:8)
Element 2: X(9:16): 1,1,1,1,1,0,1,1 and S(9:16): 0,1,0,0,1,0,1,1, output Y(9:16)
Fig. 6. (a) Increasing the range value m of the integer stochastic stream
reduces computation latency. (b) Parallelized stochastic computation by a factor
of two.
Any real number can be approximated by using an integer
stochastic stream without prior scaling, as opposed to a
conventional stochastic stream, which is restricted to the [-1,
1] interval. In integral SC, computation on two streams with
different effective lengths is also possible, a property that
conventional SC does not provide. For instance, the representations
of 0.875 and 0.5625 require effective bit-stream lengths of
8 and 16, respectively, using conventional SC. Therefore,
an effective bit-stream length of 16 must be used to generate the
conventional stochastic bit-streams of these two numbers for
operations. However, the second number, which requires the higher
effective length, i.e., 0.5625 in this example, can be generated
by using the proposed integral SC with m = 2 as shown in
Fig. 6(a). In this case, a bit-stream length of 8 is used for
both numbers and operations can be performed with shorter
streams than in conventional SC. This technique potentially
reduces the latency of stochastic computations, making
integral SC suitable for throughput-intensive applications. It
is worth mentioning that integral SC is different from
conventional parallelized SC [26]. For the sake of clarity, the
aforementioned example is illustrated in Fig. 6(b) by using
conventional SC parallelized by a factor of two. The difference
is that if several copies of a binary SC system are
instantiated, the inputs still need to have the same effective
length.
(a) Binary radix multiplier, m = 2:
S1: 2,0,1,1,0,2,1,1 (8/8)
S2: 1,2,2,0,1,2,0,2 (10/8)
Y: 2,0,2,0,0,4,0,2 (10/8)
(b) Bit-wise AND or MUX with select input X:
S: 1,2,2,2,1,2,0,2 (12/8)
X: 1,0,0,1,0,0,0,0 (2/8)
Y: 1,0,0,2,0,0,0,0 (3/8)
Fig. 7. (a) Integer stochastic multiplier with m = 2 (b) Multiplication of
integer stochastic stream with binary stochastic bit-stream using AND gate or
MUX
In summary, a real number s ∈ [0,m] is first divided into
the summation of multiple numbers which are in [0, 1] interval.
Then, the integer stochastic stream of this number is generated
by using column-wise addition (see equations (13)-(14)). The
bipolar format of the integer stochastic stream is generated
in a similar way. Note that the binary to integer stochastic
converter is hereafter referred to as B2IS; it is composed
of m B2S converters followed by an adder, as shown in Fig. 5.
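The column-wise addition performed by the B2IS unit can be sketched in a few lines of Python. The example reuses the two binary streams of Fig. 5(a); the function names are illustrative.

```python
def b2is(binary_streams, bipolar=False):
    """Column-wise addition of m binary stochastic streams."""
    m = len(binary_streams)
    sums = [sum(column) for column in zip(*binary_streams)]   # Eq. (13)
    return [2 * s - m for s in sums] if bipolar else sums      # Eq. (15)

def mean(stream):
    return sum(stream) / len(stream)

X1 = [1, 0, 1, 0, 1, 1, 1, 1]   # 0.75
X2 = [1, 1, 1, 0, 1, 0, 1, 1]   # 0.75

S_uni = b2is([X1, X2])
S_bip = b2is([X1, X2], bipolar=True)
print(S_uni, mean(S_uni))   # unipolar stream and its value, Eq. (14)
print(S_bip, mean(S_bip))   # bipolar stream and its value, Eq. (16)
```

The unipolar result reproduces the stream S of Fig. 5(b), with mean 1.5. The bipolar stream built from the same bits has mean 2 × 1.5 − 2 = 1, as (16) predicts.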
B. Implicit Scaling of Integer Stochastic Stream
The integer stochastic representation of a real number s ∈
[0, 1] can also be generated by using an implicit scaling factor.
In this method, the expected value of the individual binary
streams is chosen as xj = s, and the value s represented by
the integer stream is given by
$s = E[S_i]/m$. (17)
This method avoids the need to divide s by m to obtain xj , and
can be easily taken into account in subsequent computations.
Data: Stochastic stream Xi ∈ {0, 1} where i ∈ {1, 2, ..., N}
Result: Yi
Counter ← Initial value;
for i ← 1 : N do
    Counter ← Counter + 2Xi - 1;
    if Counter > n-1 then
        Counter ← n-1;
    end
    if Counter < 0 then
        Counter ← 0;
    end
    if Counter > offset then
        Yi ← 1;
    else
        Yi ← 0;
    end
end
Algorithm 1: Pseudo code of the conventional algorithm for
FSM-based functions
For instance, a real number 9/16 can be represented using
an integer stream of length 8 with m = 2. We can set xj =
9/16 (with an implicit scaling factor of 1/2) and generate
two binary sequences of length 8. These sequences are then
added together to form the integer sequence S. We obtain
E[Si] = 9/8, which corresponds to s = 9/16 because of the
implicit scaling factor of 1/2 (see Fig. 6(a)).
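The 9/16 example can be reproduced with the two length-8 binary sequences shown in Fig. 6(a). The short Python check below uses exact fractions; the variable names are ours.

```python
from fractions import Fraction

# The two binary sequences of Fig. 6(a), each generated with xj = 9/16
X_a = [1, 0, 1, 0, 1, 0, 1, 1]
X_b = [0, 1, 0, 0, 1, 0, 1, 1]
m = 2

S = [a + b for a, b in zip(X_a, X_b)]   # integer stream, Eq. (13)
expected = Fraction(sum(S), len(S))     # E[S_i] estimated over the stream
s = expected / m                        # implicit scaling, Eq. (17)
print(S, expected, s)
```

Each length-8 sequence alone can only resolve multiples of 1/8 (5/8 and 4/8 here), but together they carry 9 ones in 16 bits, so the length-8 integer stream represents 9/16 exactly.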
C. Multiplication In Integral SC
The main advantage of SC compared to its binary radix
format is the low complexity implementation of mathematical
operations. As shown in Section II-A, multiplication can be implemented
by using AND or XNOR gates depending on the coding format.
However, integer stochastic multipliers make use of binary
radix multipliers (see Fig. 7(a)). The multiplication of two real
numbers s1 ∈ [0,m] and s2 ∈ [0,m′] with integer stochastic
streams S1 and S2 in unipolar format is performed as follows:
$y = s_1 \times s_2 = E[S_i^1 \times S_i^2] = E[S_i^1] \times E[S_i^2]$, (18)
if $S_i^1$ and $S_i^2$ are independent.
The above equation holds true for integer stochastic multi-
plication in bipolar format as well. The implementation cost
of this multiplier strongly depends on m and m′. Considering
one of these two values to be equal to "1", the multiplication
can be implemented using bit-wise AND gate or a MUX as
depicted in Fig. 7(b). The range of y is [0,m × m′] in the
unipolar case, and [−m×m′,m×m′] in the bipolar case.
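Both multiplier variants can be illustrated with the streams of Fig. 7. In this Python sketch, `integer_multiply` models the binary radix multiplier and `gated_multiply` models the AND/MUX case in which one operand is a binary stream (range 1); both names are ours.

```python
def integer_multiply(s1, s2):
    """Element-wise product of two integer stochastic streams, Eq. (18)."""
    return [a * b for a, b in zip(s1, s2)]

def gated_multiply(s, x):
    """Integer stream times binary stream: pass S when X = 1, else 0."""
    return [a if bit else 0 for a, bit in zip(s, x)]

def mean(stream):
    return sum(stream) / len(stream)

# Fig. 7(a)
S1 = [2, 0, 1, 1, 0, 2, 1, 1]   # 8/8
S2 = [1, 2, 2, 0, 1, 2, 0, 2]   # 10/8
Ya = integer_multiply(S1, S2)

# Fig. 7(b)
S = [1, 2, 2, 2, 1, 2, 0, 2]    # 12/8
X = [1, 0, 0, 1, 0, 0, 0, 0]    # 2/8
Yb = gated_multiply(S, X)

print(Ya, mean(Ya))
print(Yb, mean(Yb))
```

The outputs reproduce the Y streams of Fig. 7, with values 10/8 and 3/8.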
D. Addition In Integral SC
Conventional SC suffers from precision loss incurred by
using scaled adder, making SC inappropriate for applications
which require many additions. On the other hand, integral SC
uses binary radix adders to perform additions in this domain,
preserving all information. Using (14), addition in unipolar
format is performed as follows:
$y = s_1 + s_2 = E[S_i^1 + S_i^2] = E[S_i^1] + E[S_i^2]$, (19)
since the expected value operator is linear.
Data: Integer value Si ∈ {−m, ..., m} where i ∈ {1, 2, ..., N}
Result: Yi
Counter ← Initial value;
for i ← 1 : N do
    Counter ← Counter + Si;
    if Counter > n×m-1 then
        Counter ← n×m-1;
    end
    if Counter < 0 then
        Counter ← 0;
    end
    if Counter > offset then
        Yi ← 1;
    else
        Yi ← 0;
    end
end
Algorithm 2: Pseudo code of the proposed algorithm for
integer stochastic FSM-based functions
Equation (19) also remains valid in the bipolar case. The
range of y is [0, m + m′] and [−(m + m′), m + m′] for the
unipolar and bipolar formats, respectively. This adder provides
some advantages similar to the APC. First of all, because
it retains all information provided as inputs, it reduces
the variance of the sum. Secondly, it potentially reduces
the bit-stream length required for computations compared to
conventional SC [22]. Moreover, the output of this adder is still
an integer stochastic stream, which can be used by subsequent
stochastic computational units, as opposed to the output of the APC.
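The addition of (19) amounts to an element-wise binary addition of the two integer streams. The Python sketch below reuses the streams S1 and S2 of Fig. 7(a) as inputs; this reuse and the function names are our choices for illustration.

```python
def integer_add(s1, s2):
    """Element-wise sum of two integer stochastic streams, Eq. (19)."""
    return [a + b for a, b in zip(s1, s2)]

def mean(stream):
    return sum(stream) / len(stream)

S1 = [2, 0, 1, 1, 0, 2, 1, 1]   # value 8/8,  range m  = 2
S2 = [1, 2, 2, 0, 1, 2, 0, 2]   # value 10/8, range m' = 2

Y = integer_add(S1, S2)
print(Y, mean(Y), max(Y))        # elements stay within [0, m + m']
```

No scaling factor appears: the output value is the plain sum of the input values, and the output is again an integer stream that a later unit can consume.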
E. FSM-Based Functions In Integral SC
The inputs of stochastic FSM-based tanh and exponen-
tiation functions are restricted to real values in the [-1, 1]
interval. Therefore, a desired tanh or exponentiation function
can be achieved by scaling down the inputs and adjusting the
term n in (11) and (12), which potentially increases bit-stream
length and results in long latency. The transition between the
states of the FSM is driven by the input bit of the bipolar
stream, which is either 1 or 0. This state transition can
be formulated as shown in Algorithm 1 for conventional SC.
According to Algorithm 1, an input bit of 1 or 0 is first
mapped to 1 or -1, respectively. The mapped value is then added
to the counter of the FSM. These mapped values are the same as
the elements of a bipolar integral
stochastic stream with m = 1. Therefore, the values of the
conventional stochastic stream can be viewed as hard values
of an integral stochastic stream. The FSM-based functions in
integral SC can be achieved by extending the conventional
FSM-based functions to support soft values in integral SC,
which is explained below.
[Fig. 8 plots: output (-1 to 1) versus s (-2 to 2). (a) Tanh(s) compared with
NStanh(4,S) for m = 2, NStanh(8,S) for m = 4, NStanh(16,S) for m = 8 and
Stanh(2,X). (b) Tanh(2s) compared with NStanh(8,S) for m = 2,
NStanh(16,S) for m = 4, NStanh(32,S) for m = 8 and Stanh(4,X).]
Fig. 8. (a) Integer stochastic implementation of tanh(s) and (b) Integer stochastic implementation of tanh(2s)
[Fig. 9 plots: output (0 to 1) versus s (0 to 2). (a) Exp(-s) compared with
NSexp(512,1,S) for m = 2, NSexp(1024,2,S) for m = 4 and NSexp(2048,4,S)
for m = 8. (b) Exp(-2s) compared with NSexp(1024,2,S) for m = 2,
NSexp(2048,4,S) for m = 4, NSexp(4096,8,S) for m = 8 and Sexp(512,1,X).]
Fig. 9. (a) Integer stochastic implementation of exp(−s) and (b) Integer stochastic implementation of exp(−2s)
The integer stochastic tanh and exponentiation functions are
proposed by generalizing Algorithm 1. In integral SC, each element
of a stochastic stream is represented using 2's complement or
sign-magnitude representation in {−m, . . . , m} for the bipolar
format. A state counter is increased or decreased according
to the integer input value Si ∈ {−m, . . . , m}, where i ∈
{1, 2, ..., N}. Therefore, the state counter is incremented or
decremented by up to m in each clock cycle, as opposed
to conventional FSM-based functions, which are restricted to
one-step transitions. The algorithm for integer FSM-based
functions is given in Algorithm 2.
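Algorithm 2 can be sketched in Python as follows for the tanh case. The initial counter value and the offset are not specified in the text above, so the sketch assumes the middle state and (number of states)/2 − 1, respectively; the stream generator and the use of Python's `random` module are also our assumptions.

```python
import math
import random

def nstanh(stream, states):
    """Algorithm 2 for tanh: the counter moves by the integer input S_i."""
    counter = states // 2            # assumed initial value
    offset = states // 2 - 1         # assumed threshold (upper half outputs 1)
    out = []
    for s in stream:
        counter = min(max(counter + s, 0), states - 1)   # saturate
        out.append(1 if counter > offset else 0)
    return out

def bipolar_integer_stream(s, m, length, rng):
    """Bipolar integer stream of value s, Eq. (15), from m binary streams."""
    p = (s + m) / (2 * m)            # probability of each binary stream, from Eq. (16)
    return [2 * sum(rng.random() < p for _ in range(m)) - m
            for _ in range(length)]

rng = random.Random(1)
s, m, n = 1.0, 2, 2                  # NStanh(m*n, S) = NStanh(4, S)
S = bipolar_integer_stream(s, m, 50000, rng)
Y = nstanh(S, m * n)

estimate = 2 * sum(Y) / len(Y) - 1   # bipolar value of the output stream
print(estimate, math.tanh(n * s / 2))
```

With m = 1 the input elements are restricted to −1 and 1 and the function behaves like Algorithm 1.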
The output of the proposed integer FSM-based functions
in integral SC domain and its encoding format are similar
to the conventional FSM-based functions. For instance, the
output of the integer tanh function is in bipolar format while
the output of integer exponentiation function is in unipolar
format. Moreover, the integer FSM-based functions require m
times more states than their conventional counterparts.
Therefore, the approximate transfer functions of the integer tanh
and exponentiation functions, which are referred to as NStanh
and NSexp, respectively, are:
$\tanh(ns/2) \approx E[\mathrm{NStanh}(m \times n, S)]$, (20)
$\exp(-2Gs) \approx E[\mathrm{NSexp}(m \times n, m \times G, S)]$ for $s > 0$. (21)
In order to show the validity of the proposed algorithm,
Monte-Carlo simulation is used. Fig. 8 illustrates two ex-
amples of the proposed NStanh function compared to its
corresponding Stanh and tanh functions for different values
of m. Simulation results show that NStanh is more accurate
than Stanh for m > 1 and that the accuracy improves
as the value of m increases. Moreover, NStanh is able to
approximate tanh for input values outside of the [-1, 1] range
with negligible performance loss, while Stanh does not work.
The proposed NStanh function can also approximate tanh
functions with fractional scaling factor, e.g. tanh(3s/2) ≈
E[NStanh(3×m, S)], as long as the value m is even, to make
sure that the number of states is even. The aforementioned
statements also hold true for NSexp, unlike with Sexp, as
shown in Fig. 9. The proposed FSM-based functions in integral
SC also result in a better approximation as the value of n
increases, similar to conventional stochastic FSM-based
functions. The hardware complexity of the proposed FSM-based
functions in a 65 nm CMOS technology is summarized
in Table I. The implementation results show that the proposed
FSM-based functions consume at most roughly 7 times more power
while having 8 times lower latency, which results in
a lower energy consumption compared to the conventional
FSM-based functions (i.e., FSM-based functions with m = 1).
Note that the stream length of the FSM-based functions denotes
the latency.
TABLE I
HARDWARE COMPLEXITY OF THE PROPOSED FSM-BASED FUNCTIONS @ 400 MHZ IN A 65 NM CMOS TECHNOLOGY
(area in µm2, power in µW, stream length in parentheses)
m          1 (1024)        2 (512)         4 (256)         8 (128)
           Area   Power    Area   Power    Area   Power    Area   Power
tanh(s)    24     3.5      74     9.8      117    18.3     150    24.9
tanh(2s)   63     9.4      107    17.2     141    22.6     182    31.2
exp(−s)    –      –        424    62.1     474    72.5     480    80
exp(−2s)   424    57.1     440    65.1     491    74.5     532    93.1
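The energy argument can be made concrete with a rough estimate from the power figures of Table I. The Python sketch below assumes that one stream element is processed per clock cycle at 400 MHz, so that energy is power multiplied by stream length and divided by the clock frequency; this model is our simplification.

```python
F_CLK_MHZ = 400
STREAM_LENGTH = {1: 1024, 2: 512, 4: 256, 8: 128}   # from Table I

# Power in microwatts, copied from the rows of Table I that include m = 1
POWER_UW = {
    "tanh(s)":  {1: 3.5,  2: 9.8,  4: 18.3, 8: 24.9},
    "tanh(2s)": {1: 9.4,  2: 17.2, 4: 22.6, 8: 31.2},
    "exp(-2s)": {1: 57.1, 2: 65.1, 4: 74.5, 8: 93.1},
}

def energy_pj(power_uw, m):
    """Energy per function evaluation in picojoules (uW * cycles / MHz)."""
    return power_uw * STREAM_LENGTH[m] / F_CLK_MHZ

for name, row in POWER_UW.items():
    print(name, {m: round(energy_pj(p, m), 2) for m, p in row.items()})
```

Under this assumption, the m = 8 design comes out below the m = 1 design in each of the three rows.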
IV. INTEGER STOCHASTIC IMPLEMENTATION OF DBN
A. A Review on the DBN Algorithm
DBNs are hierarchical graphical models obtained by
stacking RBMs on top of each other and training them in
a greedy unsupervised manner [4], [5]. DBNs take low-
level inputs and construct higher-level abstractions through
the composition of layers. Both the number of layers and the
number of inputs in each layer can be adjusted. Increasing
the number of layers and their size tends to improve the
performance of the network.
In this paper, we exploit a DBN constructed using two
layers of RBM, which are also called hidden layers, followed
by a classification layer at the end for handwritten digit
recognition. As a benchmark, we use the Modified National
Institute of Standards and Technology (MNIST) data set [27].
This data set provides thousands of 28×28 pixel images for
both training and testing procedures. Each pixel is represented
by an integer number between 0 and 255, requiring 8 bits for
digital representation. As mentioned in Section I, the training
procedure can be performed on remote servers in the cloud.
Therefore, the extracted weights are stored in a memory for
the hardware inference engine to classify the input images in
real-time.
[Fig. 10 diagram: visible nodes v1, v2, ..., vM, hidden layer 1,
hidden layer 2 and output nodes, with weights W1, W2 and W3
between successive layers. Each hidden node applies σ to a sum
of its inputs.]
Fig. 10. The high-level architecture of 2-layer DBN.
Fig. 10 shows the DBN used for handwritten digit classification
in this paper. Inputs of the DBN and outputs of a hidden
layer are hereafter referred to as visible nodes and hidden
nodes, respectively. Each hidden node is also called a neuron.
The hierarchical computations of each neuron are performed
as follows:
$z_j = \sum_{i=1}^{M} W_{ij} v_i + b_j$, (22)
$h_j = \frac{1}{1 + \exp(-z_j)} = \sigma(z_j)$, (23)
where M denotes the number of visible nodes, vi the value of
each visible node, Wij the extracted weights, bj the bias term, zj
the intermediate value, hj the output value of each hidden node
and j the index of each hidden node. The nonlinearity function
used in the DBN, i.e., equation (23), is called a sigmoid function.
The classification layer does not require a sigmoid function as
it is only used for quantization. In other words, the output node
with the maximum value denotes the recognized label.
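For reference, the neuron computations of (22) and (23) and the final label selection can be written in floating point as below. This is a plain software sketch with made-up toy weights, not the hardware architecture; the names `hidden_layer` and `classify` are ours.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden_layer(v, W, b):
    """Eqs. (22)-(23); W[i][j] connects visible node i to hidden node j."""
    M = len(v)
    return [sigmoid(sum(W[i][j] * v[i] for i in range(M)) + b[j])
            for j in range(len(b))]

def classify(outputs):
    """Index of the largest output value, taken as the recognized label."""
    return max(range(len(outputs)), key=lambda k: outputs[k])

# Toy values chosen only for illustration
v = [1.0, 0.0]
W = [[0.0, 2.0], [5.0, 5.0]]
b = [0.0, -2.0]

h = hidden_layer(v, W, b)
print(h, classify([0.1, 0.7, 0.2]))
```

Both toy neurons see a weighted sum of zero, so both outputs equal σ(0) = 0.5.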
B. The Proposed Stochastic Architecture of a DBN
VLSI implementations of a DBN network in binary form
are computationally expensive since they require many matrix
multiplications. Moreover, there is no straightforward way
to implement the sigmoid function in hardware. Therefore,
this unit is normally implemented by LUTs, which require
memory in addition to the memory used for storing
weights. Considering 10 bits for weights, 78400 10b×8b-
multipliers are required to do the matrix multiplications of
the first hidden layer for a parallel implementation of a
network with configuration of 784-100-200-10, meaning 784
visible nodes, 100 first-layer hidden nodes, 200 second-layer
hidden nodes and 10 output nodes. Note that the parallel
implementation of such a network results in huge silicon
area in part due to its routing congestion caused by the layer
interconnection.
Stochastic implementation of DBN is a promising approach
to perform the mentioned complex arithmetic operations using
simple and low-cost elements. In order to find the output
value of the first hidden node, 784 multiplications are
required, which can be easily performed by using AND gates in
unipolar format. The multiplier outputs should then be added
by using a scaled adder or an OR gate. Using
a scaled adder to sum 784 numbers requires an extremely
long bit-stream, because the output of this
adder is scaled down by a factor of 784, which yields a value
too small to be represented by a short stream. In [28], an OR gate is
used as an adder to perform this computation, while the inputs
are first scaled down to make the term A · B close to 0 in
(10), which potentially increases the stream length required for
computations. An APC is also proposed in [22] to realize the
matrix operations. Despite its good performance on additions,
it is not a suitable approach for a stochastic DBN, since it
converts the results to a binary form [22].
[Fig. 11 diagram: the inputs v1, v2, ..., vM and a constant 1 pass
through B2S converters; the weights W1, W2, ..., WM and the bias b
pass through B2IS converters; bit-wise AND blocks combine each
B2S output with a B2IS output; a tree adder sums the results and
feeds an NStanh block, which outputs a stochastic stream. Buses
are labeled Log2(m)+1 and Log2(m')+1.]
Fig. 11. The proposed integer stochastic neuron. The B2IS and B2S denote
binary to integer stochastic and binary to stochastic converters, respectively.
We have shown in Section III-A that the integer stochastic
stream can be generated by adding conventional stochastic
streams. Considering that the multiplications of the first layer
of a DBN are performed in conventional stochastic domain,
the nature of the algorithm is to add the multiplication results
together. Exploiting a binary tree adder, the addition result
remains in integer-stochastic form without any precision loss.
The sigmoid function can also be implemented in the integer
stochastic domain.
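The addition scheme described above can be illustrated with a minimal Python sketch. It rests on several assumptions that are not part of the proposed hardware: a NumPy pseudo-random generator stands in for the LFSR-based converters, all values are unipolar, and the helper name `to_stream`, the stream length and the input and weight values are purely illustrative.

```python
import numpy as np

def to_stream(p, length, rng):
    """Unipolar stochastic stream: each bit is 1 with probability p."""
    return (rng.random(length) < p).astype(np.int64)

rng = np.random.default_rng(0)      # stand-in for the LFSR-based converters
length = 4096                       # illustrative stream length
pixels = [0.25, 0.5, 0.75, 1.0]     # hypothetical unipolar inputs
weights = [0.5, 0.5, 0.25, 0.125]   # hypothetical unipolar weights

# AND gate: multiplies two independent unipolar streams bit by bit.
products = [to_stream(v, length, rng) & to_stream(w, length, rng)
            for v, w in zip(pixels, weights)]

# Tree adder: element-wise sum of the product streams. The result is an
# integer stream with values in {0, ..., 4}; no scaling factor is applied.
acc = np.sum(products, axis=0)

estimate = acc.mean()               # value represented by the integer stream
exact = sum(v * w for v, w in zip(pixels, weights))
print(f"estimate = {estimate:.3f}, exact = {exact:.4f}")
```

Because the sum is kept as an integer per clock cycle rather than being scaled or saturated, the estimate converges to the exact dot product as the stream length grows.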
It is well-known that the sigmoid function can be computed
using the tanh function as follows:
$$\sigma(x) = \frac{1 + \tanh\left(\frac{x}{2}\right)}{2}. \tag{24}$$
The tanh function can in turn be implemented by the NStanh
function (see (20)) in the integer stochastic domain. The output of
NStanh is a conventional stochastic stream in bipolar format.
Therefore, interpreting this same stream in unipolar format, according
to (24) and (2), makes the output of NStanh equivalent to the
sigmoid function in the stochastic domain.
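The step from the bipolar NStanh output to the sigmoid can be written out explicitly. The short derivation below assumes that (2) is the usual bipolar mapping y = 2p − 1, where p is the probability of observing a 1 in the stream, and that the NStanh unit is configured so that its bipolar value equals tanh(x/2); the symbols p and y are introduced here only for illustration.

```latex
\begin{align*}
y &= 2p - 1 = \tanh\!\left(\tfrac{x}{2}\right)
  && \text{(bipolar value carried by the NStanh output)} \\
p &= \frac{1 + y}{2}
   = \frac{1 + \tanh\!\left(\tfrac{x}{2}\right)}{2}
   = \sigma(x)
  && \text{(same stream read in unipolar format, by (24))}
\end{align*}
```

Under these assumptions no additional hardware is needed for the conversion: only the interpretation of the output stream changes.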
[Figure 12: histogram. Horizontal axis: "Inputs of NStanh function" (−15 to 15); vertical axis: "Counts" (0 to 60).]
Fig. 12. Histogram of integer values as inputs of NStanh function at the first
layer of a 784-100-200-10 DBN.
TABLE II
THE MISCLASSIFICATION ERROR OF THE PROPOSED ARCHITECTURES
FOR DIFFERENT NETWORK SIZES AND STREAM LENGTHS
Misclassification Error (%)
                  [29]             Proposed
Code Type         Floating Point   Integral SC
m                 –                1       2       4
Stream Length     –                1024    512     256
784-100-200-10    2.29             2.40    2.46    2.33
784-300-600-10    1.82             2.01    1.89    1.90
Fig. 11 shows the proposed integer stochastic architecture
of a single neuron. The input stream is generated in the
conventional stochastic domain, whereas the weights
are represented in 2’s complement format in the integer stochastic
domain with a range of m, which requires log2(m)+1 bits for
representation. The multiplications are performed bit-wise by
AND gates, since pixels and weights are represented by binary
stochastic streams and integral stochastic streams, respectively.
A tree adder and an NStanh unit are used to perform the
additions and the nonlinear function, respectively. The output
of the integer stochastic sigmoid function is represented by a
single wire in unipolar format. Therefore, the input and output
formats are the same, and the integer stochastic architecture of a DBN
is formed by stacking the proposed single-neuron architecture.
The input images require a minimum bit-stream length of
256, but since the weights lie in the [−4, 4] interval, they
require a minimum bit-stream length of 1024 in the conventional
stochastic domain. Therefore, the latency of the proposed
integer-stochastic implementation of the DBN is equal to 1024
for m = 1.
TABLE III
IMPLEMENTATION RESULTS OF THE PROPOSED ARCHITECTURE ON FPGA VIRTEX-7
Design     Network Size      Stream Length   Misclassification Error   Area (# of LUTs)   Latency (µs)   Throughput (Mbps)
Proposed   784-100-200-10    256             2.33%                     1,013,002          1.705          3822
Proposed   784-100-200-10    512             2.46%                     682,352            3.412          1874
Proposed   784-100-200-10    1024            2.40%                     437,461            6.503          974
[28]       784-100-200-10    1024            5.72%                     144,450            8.561          NA
[28]       784-300-600-10    1024            2.92%                     603,750            9.797          NA
[28]       784-500-1000-10   1024            2.32%                     1,292,310          10.77          NA
The input range of the NStanh function, i.e., the value of
m′ in Fig. 11, is selected through simulation: a window that
covers 95% of the data in the histogram of the adder outputs
identifies this range. For instance, Fig. 12 shows the
histogram of the integer values at the inputs of the NStanh function in
the first layer of a 784-100-200-10 DBN. This histogram was
generated using non-correlated stochastic inputs, and
the selected range for this network is m′ = 6.
The range strongly depends on the correlation
among the stochastic inputs, and it grows
as the correlation increases. For instance, the summation
of two correlated stochastic streams, {1, 1, 0, 0, 1, 0}
and {1, 1, 0, 1, 0, 0}, each representing the real value 0.5, results in
the integral stochastic stream {2, 2, 0, 1, 1, 0} and an input range
of 2, while the summation of two uncorrelated stochastic streams,
{0, 0, 1, 0, 1, 1} and {1, 1, 0, 1, 0, 0}, each representing the real value
0.5, results in the integral stochastic stream {1, 1, 1, 1, 1, 1} and an
input range of 1. Correlation among the inputs is introduced
when the same LFSR units are shared among several inputs
in order to reduce hardware area. In this paper, the set of
LFSR units used for one neuron is shared with all the
other neurons. More precisely, 785 11-bit LFSRs with different
seeds are used in total to generate all the inputs and weights of
the proposed DBN architectures and to guarantee non-correlated
stochastic streams.
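The two-stream example and the window-based range selection can be reproduced with a short Python sketch. The helper names `input_range` and `range_covering` are hypothetical, and the choice of a window that is symmetric around zero is an assumption made here for illustration; the text only states that the window covers 95% of the data.

```python
import numpy as np

def input_range(*streams):
    """Largest value of the integral stream obtained by adding binary streams."""
    return int(np.sum(streams, axis=0).max())

def range_covering(samples, coverage=0.95):
    """Smallest bound r such that at least `coverage` of the samples lie in
    [-r, r] (symmetric window: an assumption of this sketch)."""
    mags = np.sort(np.abs(np.asarray(samples)))
    idx = int(np.ceil(coverage * len(mags))) - 1
    return int(mags[idx])

# The two pairs of streams used in the text, each stream representing 0.5.
correlated = ([1, 1, 0, 0, 1, 0], [1, 1, 0, 1, 0, 0])
uncorrelated = ([0, 0, 1, 0, 1, 1], [1, 1, 0, 1, 0, 0])

sum_corr = np.sum(correlated, axis=0)      # integral stream of the first pair
sum_unc = np.sum(uncorrelated, axis=0)     # integral stream of the second pair
print(sum_corr.tolist(), input_range(*correlated))
print(sum_unc.tolist(), input_range(*uncorrelated))

# Hypothetical adder outputs: 90% zeros, 5% at +/-3 and 5% outliers at +/-10.
adder_outputs = [0] * 90 + [3, -3, 3, -3, 3] + [10, -10, 10, -10, 10]
print(range_covering(adder_outputs))
```

Both pairs encode the same values, yet the first pair needs a range of 2 and the second only 1, which is why the range found by simulation depends on how the random sources are shared.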
V. IMPLEMENTATION AND SIMULATION RESULTS
A. Misclassification Error Rate Comparison
The misclassification error rate of a DBN plays a crucial role
in the performance of the system. In this part, the misclassification
errors of the proposed integer stochastic architectures
of DBNs with different configurations are summarized in
Table II. Simulation results have been obtained by using
MATLAB on the 10,000 MNIST handwritten test digits [27] for
both the floating-point code and the proposed architecture, using
LFSRs as the stream generators. The method proposed in
[29] is used as our training core to extract the network
weights. In fixed-point format, a precision of 10 bits is used
to represent the weights. A stochastic stream of equivalent
precision requires a length of 1024. The length of the stream
can be reduced by increasing m. For example, using m = 2
the length can be reduced to 512, and using m = 4 it can be
reduced to 256. Because the input pixels only require 8 bits
of precision, they can be represented using a binary (m = 1)
stochastic stream of length 256. Therefore, by using m = 1
for the pixels and m = 4 for the weights, it is possible
to reduce the stream length to 256 while still using AND
gates to implement multiplications. The simulation results
show that the performance loss of the proposed integer
stochastic DBN is negligible for the different network sizes
compared to their floating-point versions. The reported
misclassification errors for the
proposed integral stochastic architecture were obtained using
LFSR units as random number generators in MATLAB.
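The relation between precision, the range m and the stream length used in this paragraph can be summarized in a few lines of Python. This is a sketch under the assumption, implied by the numbers quoted above, that a stream of range m needs 2^n / m cycles to match n bits of precision; the function name `stream_length` is hypothetical.

```python
def stream_length(precision_bits, m=1):
    """Stream length that matches `precision_bits` of fixed-point precision
    for an integer stochastic stream of range m (assumed to be 2**n / m)."""
    return (2 ** precision_bits) // m

# Weights: 10 bits of precision, for m = 1, 2 and 4.
weight_lengths = {m: stream_length(10, m) for m in (1, 2, 4)}
# Pixels: 8 bits of precision in a binary (m = 1) stochastic stream.
pixel_length = stream_length(8, 1)

print(weight_lengths, pixel_length)
```

With m = 4 for the weights, both operands of each AND gate need the same number of cycles, so neither the pixel stream nor the weight stream has to be extended.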
B. FPGA Implementation
TABLE IV
ASIC IMPLEMENTATION RESULTS FOR A 784-100-200-10 NETWORK @
400 MHZ AND 1V IN A 65 NM CMOS TECHNOLOGY
Implementation Type             Integral SC              Binary Radix
Stream Length                   256     512     1024     –
Misclassification error [%]     2.33    2.46    2.40     2.3
Energy [µJ]                     2.96    3.3     3.35     0.380
Gate Count [M Gates (NAND2)]    4.2     2.2     1.1      23.6
Latency [ns]                    650     1290    2570     30
As mentioned previously, a fully- or semi-parallel VLSI
implementation of a DBN in binary form requires a lot of
hardware resources. Therefore, many works target FPGAs
[30]–[35], but none manage to fit a fully-parallel deep neural
network architecture in a single FPGA board. Recently, a
fully pipelined FPGA architecture of a factored RBM (fRBM)
was proposed in [9], which could implement a single-layer
neural network consisting of 4096 nodes using a virtualization
technique, i.e., time-multiplexed sharing, on a Virtex-6
FPGA board. However, the largest fRBM neural network
achievable without virtualization is on the order of 256 nodes.
In [28], a stochastic implementation of a DBN on an FPGA
board is presented for different network sizes; however, this
architecture cannot achieve the same misclassification error
rate as a software implementation. Table III shows both
the hardware implementation and performance results of the
proposed integer stochastic architecture of the DBN for different
stream lengths on a Virtex7 xc7v2000t Xilinx FPGA, alongside
the results reported in [28]. The
implementation results show that the misclassification error of
the proposed architectures for the network size of 784-100-200-10
is about the same as that of the largest network presented in [28], i.e.,
the network size of 784-500-1000-10, while the area of the
proposed designs is reduced by 66%, 47% and 21% for m
= 1, m = 2 and m = 4, respectively. Moreover, the latency of the proposed
architectures is reduced by 40%, 63% and 84% for m =
1, m = 2 and m = 4, respectively. Therefore, as the value of m increases,
the latency of the integer stochastic hardware is reduced and
it becomes suitable for throughput-intensive applications. Note
that the reported areas in Table III include the costs of the B2S
and B2IS units.
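The area and latency reductions can be recomputed from the entries of Table III with the following Python sketch. It assumes that m = 1, 2 and 4 correspond to the stream lengths 1024, 512 and 256 (as in Table II) and that the reference is the 784-500-1000-10 network of [28]; the helper name `reduction` is hypothetical.

```python
def reduction(new, ref):
    """Relative reduction of `new` with respect to `ref`, in percent."""
    return 100.0 * (1.0 - new / ref)

# Reference: 784-500-1000-10 network of [28] in Table III.
REF_AREA_LUTS, REF_LATENCY_US = 1_292_310, 10.77

# Proposed 784-100-200-10 designs: m -> (area in LUTs, latency in µs).
proposed = {1: (437_461, 6.503), 2: (682_352, 3.412), 4: (1_013_002, 1.705)}

area_red = {m: reduction(a, REF_AREA_LUTS) for m, (a, _) in proposed.items()}
lat_red = {m: reduction(t, REF_LATENCY_US) for m, (_, t) in proposed.items()}

for m in proposed:
    print(f"m = {m}: area -{area_red[m]:.1f}%, latency -{lat_red[m]:.1f}%")
```

The area values obtained this way agree with the quoted figures to within rounding. For m = 2, the latency entries of Table III give about 68% rather than the quoted 63%; the sketch only reports what the table entries yield and does not determine which figure is intended.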
C. ASIC Implementation
Table IV shows the ASIC implementation results for the
784-100-200-10 network, including a fixed-point (binary radix)
implementation. Despite the improvements that the proposed architectures
provide over previously proposed stochastic implementations,
the stochastic implementations still use more energy than
the fixed-point implementation in 65 nm CMOS, even if the
power consumption and area of a stochastic neuron are smaller.
A similar result was also obtained in [36] for stochastic
implementations of image processing circuits.
TABLE V
ASIC IMPLEMENTATION RESULTS FOR A 784-300-600-10 NETWORK BASED ON INTEGRAL SC @ 400 MHZ AND 1V IN A 65 NM CMOS TECHNOLOGY
Implementation Type   Network Configuration   Value of m   Stream Length   Misclassification error [%]   Energy [µJ]   Gate Count [M Gates (NAND2)]   Latency [ns]
Integral SC           784-300-600-10          1            64              2.49                          0.740         5.4                            170
Integral SC           784-300-600-10          1            128             2.24                          1.436         5.6                            330
Integral SC           784-300-600-10          1            256             2.22                          2.802         5.6                            650
Integral SC           784-300-600-10          1            512             2.07                          5.640         5.6                            1290
Integral SC           784-300-600-10          2            32              2.42                          0.505         9.2                            90
Integral SC           784-300-600-10          2            64              2.30                          1.029         9.7                            170
Integral SC           784-300-600-10          2            128             2.24                          1.997         10.2                           330
Integral SC           784-300-600-10          2            256             1.96                          3.933         10.2                           650
Integral SC           784-300-600-10          4            16              2.27                          0.299         15.6                           50
Integral SC           784-300-600-10          4            32              2.22                          0.640         16.9                           90
Integral SC           784-300-600-10          4            64              2.07                          1.28          17.8                           170
Integral SC           784-300-600-10          4            128             1.95                          2.53          18.9                           330
Binary Radix          784-100-200-10          –            –               2.3                           0.380         23.6                           30
TABLE VI
DEVIATIONS OF LAYER-1 AND LAYER-2 NEURONS FOR A
784-300-600-10 NETWORK
Deviation (%)
Supply Voltage   Layer-1 Neuron   Layer-2 Neuron
0.7V             17.90            15.41
0.75V            8.57             3.95
0.8V             0.011            ≈ 0
In order to improve the energy consumption of the proposed
stochastic architectures, we select a bigger network with
a lower misclassification rate and reduce the stream length to
achieve roughly the same misclassification error rate as the
binary radix implementation in Table IV. The implementation
results of a 784-300-600-10 neural network based on integral
SC for different stream lengths and values of m are summarized
in Table V. The implementation results show that the
integral stochastic architecture with m = 4 and a stream
length of 16, at a misclassification error rate of about 2.3%, consumes
21% less energy and occupies 34% less area than the
binary radix implementation.
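The selection of this operating point can be retraced from Table V with a short Python sketch. The data are copied from Table V; the selection rule (the lowest-energy configuration whose error does not exceed that of the binary radix design) and the helper name `cheapest_point` are assumptions made here to illustrate the reasoning.

```python
# Table V: m -> list of (stream length, error [%], energy [µJ], gates [M NAND2]).
table_v = {
    1: [(64, 2.49, 0.740, 5.4), (128, 2.24, 1.436, 5.6),
        (256, 2.22, 2.802, 5.6), (512, 2.07, 5.640, 5.6)],
    2: [(32, 2.42, 0.505, 9.2), (64, 2.30, 1.029, 9.7),
        (128, 2.24, 1.997, 10.2), (256, 1.96, 3.933, 10.2)],
    4: [(16, 2.27, 0.299, 15.6), (32, 2.22, 0.640, 16.9),
        (64, 2.07, 1.28, 17.8), (128, 1.95, 2.53, 18.9)],
}
BINARY = {"error": 2.3, "energy": 0.380, "gates": 23.6}  # binary radix column

def cheapest_point(rows, target_error):
    """Lowest-energy row whose error does not exceed `target_error`."""
    ok = [r for r in rows if r[1] <= target_error]
    return min(ok, key=lambda r: r[2]) if ok else None

best = {m: cheapest_point(rows, BINARY["error"]) for m, rows in table_v.items()}
length, error, energy, gates = best[4]
energy_saving = 100.0 * (1.0 - energy / BINARY["energy"])
area_saving = 100.0 * (1.0 - gates / BINARY["gates"])

for m, row in best.items():
    print(f"m = {m}: stream length {row[0]}, energy {row[2]} µJ")
print(f"m = 4 savings: energy {energy_saving:.1f}%, area {area_saving:.1f}%")
```

Under this rule only the m = 4 configuration falls below the energy of the binary radix design, which matches the operating point chosen in the text.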
D. Quasi-Synchronous Implementations
In order to further reduce the energy consumption of the
system, we also consider a quasi-synchronous implementation,
in which the supply voltage of the circuit is reduced below the
critical voltage, permitting some timing violations to occur.
Timing violations introduce deviations in the computations,
but because the stochastic architecture is fault-tolerant, we
can obtain the same classification performance by slightly
increasing the length of the streams. This yields further energy
savings without any compromise on performance.
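The intuition behind this fault tolerance can be illustrated with a small Python sketch. It is not the timing-violation model used in this work: random bit flips stand in for deviations, a NumPy generator stands in for the LFSRs, and the value 0.75, the flip rate and the 10-bit word are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
length, flip_rate = 1024, 0.01

# Stochastic representation of 0.75: every bit carries the same weight.
stream = (rng.random(length) < 0.75).astype(np.int64)
flips = (rng.random(length) < flip_rate).astype(np.int64)
faulty = stream ^ flips
stochastic_error = abs(faulty.mean() - stream.mean())

# Fixed-point representation of 0.75 as a 10-bit unsigned word (768 / 1024):
# a single flip of the most significant bit changes the value by one half.
word = 768
faulty_word = word ^ (1 << 9)
fixed_error = abs(faulty_word - word) / 1024

print(f"stochastic error: {stochastic_error:.4f} ({int(flips.sum())} flipped bits)")
print(f"fixed-point error for one MSB flip: {fixed_error}")
```

In the stream, the error can never exceed the fraction of flipped bits, whereas in the binary word the impact of one flip depends on its position.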
We characterize the effect of timing violations on the
algorithm by studying small test circuits that can be simulated
quickly, using the same approach as in [37]. In the proposed
architecture, the same processing circuit can be replicated
several times to form each layer, depending on the required
degree of parallelism. Therefore, we characterize the effect
of timing violations on these small processing circuits: each
neuron processor (one for each layer) is synthesized in a
65 nm CMOS technology and deviations are measured at
different voltages, from 0.7V to 1.0V in 0.05V increments,
as shown in Table VI. Note that no deviations are observed
when the supply voltage is larger than 0.8V. The outputs of the first
and second layers are binary, while the output of the classification
layer has 6 bits. Binary to stochastic converter units are also
included for each neuron, and the weights are hard-coded
in the implementations.
TABLE VII
ASIC IMPLEMENTATION RESULTS FOR A 784-300-600-10 NETWORK @
400 MHZ IN A 65 NM CMOS TECHNOLOGY UNDER FAULTY CONDITIONS
Implementation Type                        Integral SC
Supply Voltage (Layer-1–layer-2–layer-3)   0.8–0.7–0.8    0.75–0.75–0.8   0.8–0.8–0.8
Stream Length                              30             30              22
Misclassification error [%]                2.29           2.28            2.30
Energy [µJ] (improvement w.r.t. 1V)        0.283 (-5%)    0.286 (-4%)     0.256 (-14%)
Gate Count [M Gates (NAND2)]               15.6           15.6            15.6
Latency [ns]                               85             85              65
The deviation error of the layer-3 neuron at 0.7V and 0.75V
results in a very large misclassification error. Allowing large
deviations in that layer is not beneficial: since there are
only 10 neurons in the third layer, we do not
expect the supply voltage of the layer-3 processing circuits to have
a big impact on the overall energy consumption. Therefore,
the layer-3 neurons are supplied with 0.8V. Note that no
deviations are observed in the layer-3 neurons when the supply
voltage is 0.8V.
The performance results for a 784-300-600-10 network
and m = 4 at different supply voltages are provided in
Table VII. The misclassification performance obtained by the
quasi-synchronous system is very similar to the performance
of the reliable system, despite the fact that the deviation rate is
up to 9% in layer-1 neurons and 16% in layer-2 neurons. This
results in up to 14% lower energy consumption without any
compromise on performance. On the other hand, introducing
bit-wise deviations at a rate of 1% in the fixed-point system
results in an 87% misclassification rate. Note that the reported
implementation results in this paper include the costs of the B2S and
B2IS units.
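The energy figures of Table VII can be related to the reliable design and to the binary radix design with the following Python sketch. It assumes that the 1V reference is the m = 4, stream length 16 entry of Table V (0.299 µJ), an assumption that reproduces the percentages listed in Table VII; the helper name `saving` is hypothetical.

```python
REF_1V = 0.299        # µJ, Table V: m = 4, stream length 16 (assumed reference)
BINARY_RADIX = 0.380  # µJ, Table IV: binary radix implementation

# Table VII: supply voltages (layer-1, layer-2, layer-3) -> energy in µJ.
table_vii = {
    "0.8-0.7-0.8": 0.283,
    "0.75-0.75-0.8": 0.286,
    "0.8-0.8-0.8": 0.256,
}

def saving(energy, reference):
    """Energy saving with respect to `reference`, in percent."""
    return 100.0 * (1.0 - energy / reference)

vs_1v = {k: saving(e, REF_1V) for k, e in table_vii.items()}
vs_binary = {k: saving(e, BINARY_RADIX) for k, e in table_vii.items()}

for k in table_vii:
    print(f"{k}: {vs_1v[k]:.1f}% vs. 1V, {vs_binary[k]:.1f}% vs. binary radix")
```

The 0.8–0.8–0.8 configuration gives the largest saving in both comparisons, even though it uses the highest supply voltages of the three, because it needs the shortest stream.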
Moreover, because a stochastic implementation is much
more fault-tolerant than a fixed-point implementation, it can
be preferable for future process technologies, and in particular
for inherently unreliable ones such as nanoscale memristor
devices. Note that memristor devices consume substantially
less energy compared to CMOS and can be scaled to sizes
below 10 nm [23]. In [23], stochastic implementations were
suggested as a promising approach for use in such devices.
VI. CONCLUSION
Integral SC makes the hardware implementation of
precision-intensive applications feasible in the stochastic do-
main, and allows computations to be performed with streams
of different lengths, which can improve the latency of the
system. An efficient stochastic implementation of a deep belief
network is proposed using integral SC. The simulation and
implementation results show that the proposed design reduces
the area occupation by 66% and the latency by 84% with
respect to the state of the art. We also showed that the
proposed design consumes 21% less energy than its binary
radix counterpart. Moreover, the proposed architectures can
save up to 33% energy consumption w.r.t. the binary radix
implementation by using a quasi-synchronous implementation
without any compromise on performance.
ACKNOWLEDGEMENT
The authors would like to thank C. Condo for his helpful
suggestions.
REFERENCES
[1] A. Ardakani, F. Leduc-Primeau, N. Onizawa, T. Hanyu, and W. J.
Gross, “VLSI Implementation of Deep Neural Network Using Integral
Stochastic Computing,” in Int. Symp. on Turbo Codes & Iterative
Information Processing, 2016, pp. 1–5.
[2] S. Park, K. Bong, D. Shin, J. Lee, S. Choi, and H.-J. Yoo, “4.6 A
1.93TOPS/W scalable deep learning/inference processor with tetra-parallel
MIMD architecture for big-data applications,” in IEEE Int.
Solid-State Circuits Conference (ISSCC), Feb 2015, pp. 1–3.
[3] G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-Dependent Pre-Trained
Deep Neural Networks for Large-Vocabulary Speech Recognition,”
IEEE Trans. on Audio, Speech, and Language Processing, vol. 20, no. 1,
pp. 30–42, Jan 2012.
[4] G. Hinton, S. Osindero, and Y. Teh, “A Fast Learning Algorithm for
Deep Belief Nets,” Neural Computation, vol. 18, no. 7, pp. 1527–1554,
July 2006.
[5] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of
data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507,
July 2006.
[6] M. A. Arbib, Ed., The Handbook of Brain Theory and Neural Networks,
2nd ed. Cambridge, MA, USA: MIT Press, 2002.
[7] P. Luo, Y. Tian, X. Wang, and X. Tang, “Switchable Deep Network for
Pedestrian Detection,” in IEEE Conf. on Computer Vision and Pattern
Recognition (CVPR), June 2014, pp. 899–906.
[8] X. Zeng, W. Ouyang, and X. Wang, “Multi-stage Contextual Deep
Learning for Pedestrian Detection,” in IEEE Int. Conf. on Computer
Vision (ICCV), Dec 2013, pp. 121–128.
[9] L.-W. Kim, S. Asaad, and R. Linsker, “A Fully Pipelined FPGA
Architecture of a Factored Restricted Boltzmann Machine Artificial
Neural Network,” ACM Trans. Reconfigurable Technol. Syst., vol. 7,
no. 1, pp. 5:1–5:23, Feb. 2014.
[10] A. Alaghi, C. Li, and J. Hayes, “Stochastic circuits for real-time image-
processing applications,” in 50th ACM/EDAC/IEEE Design Automation
Conference (DAC), May 2013, pp. 1–6.
[11] S. Tehrani, S. Mannor, and W. Gross, “Fully Parallel Stochastic LDPC
Decoders,” IEEE Trans. on Signal Processing, vol. 56, no. 11, pp. 5692–
5703, Nov 2008.
[12] Y. Ji, F. Ran, C. Ma, and D. Lilja, “A hardware implementation of a
radial basis function neural network using stochastic logic,” in Design,
Automation Test in Europe Conference Exhibition (DATE), March 2015,
pp. 880–883.
[13] Y. Liu and K. K. Parhi, “Architectures for Recursive Digital Filters
Using Stochastic Computing,” IEEE Transactions on Signal Processing,
vol. 64, no. 14, pp. 3705–3718, July 2016.
[14] B. Yuan and K. K. Parhi, “Successive cancellation decoding of polar
codes using stochastic computing,” in IEEE Int. Symp. on Circuits and
Systems (ISCAS), May 2015, pp. 3040–3043.
[15] W. Qian, X. Li, M. D. Riedel, K. Bazargan, and D. J. Lilja, “An
Architecture for Fault-Tolerant Computation with Stochastic Logic,”
IEEE Transactions on Computers, vol. 60, no. 1, pp. 93–105, Jan 2011.
[16] P. Li and D. J. Lilja, “Using stochastic computing to implement digital
image processing algorithms,” in IEEE 29th International Conference
on Computer Design (ICCD), Oct 2011, pp. 154–161.
[17] P. Li, D. J. Lilja, W. Qian, K. Bazargan, and M. D. Riedel, “Computation
on Stochastic Bit Streams Digital Image Processing Case Studies,” IEEE
Transactions on Very Large Scale Integration (VLSI) Systems, vol. 22,
no. 3, pp. 449–462, March 2014.
[18] A. Alaghi, C. Li, and J. P. Hayes, “Stochastic Circuits for Real-time
Image-processing Applications,” in Proceedings of the 50th Annual
Design Automation Conference, ser. DAC ’13. New York, NY, USA:
ACM, 2013, pp. 136:1–136:6.
[19] J. L. Rosselló, V. Canals, and A. Morro, “Hardware implementation
of stochastic-based Neural Networks,” in The 2010 International Joint
Conference on Neural Networks (IJCNN), July 2010, pp. 1–4.
[20] J. Dickson, R. McLeod, and H. Card, “Stochastic arithmetic implemen-
tations of neural networks with in situ learning,” in IEEE Int. Conf. on
Neural Networks, 1993, pp. 711–716 vol.2.
[21] B. Gaines, “Stochastic Computing Systems,” in Advances in Information
Systems Science, ser. Advances in Information Systems Science, J. Tou,
Ed. Springer US, 1969, pp. 37–172.
[22] P.-S. Ting and J. Hayes, “Stochastic Logic Realization of Matrix
Operations,” in 17th Euromicro Conf. on Digital System Design (DSD),
Aug 2014, pp. 356–364.
[23] P. Knag, W. Lu, and Z. Zhang, “A native stochastic computing architec-
ture enabled by memristors,” IEEE Trans. on Nanotechnology, vol. 13,
no. 2, pp. 283–293, March 2014.
[24] P. Li, W. Qian, and D. Lilja, “A stochastic reconfigurable architecture
for fault-tolerant computation with sequential logic,” in IEEE 30th
International Conference on Computer Design (ICCD), Sept 2012, pp.
303–308.
[25] B. Brown and H. Card, “Stochastic neural computation. I. Computational
elements,” IEEE Trans. on Computers, vol. 50, no. 9, pp. 891–905, Sep
2001.
[26] D. Cai, A. Wang, G. Song, and W. Qian, “An ultra-fast parallel
architecture using sequential circuits computing on random bits,” in
IEEE International Symposium on Circuits and Systems (ISCAS2013),
May 2013, pp. 2215–2218.
[27] Y. Lecun and C. Cortes, “The MNIST database of handwritten digits.”
[Online]. Available: http://yann.lecun.com/exdb/mnist/
[28] B. Li, M. Najafi, and D. J. Lilja, “An FPGA implementation of a
Restricted Boltzmann Machine classifier using stochastic bit streams,”
in IEEE 26th Int. Conf. on Application-specific Systems, Architectures
and Processors (ASAP), July 2015, pp. 68–69.
[29] M. Tanaka and M. Okutomi, “A Novel Inference of a Restricted
Boltzmann Machine,” in 22nd Int. Conf. on Pattern Recognition (ICPR),
Aug 2014, pp. 1526–1531.
[30] C. Cox and W. Blanz, “GANGLION-a fast hardware implementation
of a connectionist classifier,” in Proc. of the IEEE Custom Integrated
Circuits Conf., May 1991, pp. 6.5/1–6.5/4.
[31] J. Zhao and J. Shawe-Taylor, “Stochastic connection neural networks,”
in Fourth Int. Conf. on Artificial Neural Networks, Jun 1995, pp. 35–39.
[32] M. Skubiszewski, “An exact hardware implementation of the Boltzmann
machine,” in Proc. of the Fourth IEEE Symposium on Parallel and
Distributed Processing, Dec 1992, pp. 107–110.
[33] S. K. Kim, L. McAfee, P. McMahon, and K. Olukotun, “A highly
scalable Restricted Boltzmann Machine FPGA implementation,” in Int.
Conf. on Field Programmable Logic and Applications, Aug 2009, pp.
367–372.
[34] D. Ly and P. Chow, “A multi-FPGA architecture for stochastic Restricted
Boltzmann Machines,” in Int. Conf. on Field Programmable Logic and
Applications, Aug 2009, pp. 168–173.
[35] D. Le Ly and P. Chow, “High-Performance Reconfigurable Hardware
Architecture for Restricted Boltzmann Machines,” IEEE Trans. on
Neural Networks, vol. 21, no. 11, pp. 1780–1792, Nov 2010.
[36] P. Li, D. Lilja, W. Qian, K. Bazargan, and M. Riedel, “Computation
on stochastic bit streams digital image processing case studies,” IEEE
Trans. on Very Large Scale Integration (VLSI) Systems, vol. 22, no. 3,
pp. 449–462, March 2014.
[37] F. Leduc-Primeau, F. R. Kschischang, and W. J. Gross, “Modeling
and Energy Optimization of LDPC Decoder Circuits with Timing
Violations,” CoRR, vol. abs/1503.03880, 2015. [Online]. Available:
http://arxiv.org/abs/1503.03880
