ESSOP: Efficient and Scalable Stochastic Outer Product Architecture for
  Deep Learning by Joshi, Vinay et al.
Vinay Joshi∗†, Geethan Karunaratne∗§, Manuel Le Gallo∗, Irem Boybat∗‡, Christophe Piveteau∗§,
Abu Sebastian∗, Bipin Rajendran¶ and Evangelos Eleftheriou∗
Email: vinayjoshi.iitb@gmail.com, {kar,anu,ibo,piv,ase,ele}@zurich.ibm.com, bipin.rajendran@kcl.ac.uk
∗IBM Research - Zurich, 8803 Rüschlikon, Switzerland
†New Jersey Institute of Technology (NJIT), Newark, NJ 07102, USA
‡Ecole Polytechnique Federale de Lausanne (EPFL), 1015 Lausanne, Switzerland
§ETH Zürich, 8092 Zürich, Switzerland
¶King’s College London, Strand, London WC2R 2LS, United Kingdom
Abstract—Deep neural networks (DNNs) have surpassed
human-level accuracy in a variety of cognitive tasks but at the
cost of significant memory/time requirements in DNN training.
This limits their deployment in energy and memory limited
applications that require real-time learning. Matrix-vector mul-
tiplications (MVM) and vector-vector outer product (VVOP) are
the two most expensive operations associated with training of
DNNs. Strategies to improve the efficiency of MVM computation
in hardware have been demonstrated with minimal impact on
training accuracy. However, the VVOP computation remains a
relatively less explored bottleneck even with the aforementioned
strategies. Stochastic computing (SC) has been proposed to
improve the efficiency of VVOP computation but on relatively
shallow networks with bounded activation functions and floating-
point (FP) scaling of activation gradients. In this paper, we
propose ESSOP, an efficient and scalable stochastic outer product
architecture based on the SC paradigm. We introduce efficient
techniques to generalize SC for weight update computation in
DNNs with the unbounded activation functions (e.g., ReLU),
required by many state-of-the-art networks. Our architecture
reduces the computational cost by re-using random numbers
and replacing certain FP multiplication operations by bit shift
scaling. We show that the ResNet-32 network with 33 convolution
layers and a fully-connected layer can be trained with ESSOP on
the CIFAR-10 dataset to achieve baseline comparable accuracy.
Hardware design of ESSOP at 14 nm technology node shows
that, compared to a highly pipelined FP16 multiplier design,
ESSOP is 82.2% and 93.7% better in energy and area efficiency
respectively for outer product computation.
I. INTRODUCTION
Research on developing accelerators for training deep neural
networks (DNNs) has attracted significant interest. Several
potential applications such as autonomous navigation, health
care, and mobile devices require learning in-the-field while
adhering to strict memory and energy budgets. DNN training
demands significant time and compute/memory. The two most
expensive computations in DNNs are the matrix-vector mul-
tiplications (MVM) and vector-vector outer product (VVOP)
and both require $O(N^2)$ multiplications for a layer with a
weight matrix of size N×N. Several strategies to improve
efficiency of MVM computation have been proposed with min-
imal impact on the training accuracy. These strategies leverage
either low precision digital representation [1], [2] or crossbar
architectures [3], [4], [5], [6]. Less precise implementations
of MVM are shown to perform sufficiently well for DNN
training [1], [2], [7], [8], [9], [10].
For improving the efficiency of the VVOP computation used to
calculate the weight updates, the algorithmic ideas that have
been proposed so far require expensive multiplier circuits [11],
[12], [13]. Stochastic computing (SC) has been suggested as an
efficient alternative to floating point (FP) multiplications, provided
that the operands are real numbers in [0, 1]. This poses challenges
for DNN training as operations such as ReLU, batchnorm, etc.
have unbounded outputs. Moreover, small error gradients are
often quantized to 0 due to the limited precision in the range
[0, 1].
The main contributions of this paper are the following: (1)
We propose an SC-based efficient architecture ESSOP for
computing weight updates for DNN training. (2) We intro-
duce efficient schemes to generalize SC-based multiplier to
unbounded activation functions (e.g. ReLU) that are essential
for DNN training [14], [15], [16]. (3) We show that these
improvements have minimal effect on training accuracy of
a deep convolution neural network (CNN). (4) Post place
and route results at 14 nm CMOS show that ESSOP design
is 82.2 % and 93.7 % better in energy and area efficiency
respectively, compared to a highly pipelined FP16 multiplier
design for outer product computation.
II. BACKGROUND AND MOTIVATION
A. Neural Network Training
DNN training proceeds in three phases, namely (1) forward
propagation, (2) backpropagation and (3) weight update. As
shown in Fig. 1, MVM operation is essential in forward and
backpropagation while during the weight update phase, the
VVOP is computed between the error gradient ∆ of that layer
and the output activations of the previous layer X to calculate
the weight update matrix δW (see Eq. (1)). Note that this δW
calculation, in general, applies to both fully-connected and
convolution layers [17].
$$\delta W = \Delta \times X^{T} \tag{1}$$
Copyright © 2020 IEEE. Personal use of this material is permitted. However, permission to use this
material for any other purposes must be obtained from the IEEE by sending a request to pubs-permissions@ieee.org.
This paper has been accepted at ISCAS 2020 for publication.
arXiv:2003.11256v1 [cs.LG] 25 Mar 2020
[Fig. 1 graphic. Panel (a), matrix-vector multiplication (MVM): Y = W1 X and Δ1 = W2^T Δ2; efficient hardware implementation is possible by memristive crossbar, low precision weights, etc. Panel (b), vector-vector outer product (VVOP): δW1 = Δ1 X^T; less explored problem, proposed solution ESSOP.]
Fig. 1. Illustration of the two bottleneck operations in DNN training. Each
of the illustrated operations requires $O(N^2)$ multiplications. Unlike MVM,
VVOP is a relatively less explored problem that eventually becomes the
bottleneck even when MVM is efficiently implemented in hardware. The
ESSOP architecture addresses this problem to enable efficient hardware
implementation of DNN training.
B. Stochastic Computing (SC)
SC is a method of computing arithmetic operations using
the stochastic representation of real numbers constrained to
the interval [0, 1], instead of using real valued operands [18],
[19], [20]. For notational convenience, we denote scalars by
lowercase letters (e.g. $x \in \mathbb{R}$) and vectors by uppercase letters
(e.g. $X \in \mathbb{R}^N$). To compute the stochastic representation
$\Psi_r \in \{0, 1\}^M$ of a real number $r \in [0, 1]$, a
Bernoulli sequence of M binary bits is generated such that
the probability of any one of these bits being 1 is equal to
r, i.e., $P(\Psi_r^k = 1) = r$ for $k = 1, 2, \ldots, M$.
Using this representation, the product of two real numbers
c = r×b, where b ∈ [0, 1], can be computed using the bit-wise
AND operation on the Bernoulli sequences as
$$P(\Psi_c^k = 1) = P(\Psi_r^k = 1) \wedge P(\Psi_b^k = 1) \quad \forall k = 1:M \tag{2}$$
and
$$c \sim E[\Psi_c] = E[\Psi_r \wedge \Psi_b] \tag{3}$$
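As a minimal software sketch of equations (2) and (3), the following Python snippet encodes two numbers in [0, 1] as Bernoulli sequences and estimates their product from the bit-wise AND. The function names `bernoulli_sequence` and `sc_multiply` are illustrative choices, and a software pseudo-random generator stands in for any hardware RNG; the two sequences are assumed to be generated independently.

```python
import random

def bernoulli_sequence(r, m, rng):
    """Encode r in [0, 1] as m bits, each equal to 1 with probability r."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must lie in [0, 1]")
    return [1 if rng.random() < r else 0 for _ in range(m)]

def sc_multiply(r, b, m, rng):
    """Estimate r * b as the mean of the bit-wise AND of two independent sequences."""
    psi_r = bernoulli_sequence(r, m, rng)
    psi_b = bernoulli_sequence(b, m, rng)
    return sum(x & y for x, y in zip(psi_r, psi_b)) / m

rng = random.Random(0)
estimate = sc_multiply(0.5, 0.8, 20000, rng)
print(f"exact = {0.5 * 0.8:.3f}, stochastic estimate = {estimate:.3f}")
```

The estimate is unbiased but noisy; its standard deviation shrinks roughly as the inverse square root of the sequence length M, which is why short sequences trade accuracy for fewer cycles.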
Equation (3) thus replaces the expensive floating point mul-
tiplications with bitwise AND operations and subsequent
summation operations. However, the range of numbers being
multiplied in DNNs is usually not confined to [0, 1]. Equations
(2) and (3) can be generalized to numbers of arbitrary range;
we illustrate this assuming X and ∆ both have N elements
and weight update is determined using VVOP computation as
given by equation (1). Assuming that the vector X lies in the
range of [−xmax, xmax] and similarly the error gradient ∆
lies in the range [−δmax, δmax], we first normalize both X
and ∆ vectors to constrain their values to [−1, 1] as,
$$\bar{X} = \frac{X}{x_{max}}, \qquad \bar{\Delta} = \frac{\Delta}{\delta_{max}} \tag{4}$$
Next, we denote the stochastic representation of all elements
in a vector X as $\Psi_X \in \{0, 1\}^{N \times M}$. For computing the
Bernoulli sequences $\Psi_{\bar{X}}$ and $\Psi_{\bar{\Delta}}$ in hardware, we can
implement a random number generator (RNG) to sample from
the uniform distribution on [0, 1] and compare the normalized
real number with the sampled random number:
$$\Psi_{\bar{\zeta}}^{i,k} = |\bar{\zeta}_i| \ge RNG_{\bar{\zeta}}^{k} \quad \forall k = 1:M \tag{5}$$
In (5), $\Psi_{\bar{\zeta}}^{i,k}$ is the kth Bernoulli event of the ith element of
$\bar{\zeta}$, obtained by comparing the ith element of $\bar{\zeta}$ with the kth sample
from the corresponding random number generator $RNG_{\bar{\zeta}}^{k}$, where
ζ is X or ∆. We can approximate the product in equation (1)
using SC as,
$$\delta W^{ji} = \mathrm{sign}(\delta W^{ji}) \times F_{scale} \times \sum_{k=1:M} \Psi_{\bar{\Delta}}^{j,k} \wedge \Psi_{\bar{X}}^{i,k} \tag{6}$$
where the parameter Fscale is defined as (xmax × δmax/M).
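The choice of Fscale can be checked with a short expectation argument. The derivation below is a sketch that assumes the Bernoulli sequences of ∆ and X are independent and that each bit follows equation (5) with a uniform random number; it uses only the symbols of equations (4)-(6).

```latex
\begin{aligned}
E\Big[\sum_{k=1}^{M} \Psi_{\bar{\Delta}}^{j,k} \wedge \Psi_{\bar{X}}^{i,k}\Big]
  &= \sum_{k=1}^{M} P(\Psi_{\bar{\Delta}}^{j,k} = 1)\, P(\Psi_{\bar{X}}^{i,k} = 1)
   = M\, |\bar{\Delta}_j|\, |\bar{X}_i| \\
F_{scale} \cdot M\, |\bar{\Delta}_j|\, |\bar{X}_i|
  &= \frac{x_{max}\, \delta_{max}}{M} \cdot M \cdot
     \frac{|\Delta_j|}{\delta_{max}} \cdot \frac{|X_i|}{x_{max}}
   = |\Delta_j|\, |X_i|
\end{aligned}
```

Multiplying by the sign term in equation (6) then recovers the product of the jth element of ∆ and the ith element of X in expectation.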
From equations (4)-(6), it is clear that an SC-based multiplier
implementation for VVOP calculation presents the following
challenges: (i) determination of the maximum elements xmax
and δmax of the vectors X and ∆ respectively in equation (4),
requiring O(N) floating point comparisons; (ii) normalization
in equation (4), requiring O(N) floating point division operations;
(iii) the computation in equation (5), requiring O(MN) random
number generations; and (iv) scaling by Fscale in equation (6),
requiring $O(N^2)$ FP multiplication operations. We now discuss
several techniques to address these challenges.
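To make the four costs concrete, the unoptimized procedure of equations (4)-(6) can be sketched in plain Python as follows. This is an illustrative software model rather than the hardware design: the function name `sc_outer_product_baseline` is hypothetical, and a fresh software random number is drawn for every element and every bit.

```python
import random

def sc_outer_product_baseline(delta, x, m, rng):
    """Approximate delta * x^T with equations (4)-(6), without any optimization."""
    x_max = max(abs(v) for v in x)            # challenge (i): O(N) comparisons
    d_max = max(abs(v) for v in delta)
    if x_max == 0.0 or d_max == 0.0:
        return [[0.0] * len(x) for _ in delta]
    x_bar = [v / x_max for v in x]            # challenge (ii): O(N) divisions, eq. (4)
    d_bar = [v / d_max for v in delta]
    # challenge (iii): one random number per element and per bit, eq. (5)
    psi_x = [[1 if abs(v) >= rng.random() else 0 for _ in range(m)] for v in x_bar]
    psi_d = [[1 if abs(v) >= rng.random() else 0 for _ in range(m)] for v in d_bar]
    f_scale = x_max * d_max / m
    dw = []
    for dj, pd in zip(delta, psi_d):
        row = []
        for xi, px in zip(x, psi_x):
            count = sum(a & b for a, b in zip(pd, px))
            sign = -1.0 if (dj < 0) != (xi < 0) else 1.0
            row.append(sign * f_scale * count)  # challenge (iv): O(N^2) FP multiplies, eq. (6)
        dw.append(row)
    return dw

rng = random.Random(1)
delta = [0.02, -0.01]
x = [1.5, -3.0, 0.0]
approx = sc_outer_product_baseline(delta, x, 4096, rng)
exact = [[d * v for v in x] for d in delta]
```

With a long sequence such as M = 4096, `approx` should lie close to `exact`; the sequence lengths studied in this paper are much shorter.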
III. OPTIMIZATION OF THE SC-BASED MULTIPLIER FOR
THE DESIGN OF ESSOP ARCHITECTURE
A. Eliminating the normalization operations
As discussed above, the normalization operation introduces
O(N) floating point divisions. This can be addressed by the
following improvement. Consider a number z that lies in the
range [0, 1] and another real number y that is obtained by
using a constant positive scaling factor ymax as y = z × ymax.
A Bernoulli representation $\Psi_z^k$ of z is obtained as
$z \ge RNG_z^k \;\; \forall k = 1:M$, and that of y is
$$\Psi_y^k = [\, y \ge y_{max} \times RNG_z^k \,] \quad \forall k = 1:M \tag{7}$$
In equation (7), the RNG can be realized by using a linear
feedback shift register (LFSR) circuit to generate p-bit pseudo-random
numbers. Notably, the hardware realization of equation (7)
does not require a multiplication with ymax and can
be realized by sampling a few bits from the LFSR. For example,
in a floating point representation, the power of 2 in ymax can
be used as the exponent and $RNG_z^k$ as the mantissa to compute
$y_{max} \times RNG_z^k$ without any floating point multiplications.
Alternatively, in a fixed point representation, only a fraction of
the bits generated by the LFSR needs to be used to eliminate
the fixed point multiplications in equation (7). For instance, if
ymax requires only 8 bits in the fixed point representation,
8 bits could be sampled from the p-bit LFSR to compute
$y_{max} \times RNG_z^k$. This eliminates the need for O(N) expensive
divisions irrespective of the numerical representation of y.
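The equivalence behind equation (7) can be illustrated in software. The sketch below assumes, for illustration only, that the scaling factor is taken as a power of two 2^e that bounds the largest magnitude in the vector, so that placing the random number at exponent e (here via `math.ldexp`) replaces both the division and the multiplication. The helper names are hypothetical.

```python
import math
import random

def bits_with_division(y, e, thresholds):
    """Reference path: normalize |y| by 2**e, then compare with random numbers in [0, 1)."""
    z = abs(y) / math.ldexp(1.0, e)
    return [1 if z >= t else 0 for t in thresholds]

def bits_without_division(y, e, thresholds):
    """Equation (7): compare |y| with the random number moved to exponent e."""
    return [1 if abs(y) >= math.ldexp(t, e) else 0 for t in thresholds]

rng = random.Random(0)
values = [0.75, -2.5, 3.1, 0.001]
y_max = max(abs(v) for v in values)
_, e = math.frexp(y_max)      # y_max < 2**e; an assumed power-of-two range bound
thresholds = [rng.random() for _ in range(16)]
identical = all(
    bits_with_division(y, e, thresholds) == bits_without_division(y, e, thresholds)
    for y in values
)
print(e, identical)
```

Because scaling by a power of two only changes the exponent of a binary floating point number, both paths produce the same bits for values in the normal range, which is why no multiplier or divider is needed for this step.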
B. Reusing the generated random numbers
The next hurdle for an SC-based multiplier is the requirement
to create O(MN) uniformly distributed p-bit random
numbers. In order to efficiently utilize the generated random
numbers, we propose to reuse the random numbers, for the
computations in equation (5), N times by generating only
M random numbers. Hence, we generate only 2M random
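[Fig. 2 graphic. Panel (a): unit cell with 1-bit inputs Ψ_Δ^{j,k}, Ψ_X^{i,k}, sign(Xi) and sign(Δj), an XOR, a counter, shift logic driven by F̃scale, and a latch; labelled bit-widths include 4 bits, 5 bits and 16 bits. Panel (b): shift logic output assembled from the sign bit (XOR), exponent bits (F̃scale) and mantissa bits (counter output).]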
Fig. 2. The high-level design of a generalized ESSOP multiplier (unit cell)
that can operate on two floating point numbers. (a) Internal blocks and the
bit-lengths are shown for a unit cell implementation with FP16 inputs (X
and ∆) and 16-bit sequence length. Superscripts i, j denote the index of a
real number in a vector and k denotes the index of a Bernoulli sample in a
stochastic representation. (b) Example implementation of the shift logic for a
floating point representation.
numbers from the LFSR instead of 2NM random numbers for
the 2N elements in X and ∆ combined. The first M random
numbers are used to generate the Bernoulli sequences of all
the elements in the vector X and the remaining M random
numbers are used to generate the Bernoulli sequences of all
the elements in the vector ∆. For example, to generate 8-bit
long Bernoulli sequences corresponding to 256-element long
vectors X and ∆, only 16 random numbers are generated. The
first 8 random numbers will be used to generate the Bernoulli
sequence of all elements in the vector X and the remaining 8
for that of vector ∆. With this modification, random number
generation complexity is reduced to O(M), making it independent
of the dimensions of the weight matrix. As our detailed
network simulations indicate, unintended correlations are not
introduced by reusing random numbers, as the two Bernoulli
sequences in equation (6) are uncorrelated.
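The reuse scheme can be sketched as follows for the 256-element, 8-bit example given above. `CountingRNG` is a hypothetical wrapper around Python's software generator that stands in for the LFSR and merely counts how many numbers are drawn; the multiplication `scale * t` is kept for readability and would be replaced in hardware by the exponent trick of Section III-A.

```python
import random

class CountingRNG:
    """Software stand-in for the LFSR that counts the random numbers drawn."""
    def __init__(self, seed):
        self._rng = random.Random(seed)
        self.calls = 0

    def random(self):
        self.calls += 1
        return self._rng.random()

def encode_vector_shared(values, scale, thresholds):
    """Bernoulli sequences of all elements, each compared with the same M thresholds."""
    return [[1 if abs(v) >= scale * t else 0 for t in thresholds] for v in values]

n, m = 256, 8
data_rng = random.Random(0)
x = [data_rng.uniform(-1.0, 1.0) for _ in range(n)]
delta = [data_rng.uniform(-0.01, 0.01) for _ in range(n)]

lfsr = CountingRNG(seed=1)
thresholds_x = [lfsr.random() for _ in range(m)]   # first M numbers, shared by all of X
thresholds_d = [lfsr.random() for _ in range(m)]   # remaining M numbers, shared by all of Delta
psi_x = encode_vector_shared(x, max(abs(v) for v in x), thresholds_x)
psi_d = encode_vector_shared(delta, max(abs(v) for v in delta), thresholds_d)
print(lfsr.calls)   # 2 * M numbers in total, independent of N
```

In this sketch the sequences of two elements of the same vector are correlated because they share thresholds, but every AND in equation (6) pairs a sequence of ∆ with a sequence of X, and those two are drawn from different random numbers.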
C. Approximating the scaling operations
Scaling the result of the AND operation in equation (6)
with Fscale is an $O(N^2)$ operation involving full-precision
multiplications. To efficiently realize this scaling operation in
hardware, we propose to replace Fscale by the largest power of 2
that does not exceed it, obtained as shown below,
$$\tilde{F}_{scale} = 2^{\lfloor \log_2(F_{scale}) \rfloor} \tag{8}$$
where $\lfloor x \rfloor$ denotes the largest integer not greater than x. Using
$\tilde{F}_{scale}$ instead of Fscale makes the computations in equation (6)
straightforward as only bit shift operations are required.
Usually, Fscale in DNNs will be smaller than 1 and
hence such a bit-shift operation will mostly be a right shift
operation. In the case of the stochastic gradient descent optimizer,
note that the learning rate can also be accommodated in the
Fscale computation.
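A small sketch of equation (8) in Python is given below. The helper names are illustrative; `math.frexp` is used here to read the binary exponent directly instead of evaluating a logarithm, and `math.ldexp` plays the role of the bit shift.

```python
import math

def pow2_scale_exponent(f_scale):
    """Return e such that 2**e is the largest power of two not exceeding f_scale (eq. 8)."""
    if f_scale <= 0.0:
        raise ValueError("f_scale must be positive")
    # frexp gives f_scale = mant * 2**exp with 0.5 <= mant < 1, so floor(log2) = exp - 1
    return math.frexp(f_scale)[1] - 1

def scale_count(count, exponent):
    """Scale an integer counter value by 2**exponent, i.e. a shift instead of a multiply."""
    return math.ldexp(float(count), exponent)

x_max, delta_max, m = 3.0, 0.02, 16
f_scale = x_max * delta_max / m            # 0.00375
e = pow2_scale_exponent(f_scale)           # 2**-9 <= 0.00375 < 2**-8
print(e, scale_count(5, e), 5 * f_scale)
```

Since the approximated factor always lies in the interval (Fscale/2, Fscale], each weight update is shrunk by at most a factor of two, which acts like a per-layer reduction of the effective learning rate.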
IV. THE ESSOP ARCHITECTURE
A. Unit cell design
We leverage innovations from Section III to develop the
architecture of a single SC-based multiplier that we refer to as
an ESSOP unit cell, shown in Fig. 2(a). At the periphery of the
unit cell, M-bit stochastic sequences $\Psi_X^i$ and $\Psi_\Delta^j$ are computed
for two input real numbers $X_i$ and $\Delta_j$ respectively. The unit
cell receives two inputs, each represented with 2 bits: the first
bit is the current bit of the stochastic representation of a real
number and the second bit is the sign of that real number. The
sign of the final product is computed using a 1-bit XOR on
the sign bits of the two real numbers. In our design, we assume
a simple 2-input AND gate that is used M times to compute
the M -bit representation of SC-based multiplication. For each
cycle out of M cycles, the output of the AND gate is fed
to a counter that counts the number of 1s in the resulting
sequence. In the periphery of the unit cell, $\tilde{F}_{scale}$ is computed,
which is used by the shift logic circuit to scale the output
of the counter to a desired range. The shift logic depends
on the digital representation used for the input real numbers.
For example, for a floating point representation as shown in
Fig. 2(b), the shift logic is as simple as copying the output of
the XOR to the sign bit position, the counter output to the
mantissa position and the $\tilde{F}_{scale}$ value to the exponent position.
Finally, the result of the shift logic circuit is stored in a latch
(or in a desired memory location) for the weight update. Note
that the SC-based multiplier requires only M clock cycles to
compute one product.
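The behaviour of one unit cell can be summarized by the following Python sketch. It is a functional model under stated assumptions, not the circuit itself: the function name `essop_unit_cell` is hypothetical, the Bernoulli sequences and the exponent of the power-of-two scale are assumed to be supplied by the periphery, and the shift logic is modelled with `math.ldexp` rather than by assembling FP16 bit fields.

```python
import math

def essop_unit_cell(psi_x, psi_delta, sign_x, sign_delta, scale_exponent):
    """Functional sketch of one unit cell: AND gate, counter, XOR sign and shift.

    psi_x, psi_delta: M-bit Bernoulli sequences (lists of 0/1).
    sign_x, sign_delta: sign bits of the two operands (1 means negative).
    scale_exponent: integer e such that the scale factor equals 2**e.
    """
    if len(psi_x) != len(psi_delta):
        raise ValueError("both sequences must have the same length M")
    counter = 0
    for bit_x, bit_d in zip(psi_x, psi_delta):    # one AND evaluation per clock cycle
        counter += bit_x & bit_d
    sign_out = sign_x ^ sign_delta                 # 1-bit XOR of the sign bits
    magnitude = math.ldexp(float(counter), scale_exponent)   # shift instead of multiply
    return -magnitude if sign_out else magnitude

result = essop_unit_cell([1, 0, 1, 1], [1, 1, 0, 1],
                         sign_x=0, sign_delta=1, scale_exponent=-3)
print(result)   # the counter reaches 2, so the output is -(2 * 2**-3) = -0.25
```

The loop body corresponds to one clock cycle, so a sequence of length M takes M iterations, in line with the cycle count stated above.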
B. The multi-cell architecture of ESSOP
To compute all the elements of an outer product matrix,
either a unit cell can be multiplexed or multiple unit cells can
be used simultaneously. The proposed ESSOP architecture
has multiple unit cells stacked in a single row. The high-
level architecture of ESSOP is shown in Fig. 3, and has N
unit cells (U) arranged in a row. At the periphery of the unit
cells, all inputs have their corresponding comparator (C). Each
comparator receives two inputs, one from either an element
of an activation vector (X) or error gradient vector (∆),
second from a corresponding random number generator (R).
The comparator generates one stochastic bit by comparing two
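[Fig. 3 graphic. A row of N unit cells U1 ... UN with comparators CX1 ... CXN for the activation vector elements X1 ... XN and a comparator CΔ for the jth error gradient Δj; random number generators RX and RΔ with exponents EX and EΔ; configuration register G; each comparator outputs one Bernoulli bit; the outputs are the outer product matrix elements δW^{j,1} ... δW^{j,N}; bit assembly from sign, exponent and mantissa bits.]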
Fig. 3. The ESSOP architecture for computing the N multiplications in
parallel with the N SC-based multipliers, assuming X and ∆ have floating
point 16-bit precision. R, C and U stand for random number generator,
comparator and unit cell respectively. This architecture can be reused to
compute one full outer product. This also represents the high-level architecture
used for our silicon design.
[Fig. 4 graphic. Box plot; vertical axis: Accuracy (in %).]
Fig. 4. The accuracy of the ResNet-32 network trained with ESSOP, with the
VVOP inputs represented using 16, 8, and 2-bit long Bernoulli sequences.
A 16-bit sequence is enough to achieve baseline comparable accuracy and the 2-bit
representation suffers only a 2.6% drop on average compared to the baseline.
Each box plot shows five independent runs with different seeds.
inputs at a time. In the M -bit implementation of the ESSOP,
several such random numbers could be fed to the comparator
circuit to compute M comparisons in total, resulting in an
M -bit long stochastic sequence. As discussed in Section III-B,
there are exactly two RNGs, one for X and another for ∆, that
generate M random numbers each for every outer product.
V. NUMERICAL VALIDATION
We train the ResNet-32 [14] network on the CIFAR-10 [21]
dataset to validate the ESSOP architecture. We use ESSOP to
compute weight updates in all the 33 convolution layers and
in the final layer of the ResNet-32 network. The CIFAR-10
dataset has 50K images in the training set and 10K images
in the test set. Each image in the dataset has a resolution of
32×32 pixels and three channels and belongs to one of the ten
classes. We preprocess CIFAR-10 images by implementing
the commonly used image processing steps for the family
of residual networks as reported in [22]. The simulation was
performed with FP16 precision for data and computation, with
a mini-batch size of 100 images for 200 epochs. We used
initial learning rate (LR) of 0.1 with LR evolution (LRE) for
baseline as in [14]; LRE is tuned for better accuracy in ESSOP
implementation. The categorical cross-entropy loss function is
minimized using stochastic gradient descent with momentum
of 0.9. In our results, we denote ESSOP16(M) to indicate
FP16 precision for input operands (X and ∆) represented
using M -bit Bernoulli sequences. Fig. 4 shows the accuracy
of ResNet-32 as a function of ESSOP16 sequence length. The
test accuracy drop with ESSOP16(16) is only 0.25% compared
to the baseline. Experiments on different sequence lengths
indicate that 16-bits is sufficient to achieve close to baseline
accuracy with FP16 outer product. ESSOP16(16) shows an
average drop of 0.73% in the test accuracy compared to the
baseline accuracy. ESSOP16(8) has an average drop of 1.13%
in test accuracy compared to the baseline. Remarkably,
ESSOP16(2) has an average drop of only 2.6% in the test
accuracy compared to the baseline. It is important to note that
with a sequence length of 2 bits, it is possible to compute the
weight update in just 2 clock cycles.
VI. POST LAYOUT RESULTS
In this section, we present the details of hardware implemen-
tation of ESSOP16 and compare its post route layout perfor-
| Physical characteristics | ESSOP16(16) | FP16 |
|---|---|---|
| Technology | Samsung 14nm LPP | Samsung 14nm LPP |
| Routing | 14 metal layers | 14 metal layers |
| Core voltage | 0.72V | 0.72V |
| Design specific physical characteristics | | |
| Core area (µm2) | 2676 | 43552 |
| Logic complexity (kilo gate equivalent, kGE) | 11 | 180 |
| Performance and efficiency | | |
| Max. clock frequency (GHz) | 2.54 | 1.34 |
| Power @ max frequency (mW) | 19.1 | 298 |
| Peak throughput (GOp/s) | 10.2 | 85.4 |
| Core energy efficiency (GOp/s/W) | 523 (82.2%) | 287 |
| Core area efficiency (GOp/s/MGE) | 920 (93.7%) | 475 |
Fig. 5. Physical and performance characteristics of ESSOP16(16) architecture
vs floating point 16-bits (FP16) multiplier array.
mance with that of a highly pipelined FP16 multiplier design
after place and route. FP16 design is an array of N = 64 FP16
multipliers (from Samsung’s Low Power Plus (LPP) library)
that compute 64 FP16 multiplications in parallel. Similarly,
ESSOP16 design has 64 unit cells (U1 to U64) to compute 64
elements of an outer product matrix in parallel, as illustrated in
Fig. 3. Inputs X and ∆, which are in FP16 precision, are fed to
the corresponding comparator circuit (C1 to C64). Two RNGs
RX and R∆ corresponding to X and ∆ generate M random
numbers each. Each RNG generates only mantissa part of the
random number, exponent (EX or E∆) is derived from the
absolute maximum number in the vector X or ∆ and sign bit
is derived from another input of a corresponding comparator.
Configuration parameters such as the Bernoulli sequence length
are stored in the configuration register G. Post place and route
results at 14 nm using Samsung LPP libraries are shown in
Fig. 5. These results indicate that the ESSOP16(16) design, even
though sequential and not pipelined, operates at a 1.9× higher
frequency and achieves 82.2% and 93.7% better energy and
area efficiency respectively, compared to the FP16 multiplier
array for outer product computation.
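The headline numbers can be cross-checked from the values in Fig. 5 with a few lines of arithmetic. The sketch below assumes that each of the 64 ESSOP unit cells delivers one product every 16 cycles and that each of the 64 pipelined FP16 multipliers delivers one product per cycle; the latter is an assumption made here for the check, and the function names are illustrative.

```python
def peak_throughput_gops(num_units, clock_ghz, cycles_per_product):
    """Products per nanosecond (GOp/s) for num_units parallel multipliers."""
    return num_units * clock_ghz / cycles_per_product

def gain_percent(value, reference):
    """Relative improvement of value over reference, in percent."""
    return (value / reference - 1.0) * 100.0

essop_gops = peak_throughput_gops(64, 2.54, cycles_per_product=16)
fp16_gops = peak_throughput_gops(64, 1.34, cycles_per_product=1)   # assumed: one product per cycle
energy_gain = gain_percent(523, 287)     # core energy efficiency, GOp/s/W
area_gain = gain_percent(920, 475)       # core area efficiency, GOp/s/MGE
clock_ratio = 2.54 / 1.34
print(f"{essop_gops:.1f} {fp16_gops:.1f} {energy_gain:.1f} {area_gain:.1f} {clock_ratio:.1f}")
```

Under these assumptions the ESSOP throughput comes out at about 10.2 GOp/s and the FP16 throughput at about 85.8 GOp/s, close to the reported 85.4 GOp/s, while the efficiency ratios reproduce the 82.2% and 93.7% gains and the 1.9× clock ratio.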
VII. CONCLUSION
We proposed an efficient hardware architecture ESSOP that
facilitates training of deep neural networks. The central idea
is to efficiently implement the vector-vector outer product
calculation associated with the weight updates using stochastic
computing. We proposed efficient schemes to implement the
stochastic computing-based multipliers that can generalize to
operands with unbounded magnitude and significantly reduce
the computational cost by re-using random numbers. This ad-
dresses a significant performance bottleneck for training DNNs
in hardware, particularly for applications with stringent con-
straints on area and energy. ESSOP complements architectures
that accelerate matrix-vector multiply operations associated
with the forward and backpropagations where weights are
represented in low precision or stored in computational memory
based crossbar array architectures. We evaluated ESSOP
on a 32-layer deep CNN that achieves baseline comparable
accuracy for a sequence length of 16 bits. 14 nm place and
route of the ESSOP architecture compared with FP16 design
shows 82.2 % and 93.7 % improvement in energy and area
efficiency respectively for outer product computation.
ACKNOWLEDGMENT
This project was supported partially by the Semiconductor
Research Corporation.
REFERENCES
[1] M. Courbariaux, Y. Bengio, and J. David, “Binaryconnect: Training
deep neural networks with binary weights during propagations,” CoRR,
vol. abs/1511.00363, 2015. [Online]. Available: http://arxiv.org/abs/1511.00363
[2] F. Li and B. Liu, “Ternary weight networks,” CoRR, vol.
abs/1605.04711, 2016. [Online]. Available: http://arxiv.org/abs/1605.04711
[3] G. W. Burr, R. M. Shelby, A. Sebastian, S. Kim, S. Kim, S. Sidler,
K. Virwani, M. Ishii, P. Narayanan, A. Fumarola et al., “Neuromorphic
computing using non-volatile memory,” Advances in Physics: X, vol. 2,
no. 1, pp. 89–124, 2017.
[4] V. Joshi, M. L. Gallo, I. Boybat, S. Haefeli, C. Piveteau,
M. Dazzi, B. Rajendran, A. Sebastian, and E. Eleftheriou, “Accurate
deep neural network inference using computational phase-change
memory,” CoRR, vol. abs/1906.03138, 2019. [Online]. Available:
http://arxiv.org/abs/1906.03138
[5] A. Sebastian, I. Boybat, M. Dazzi, I. Giannopoulos, V. Jonnalagadda,
V. Joshi, G. Karunaratne, B. Kersting, R. Khaddam-Aljameh, S. R.
Nandakumar, A. Petropoulos, C. Piveteau, T. Antonakopoulos, B. Ra-
jendran, M. L. Gallo, and E. Eleftheriou, “Computational memory-based
inference and training of deep neural networks,” in 2019 Symposium on
VLSI Technology, June 2019, pp. T168–T169.
[6] A. Sebastian, M. Le Gallo, G. W. Burr, S. Kim, M. BrightSky, and
E. Eleftheriou, “Tutorial: Brain-inspired computing using phase-change
memory devices,” Journal of Applied Physics, vol. 124, no. 11, p.
111101, 2018.
[7] K. Ando, K. Ueyoshi, K. Orimo, H. Yonekawa, S. Sato, H. Nakahara,
S. Takamaeda-Yamazaki, M. Ikebe, T. Asai, T. Kuroda, and M. Mo-
tomura, “BRein memory: A single-chip binary/ternary reconfigurable
in-memory deep neural network accelerator achieving 1.4 TOPS at 0.6
W,” IEEE Journal of Solid-State Circuits, vol. 53, no. 4, pp. 983–994,
April 2018.
[8] S. R. Nandakumar, M. L. Gallo, I. Boybat, B. Rajendran, A. Sebastian,
and E. Eleftheriou, “Mixed-precision training of deep neural networks
using computational memory,” CoRR, vol. abs/1712.01192, 2017.
[Online]. Available: http://arxiv.org/abs/1712.01192
[9] E. Nurvitadhi, D. Sheffield, Jaewoong Sim, A. Mishra, G. Venkatesh,
and D. Marr, “Accelerating binarized neural networks: Comparison of
FPGA, CPU, GPU, and ASIC,” in 2016 International Conference on
Field-Programmable Technology (FPT), Dec 2016, pp. 77–84.
[10] T. Gokmen and Y. Vlasov, “Acceleration of deep neural network training
with resistive cross-point devices,” CoRR, vol. abs/1603.07341, 2016.
[Online]. Available: http://arxiv.org/abs/1603.07341
[11] M. Courbariaux and Y. Bengio, “Binarynet: Training deep neural
networks with weights and activations constrained to +1 or -1,” CoRR,
vol. abs/1602.02830, 2016. [Online]. Available: http://arxiv.org/abs/1602.02830
[12] Z. Lin, M. Courbariaux, R. Memisevic, and Y. Bengio, “Neural
networks with few multiplications,” CoRR, vol. abs/1510.03009, 2015.
[Online]. Available: http://arxiv.org/abs/1510.03009
[13] V. Vanhoucke, A. Senior, and M. Z. Mao, “Improving the speed of
neural networks on CPUs,” in Deep Learning and Unsupervised Feature
Learning Workshop, NIPS 2011, 2011.
[14] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” CoRR, vol. abs/1512.03385, 2015. [Online]. Available:
http://arxiv.org/abs/1512.03385
[15] G. Huang, Z. Liu, and K. Q. Weinberger, “Densely connected
convolutional networks,” CoRR, vol. abs/1608.06993, 2016. [Online].
Available: http://arxiv.org/abs/1608.06993
[16] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov,
D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with
convolutions,” CoRR, vol. abs/1409.4842, 2014. [Online]. Available:
http://arxiv.org/abs/1409.4842
[17] T. Gokmen, M. Onen, and W. Haensch, “Training deep convolutional
neural networks with resistive cross-point devices,” Frontiers in
Neuroscience, vol. 11, p. 538, 2017. [Online]. Available: https://www.frontiersin.org/article/10.3389/fnins.2017.00538
[18] A. Alaghi and J. P. Hayes, “Survey of stochastic computing,” ACM
Trans. Embed. Comput. Syst., vol. 12, no. 2s, pp. 92:1–92:19, May
2013. [Online]. Available: http://doi.acm.org/10.1145/2465787.2465794
[19] B. R. Gaines, “Stochastic computing,” in Proceedings of the April
18-20, 1967, Spring Joint Computer Conference, ser. AFIPS ’67
(Spring). New York, NY, USA: ACM, 1967, pp. 149–156. [Online].
Available: http://doi.acm.org/10.1145/1465482.1465505
[20] W. J. Poppelbaum, C. Afuso, and J. W. Esch, “Stochastic computing
elements and systems,” in Proceedings of the November 14-16,
1967, Fall Joint Computer Conference, ser. AFIPS ’67 (Fall). New
York, NY, USA: ACM, 1967, pp. 635–644. [Online]. Available:
http://doi.acm.org/10.1145/1465611.1465696
[21] A. Krizhevsky, “Learning multiple layers of features from tiny images,”
University of Toronto, 05 2012.
[22] T. Devries and G. W. Taylor, “Improved regularization of convolutional
neural networks with cutout,” CoRR, vol. abs/1708.04552, 2017.
[Online]. Available: http://arxiv.org/abs/1708.04552
