Xcel-RAM: Accelerating Binary Neural Networks in High-Throughput SRAM
  Compute Arrays by Agrawal, Amogh et al.
Amogh Agrawal*, Akhilesh Jaiswal*, Deboleena Roy, Bing Han, Gopalakrishnan Srinivasan, Aayush Ankit,
Kaushik Roy, Fellow, IEEE
Abstract—Deep neural networks are a biologically-inspired class
of algorithms that have recently demonstrated state-of-the-art
accuracy in large scale classification and recognition tasks.
Hardware acceleration of deep networks is of paramount impor-
tance to ensure their ubiquitous presence in future computing
platforms. Indeed, a major landmark that enables efficient
hardware accelerators for deep networks is the recent advances
from the machine learning community that have demonstrated
the viability of aggressively scaled deep binary networks. In
this paper, we demonstrate how deep binary networks can be
accelerated in modified von-Neumann machines by enabling
binary convolutions within the SRAM array. In general, binary
convolutions consist of a bit-wise XNOR followed by a population
count (popcount). We present two proposals: one based on a
charge-sharing approach to perform vector XNORs and approximate
popcount, and another based on bit-wise XNORs
followed by a digital bit-tree adder for accurate popcount. We
highlight the various trade-offs in terms of circuit complexity,
speed-up and classification accuracy for both the approaches.
Two key techniques presented in the manuscript are the
use of a low-precision, low-overhead ADC to achieve a fairly
accurate popcount for the charge-sharing scheme, and the
sectioning of the SRAM array by adding switches onto
the read-bitlines, thereby achieving improved parallelism. Our
results on a benchmark image classification dataset CIFAR-
10 on a binarized neural network architecture show energy
improvements of 6.1× and 2.3× for the two proposals, compared
to conventional SRAM banks. In terms of latency, improvements
of 15.8× and 8.1× were achieved for the two respective proposals.
Index Terms—In-memory computing, SRAM, binary convolu-
tion, binary neural networks, deep-CNNs.
I. INTRODUCTION
Deep convolutional neural networks (CNNs) have been
established as the state-of-the-art for recognition and clas-
sification tasks [1], [2], often surpassing human capabilities
[3]–[5]. Most popular networks that won the ImageNet [6]
challenge, such as AlexNet [7], GoogLeNet [8], ResNet [4],
etc., are based on deep CNNs. However, hardware running
these networks consumes large amounts of energy, in fact,
orders of magnitude more than the human brain [9]. This
immense energy gap stems from the underlying architecture
of the current state-of-the-art hardware implementations, which
are variants of the von-Neumann machine [10]. They contain
physically separate computation and memory blocks,
connected via a system bus. Although this architecture has
worked wonders for general-purpose computing tasks, when
it comes to deep CNNs and data intensive applications in
general, frequent data transfers between the memory and the
computation unit become a bottleneck, given the limited
bandwidth of the bus. Moreover, since each transaction is
expensive, a large power penalty is incurred per memory
access.
The authors are with the School of Electrical and Computer Engineering,
Purdue University, West Lafayette, IN-47907, USA
(* These authors contributed equally)
Recent developments in the neural network community have
identified these problems and have come up with simpler
memory-friendly algorithms. Binary neural networks [11]–
[13] and XNOR-nets [14] have recently been developed and
have shown great potential. The idea is to reduce the precision of
input activations and the network weights to a single bit. This
greatly simplifies the computations to Boolean bit-wise
operations, with only minimal degradation in the state-of-the-art
accuracies. Convolution, the most power-hungry
operation in neural networks, is thereby reduced to a bit-wise XNOR
followed by a population count (popcount) of the XNORed
output. This opens pathways for adopting new simplified binary
in-memory computing paradigms for accelerating neural
networks. As shown in [15]–[17], bit-wise Boolean operations
including XORs or XNORs as well as non-Boolean vector-
matrix dot-products can easily be incorporated within standard
SRAM arrays. Such SRAM based in-memory computations
open up new possibilities of augmenting the existing memory
arrays with compute capabilities. Thereby, one can imagine
a modified von-Neumann machine, which can cater well to
general purpose computing tasks as well as act as an on-demand
compute accelerator.
To that effect, we propose novel techniques to compute in-
memory binary convolutions, as an added functionality to the
standard 10-transistor (T) SRAM bitcells. In the first approach
(Proposal-A), we use charge-sharing between the parasitic
capacitances present inherently in the SRAM array to perform
the XNOR and popcount operations involved in the binary
convolution. Although this approach is digital, with binary
weights and binary inputs stored in the memory array, the
popcount is generated as an analog voltage on the source-
lines. In order to sense this analog voltage, we propose a
low-overhead and low-precision ADC (owing to area and
energy constraints in the memory array). Another key highlight
of this approach is that we employ a sectioned-SRAM by
dividing memory sub-banks into smaller sections. With n
sections in a particular sub-bank, we can accomplish n binary
convolutions in parallel. This is important because obtaining
the popcount output for large kernels is non-trivial. For large
networks, the kernel sizes in deeper layers are typically too
large to be stored in a single row of a given memory sub-
array. As such, popcount for larger binary networks inevitably
requires a scheme to estimate the partial popcount from each
row, which can then be summed up from different sub-arrays
to get the final popcount. However, the low-overhead and
low-precision ADC induces approximations in the popcount
output, which results in overall system accuracy degradation.
Thus, we propose another approach (Proposal-B), where we
alter the peripheral circuitry of the SRAM array and enable
two word-lines simultaneously. This approach, although not as
energy/throughput efficient as Proposal-A, performs accurate
XNOR and popcount operations, thereby not affecting the
overall system accuracy. The proposed circuit techniques in
Proposal-A and Proposal-B allow us to process multiple kernels
at once, thereby improving the overall system throughput
and making the proposal suitable for a range of deep binary
networks.
arXiv:1807.00343v2 [cs.ET] 22 Oct 2018
There have been several earlier works to develop hard-
ware platforms that can accelerate CNN algorithms. Hard-
ware architectures that use highly sub-banked memory units
feeding an array of multiply-accumulate processing elements
have been presented in many works including [18]–[20].
A key drawback of such customized designs based on distributed
processing arrays is that they make the underlying
computing hardware application specific and, in many cases,
specific to neural network accelerators. Further, emerging
technologies like memristive crossbars have been employed
in many proposals as convolution accelerators geared towards
neural networks in general [21], [22]. The very use of memristors
as a convolution engine renders such platforms unsuitable
for general purpose computing due to various challenges faced
by state-of-the-art memristive technologies. These include
the limited endurance of memristive devices, the multi-cycle
write-verify programming scheme [23] and the drift in the programmed
resistance state with aging [24]. More recently, [25]
demonstrated an analog approach to binary convolution using
charge-sharing. However, the work presented in [25] was limited
to smaller networks. This is because, with larger and more
complex networks, the inaccuracies in interfacing conventional
low-precision DACs/ADCs unacceptably degrade the network
accuracy.
The main highlights of the present work are as follows:
1) We present two novel techniques to compute binary con-
volutions. Proposal-A uses charge-sharing between the
parasitic capacitances inherently present within the stan-
dard 10T-SRAM array, to accomplish a fairly accurate
popcount operation. Proposal-B alters the SRAM periph-
eral circuitry to perform accurate in-memory XNOR and
popcount operations.
2) Further, we propose sectioned-SRAM to increase par-
allelism within the SRAM arrays, thereby improving
the computation throughput and energy-efficiency of the
binary convolution operation.
II. IN-MEMORY BINARY CONVOLUTION − PROPOSAL-A
Fig. 1. The 10 transistor SRAM cell featuring a differential decoupled
read-port comprising transistors M1-M2 and M1’-M2’. The write port is
constituted by write access transistors connected to WWL.
As discussed in the introduction, a convolution operation is
simplified to a bitwise XNOR, followed by a popcount of the
XNORed output in binary neural networks (BNNs). Although
a bitwise XNOR operation is simple to incorporate within the
memory, the popcount operation is not very straightforward.
We exploit the inherent SRAM structure, utilizing the internal
parasitic capacitances to perform the XNOR and popcount
of two vectors stored within the memory array. Although
our approach to binary convolution is digital, we sense an
analog voltage within the memory array to evaluate the popcount
output. Sensing analog voltages is, in general, difficult
without precise ADCs. The most common precise ADCs, such as
Flash ADCs and SAR type ADCs, require excessively large
power and area [26], making them unsuitable for memory
applications. Thus, we propose a dual read-wordline (Dual-
RWL) along with a dual-stage ADC to minimize the errors
in the popcount output. Further, we describe the sectioned-SRAM
technique to improve the throughput and the energy-efficiency
of the binary convolution. Since the same set of
inputs needs to be convolved with multiple kernels, each
section in sectioned-SRAM stores a different kernel while the
inputs are shared among all sections, thereby performing the
operations concurrently.
A. Circuit Description
We use a standard 10T-SRAM cell as the basic memory unit.
Fig. 1 shows a schematic of the 10T-SRAM cell, containing
the basic 6T-cell as the storage unit, along with transistors
M1-M2 and M1’-M2’ forming the differential read ports,
respectively. Writing into the cell is functionally similar to
the 6T write operation through the write-ports (WWL, BL,
BLB). For reading, RBL and RBLB are pre-charged to VDD,
SL is connected to ground, and RWL is enabled. If the bit-cell
stores a ‘1’ (Q = VDD, QB = 0V), RBL discharges to 0V and
RBLB holds its charge. Similarly, if the bit-cell stores a ‘0’ (Q
= 0V, QB = VDD), RBLB discharges to 0V and RBL holds
its charge. A differential sense amplifier senses the voltage
difference between RBL and RBLB to generate the output.
We use the inherent parasitic capacitances on RBLs, RBLBs
and SLs (CRBL, CRBLB and CSL, respectively) in the 10T-
SRAM structure to compute the binary convolution within the
memory array itself. The operation can be described in three
steps as follows:
a) Pseudo-read: A read operation is performed on a
row storing the binary vector inputs, say A1 (refer Fig. 2(a)).
Fig. 2. Illustration of the binary convolution operation within the 10T-SRAM array. a) Step 1: Pseudo-read. RBLs/RBLBs are pre-charged and RWL for a
row storing the input activation (A1) is enabled. Depending on the data A1, RBLs/RBLBs either discharge or stay pre-charged. The SAs are not enabled, in
contrast to a usual memory read. Thus, the charge on RBLs/RBLBs represents the data A1. b) Step 2: XNOR on SL. Once the charges on RBLs/RBLBs have
settled, RWL for the row storing the kernel (K1) is enabled. Charge sharing occurs between the RBLs/RBLBs and the SL, depending on the data K1. The
RBLs either deposit charge on the SL, or take away charge from SL. c) The truth table for Step 2 is shown. The pull-up and pull-down of the SL follow the
XNOR truth table. Moreover, since the SL is common along the row, the pull-ups and pull-downs are cumulative. Thus, the final voltage on SL represents
the XNOR + popcount of A1 and K1.
Fig. 3. a) Dual RWL technique. The left block shows the memory array with Dual RWL. Each row in the memory array consists of two read-wordlines
RWL1 and RWL2. Half of the cells along the row are connected to RWL1, while the other half are connected to RWL2. Only one of RWL1 and
RWL2 is enabled at a time, to ensure that only half of the cells participate in charge sharing at once, thereby reducing the number of voltage states on the SL
to be sensed. b) Dual-stage ADC scheme. The ADC consists of two dummy bitcells (only one shown), two SAs, counter and a control block. The ADC
control block generates the reference signals VREFN and VREFP , and SAE, which are fed to the two SAs. These are used in the first-stage of the ADC
to determine the sub-class (first 2bits of ADC output). It also generates the signals PCH0, PCH1, WLADC which operate on the dummy cells during
the second-stage, to either pump-in or pump-out charge from SL, depending on the sub-class. The counter counts the number of cycles during the process to
generate the final 3bits of the ADC output.
First, all RBLs/RBLBs are precharged to VDD, as in the usual
read operation. Next, when the RWL corresponding to the
row storing A1 is enabled, the precharged RBLs and RBLBs
discharge conditionally, depending on the data values, thereby
stabilizing at VDD or 0V. For the example shown in the figure,
the data stored is ‘1’ in both cells corresponding to the input
vector A1, thus, both RBLs discharge to 0V and RBLBs stay
at VDD. Note that the differential sense-amplifiers are not
enabled in this pseudo-read step.
b) XNOR on SL: After the pseudo-read operation, the
RBLs/RBLBs store the information of A1 as their respective
voltages. Now, the RWL of the row storing a weight kernel,
say K1, is enabled (refer Fig. 2(b)). Interestingly, this causes
charge-sharing between CRBL, CRBLB and CSL as shown in
the figure by the charge current paths. In the example, the
two cells corresponding to K1 store a ‘0’ and ‘1’ respectively.
Thus, when the RWL is enabled, charge flows into the SL
from M1’ in the left cell, while charge flows out of SL through
M1 in the right cell. This ‘pull-up’(↑) and ‘pull-down’(↓) of
the SL follows the XNOR operation of the data stored in the
cell (K1) and the RBL/RBLB charge (A1). With respect to
the example chosen above, one can observe that the first two
rows of the XNOR truth table of Fig. 2(c) are covered.
If the bits corresponding to the activation (A1) were ‘0’ and
‘0’, i.e., RBLB is at 0V while RBL is at VDD, then the charge
flows out of SL through M1’ in left cell, while it flows into
SL through M1 in right cell. This represents the bottom two
rows of the XNOR truth table. Thus, we perform a bitwise
XNOR operation between vectors A1 and K1, represented by
the charge stored on the line SL.
c) Popcount: Since the SL is shared by all the cells along
the row, these ‘pull-ups’ and ‘pull-downs’ are cumulative. As
can be seen from Fig. 2(c), an SL ‘pull-up’ corresponds to a
‘1’ in the output XNORed vector, while an SL ‘pull-down’
corresponds to a ‘0’ in the output XNORed vector. In order to
evaluate the popcount of the output vector, we need to count
the number of 1’s. More 1’s in the output vector implies more
‘pull-ups’ on SL, which in turn implies a higher voltage on
SL. Thus, the final SL voltage represents the popcount of the
output vector: (A1 XNOR K1). We boost the RWL voltage
such that the SL swing is from 0V to VDD. To sense this
analog voltage we use a charge-sharing based sequentially
integrating ADC, adopted from [25]. Note however, that this is
an approximate, low-precision ADC. Thus, in order to achieve
a fairly accurate estimate of the popcount of the entire row,
we use two techniques described in the next sub-section.
Fig. 4. The plot shows the final SL voltage with and without the Dual RWL
approach. A larger sense margin is obtained with our Dual RWL approach,
thus relaxing the constraints on the low-overhead ADC. Note that with Dual
RWL technique we restrict the distinct voltage levels on SL to 32 at a time,
instead of 64. However, the voltage swing on SL remains the same, thereby
increasing the sense margin between the states.
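The three steps above can be summarized in a purely behavioral Python sketch. It is not a circuit simulation: logic levels stand in for bitline voltages, the SL voltage is assumed to rise linearly with the number of pull-ups over a 0 V to VDD swing, and the names `pseudo_read`, `sl_pull_ups` and `ideal_sl_voltage` are illustrative rather than taken from the paper. The pull direction is expressed, as in the text, in terms of the kernel bit and the level left on RBL/RBLB by the pseudo-read.

```python
VDD = 1.0  # assumed supply voltage in volts (illustrative)

def pseudo_read(a_bits):
    """Step 1: (RBL, RBLB) levels per column after the pseudo-read of row A1.

    A stored '1' discharges RBL and leaves RBLB at VDD; a stored '0' does the
    opposite. Levels are logical: 1 means VDD, 0 means 0 V.
    """
    return [(0, 1) if a else (1, 0) for a in a_bits]

def sl_pull_ups(bitlines, k_bits):
    """Step 2: 1 where a column pulls SL up, 0 where it pulls SL down.

    A kernel bit of '1' connects RBL to SL, a '0' connects RBLB to SL.
    SL is pulled up when the connected bitline is still at VDD.
    """
    return [rbl if k else rblb for (rbl, rblb), k in zip(bitlines, k_bits)]

def ideal_sl_voltage(pulls):
    """Step 3: idealized SL voltage, assumed linear in the number of pull-ups."""
    return VDD * sum(pulls) / len(pulls)

a1 = [1, 1]  # activation bits from the example in the text
k1 = [0, 1]  # kernel bits from the example in the text
lines = pseudo_read(a1)
pulls = sl_pull_ups(lines, k1)
# Each pull equals XNOR(kernel bit, RBL level left by the pseudo-read).
xnor = [1 - (k ^ rbl) for (rbl, _), k in zip(lines, k1)]
print(pulls, xnor, ideal_sl_voltage(pulls))
```

For the example in the text, the left column pulls SL up and the right column pulls it down, which is what the sketch reproduces before the pulls are accumulated on the shared SL.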
B. Dual Read-Wordline based Dual-stage ADC
In order to evaluate the popcount of the entire memory row
at once, we should be able to distinguish N distinct
states in the analog SL voltage, where N is the number of
columns in a memory array (we choose N=64, for a reasonably
sized array). In the output XNORed vector, there can be zero 1’s,
one 1, two 1’s, ... , up to N 1’s. Correspondingly, there are N
different voltage levels on the SL, which need to be sensed by
the ADCs. However, due to area and power constraints within
the memory array, it is infeasible to use high-precision ADCs,
such as the area-expensive SAR or power-hungry Flash ADCs.
We adopt a simple charge-sharing based serially integrating
type ADC for our purposes. However, instead of having to
sense N distinct analog levels, the ADC only needs to sense
N/8 levels. This is enabled by using a Dual RWL memory
structure, along with a dual-stage ADC.
The Dual RWL technique is shown in Fig. 3(a). Note that
we use two sets of read word-lines (RWL1a, RWL1b) for every
memory row. The first half of the cells along the row are connected
to RWL1a, while the rest are connected to RWL1b. Step
2 of the binary convolution (XNOR on SL) described above
is split in two parts. First, only RWL1a is enabled. Thus, only
N/2 cells are enabled to share charge with SL, either pulling
up or pulling down the SL voltage. The remaining half of the cells
are cut off from the SLs, and cannot participate in the charge
sharing. Once the SL voltage has been sensed by the ADC,
RWL1a is disabled, and RWL1b is enabled. Now, the other
half of the cells share charge to generate a voltage on SL.
Note that this does not change the swing on the SL, since the
SL voltage depends on the capacitive ratio CRBL/CSL. Thus,
the N/2 voltage levels are equally separated out from 0V to
VDD. This can be confirmed from Fig. 4, which shows the SL
voltages for N=64, with and without Dual RWL technique, as
a function of the popcount. Since the separation between the
states has increased, it becomes easier to sense the levels with
a low-overhead ADC.
Fig. 5. The figure shows the timing diagrams for the dual-stage ADC scheme.
The figure plots the SL voltage for various popcount cases. In the first-stage,
the sub-class SC1-4 is determined using multiple references (0.25V, 0.5V
and 0.75V). In the second-stage, charge is pumped-in/out of SL successively,
depending on the SC. The number of cycles it takes for SL to reach VREF
is counted. VREF for SC1-4 is 0.25V, 0.5V, 0.5V and 0.75V, respectively.
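The gain in sense margin from the Dual RWL technique can be illustrated with a small calculation. The sketch below assumes an idealized SL whose levels are evenly spaced over the full 0 V to VDD swing; the function name `level_spacing` and the 1 V supply are illustrative assumptions, not values from the paper.

```python
def level_spacing(cells_enabled, vdd=1.0):
    """Ideal gap between adjacent SL levels for a full 0 V to vdd swing.

    Assumes every enabled cell contributes one equal step (an idealization).
    """
    return vdd / cells_enabled

N = 64                            # columns per row, as chosen in the text
single_rwl = level_spacing(N)     # all N cells share charge at once
dual_rwl = level_spacing(N // 2)  # only N/2 cells share charge at a time
print(single_rwl, dual_rwl, dual_rwl / single_rwl)
```

Under this assumption, halving the number of participating cells doubles the spacing between neighboring levels while leaving the swing unchanged.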
The ADC used is shown schematically in Fig. 3(b). It
consists of two dummy bitcells per row (only 1 shown in
figure), two SAs, a counter and an ADC logic block. We
employ a dual-stage ADC to sense the analog voltage on SL.
In the first-stage for ADC sensing, we use multiple voltage
references (VDD/4, VDD/2 and 3VDD/4), to classify the analog
voltage levels into four sub-classes SC1, SC2, SC3, SC4 −
[0-VDD/4], [VDD/4-VDD/2], [VDD/2-3VDD/4] and [3VDD/4-
VDD], respectively. This is done using two voltage SAs, since
the voltage swing on SL spans 0V to VDD. On SA N, a
VREF of 3VDD/4 is applied, while for SA P, a VREF of
VDD/4 is applied. If both SA outputs are LOW, the SL voltage
is classified in SC1. Similarly, when both SA outputs are
HIGH, the SL voltage is classified in SC4. Otherwise, VREF
is changed to VDD/2, and the SA outputs are observed again.
If both outputs are HIGH, the SL voltage is classified in SC3,
otherwise SC2. Thus, the first-stage of the ADC generates the
MSB 2bits of the ADC output.
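The first-stage decision procedure can be written out as a short Python sketch. It assumes ideal comparators whose output is HIGH when the SL voltage exceeds the reference, and it assumes that both SAs receive VDD/2 in the second comparison; the text does not state either detail explicitly. The function names are hypothetical.

```python
def sense(v_sl, v_ref):
    """Idealized voltage SA: HIGH (True) when the SL voltage exceeds v_ref."""
    return v_sl > v_ref

def first_stage(v_sl, vdd=1.0):
    """Return the sub-class (1 to 4) using the two-step comparison in the text."""
    sa_n = sense(v_sl, 0.75 * vdd)  # SA N referenced to 3VDD/4
    sa_p = sense(v_sl, 0.25 * vdd)  # SA P referenced to VDD/4
    if not sa_n and not sa_p:
        return 1                    # both LOW: SL lies in [0, VDD/4]
    if sa_n and sa_p:
        return 4                    # both HIGH: SL lies in [3VDD/4, VDD]
    # Otherwise VREF is changed to VDD/2 and the SA outputs are observed again
    # (assumption: the same reference is applied to both SAs).
    sa_n = sense(v_sl, 0.5 * vdd)
    sa_p = sense(v_sl, 0.5 * vdd)
    return 3 if (sa_n and sa_p) else 2

for v in (0.10, 0.30, 0.60, 0.90):
    print(v, first_stage(v))
```

The four possible return values correspond to the two most-significant bits of the ADC output; the exact bit encoding of each sub-class is not specified in the text and is left out of the sketch.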
Fig. 6. a) Typical SRAM memory array with row and column peripherals storing the activations A1-Am, and kernels K1-Kn. b) Proposed sectioned-SRAM
array. By introducing switches along the RBLs, the array is divided into sections. The kernels are mapped into the sectioned-SRAM with each section storing
a different kernel. Once the activations are read onto the RBLs, the switches are opened, and the memory array is divided into sections. c) Since the RBLs
for each section have been decoupled, one RWL in each section can be simultaneously enabled such that each section performs the binary convolution
concurrently. For example, if A1 was read onto the RBLs before sectioning, enabling the rows K1 and K2 in Section 1 and 2 respectively, we obtain A1*K1
and A1*K2 in parallel.
Once the sub-classes of the analog voltage have been
defined, the second-stage of the ADC is initiated. The ADC
logic block generates a set of control signals (PCH0,
PCH1 and WLADC), which operate on the dummy bitcells.
For SC1 and SC2, SA P is enabled with a VREF of VDD/4
and VDD/2, respectively. PCH1 is pulsed alternately with
WLADC, to pump-in a small amount of charge into SL every
cycle through the dummy cells. In each cycle, when WLADC
is LOW and PCH1 is HIGH, the RBL of the dummy cell
is precharged to VDD. When WLADC is HIGH and PCH1
is LOW, the precharged RBL pumps-in charge to the SL. In
successive cycles, the voltage on SL increases. As soon as the
SL voltage exceeds VREF , SA P output flips from LOW to
HIGH. The number of cycles in the process is counted using
a digital counter. On the other hand, for sub-classes SC3 and
SC4, SA N is enabled with a VREF of VDD/2 and 3VDD/4,
respectively. PCH0 is enabled instead of PCH1, thereby
pumping-out charge from SL every cycle. Again, the number
of cycles is counted until SA N flips from HIGH to LOW.
This is illustrated in Fig. 5, which shows the operation of
ADC taking an example of popcount cases 0, 8, 24 and 32, for
N=64. For the popcount case 32 and 24, the sub-classes SC4
and SC3 are determined respectively, thus, charge is pumped
out of SL every cycle. Similarly for the popcount cases 8
and 0, SC2 and SC1 are determined respectively, and charge
is pumped into SL every cycle. The two dummy bitcells are
used to mimic the capacitances of the RBLs/RBLBs, such that
the charge being pumped in/out from SL every cycle by the
dummy bitcells mimics the charge sharing of RBLs/RBLBs
and SL in Step 2 (XNOR on SL) operation. Note that the
amount of charge being pumped-in/pumped-out exponentially
decreases with time. This is a fundamental limit of charge-sharing
type ADCs; thus, they work only if the number
of counts is small. In our case, for N=64, we count only
N/8 = 8 states using this ADC, which gives us fairly accurate
results, as shown later. Thus, the output from the first-stage
(sub-class SC1-4) along with the output from the second-stage
(ADC counts) estimates the number of 1’s (popcount) for the
XNORed output vector. Note that two sets of popcounts, one
from RWL1a and other from RWL1b, are sequentially read,
and then added together to get the final popcount of the vector.
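The behavior of the second stage, including the shrinking charge step, can be reproduced with a simple charge-sharing model. The sketch below is a behavioral illustration only: the capacitance ratio `ratio` = C_dummy / (C_dummy + C_SL), the 1 V supply, and the assumption that the dummy bitline sits at 0 V when pumping out are all hypothetical choices, not values from the paper.

```python
def count_cycles(v_sl, v_ref, pump_in, ratio=0.02, vdd=1.0, max_cycles=64):
    """Count charge-sharing cycles until the SL voltage crosses v_ref.

    pump_in=True shares charge with a dummy bitline precharged to vdd;
    pump_in=False shares charge with a dummy bitline assumed to be at 0 V.
    Returns (number of cycles, list of per-cycle voltage step magnitudes).
    """
    target = vdd if pump_in else 0.0
    steps = []
    for cycles in range(max_cycles + 1):
        crossed = (v_sl > v_ref) if pump_in else (v_sl < v_ref)
        if crossed:
            return cycles, steps
        # Charge sharing moves SL a fixed fraction of the way to the target,
        # so each step is smaller than the previous one.
        step = ratio * (target - v_sl)
        v_sl += step
        steps.append(abs(step))
    raise RuntimeError("SL did not cross the reference within max_cycles")

cycles_in, steps_in = count_cycles(0.10, 0.25, pump_in=True)
cycles_out, steps_out = count_cycles(0.90, 0.75, pump_in=False)
print(cycles_in, cycles_out, steps_in[0] > steps_in[-1])
```

Because every step is a fixed fraction of the remaining distance to the target, the step size decays geometrically, which is why the counted range has to stay small for the count to remain a usable estimate.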
C. Sectioned Memory Array for Parallel Computing
We have seen that XNOR and popcount operations can
be computed within the SRAM array. The manner in which
these computations are done opens possibilities for improving
the throughput and energy-efficiency in performing binary
convolutions. A typical operation in a CNN layer involves
convolution of input activations with multiple kernels. This
gives us an opportunity for data re-use, since the same set
of activations needs to be convolved with different kernels.
Our proposed scheme described above is well suited to exploit
this property of CNNs. Given a set of activations A1, A2,...,
Am and kernels K1, K2,..., Kn, stored within the memory
array (see Fig. 6(a)), we need to compute A1*K1, A1*K2,...,
A1*Kn, A2*K1, A2*K2,..., A2*Kn and so on. In our computations
described above, specifically in the pseudo-read step,
the data corresponding to A1 is read onto the RBL/RBLB
voltages. We propose sectioning the memory array into sub-
sections by introducing switches along the RBLs, as shown
in Fig. 6(b), such that kernels are grouped into different
sections. Each section has a separate ADC control
block, as shown in Fig. 6(c). After A1 has been read onto the
RBLs/RBLBs, the switches are opened. The RBLs/RBLBs in
individual sections store the information of data A1, but have
been decoupled. This allows us to enable one memory row in
all the sections corresponding to kernel K1 in section 1, K2 in
section 2, and so on, thereby evaluating the XNOR-popcount
operations concurrently, in all n sections. We thus obtain the
output A1*K1, A1*K2,..., A1*Kn in a single cycle. This
step can be repeated for all activations A1, A2,..., Am. Thus,
sectioning the memory array improves the throughput of our
computations n-fold. Moreover, with a single pseudo-read step,
we are able to perform n convolutions, thereby saving multiple
pseudo-read cycles which consume bitline precharge energy.
Specifically, without sectioning, one RBL and RBLB pre-
charge is required for every convolution operation in addition
to ADC energy consumption. With n-sections per sub-bank
we obtain n-convolutions per pre-charging of the RBL and
RBLB thereby not only increasing parallelism but also energy-
efficiency.
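The saving in pseudo-read cycles can be illustrated with a behavioral Python sketch. It assumes that one pseudo-read serves one kernel row per section, so kernels are processed in groups of at most `sections`; the function names and the tiny 4-bit vectors are illustrative only.

```python
def xnor_popcount(a_bits, k_bits):
    """Number of positions where the two bit vectors agree."""
    return sum(1 for a, k in zip(a_bits, k_bits) if a == k)

def run_sectioned(activations, kernels, sections):
    """Return (outputs, pseudo_reads) for a behavioral sectioned array."""
    outputs = {}
    pseudo_reads = 0
    for i, a in enumerate(activations):
        # Each group holds at most one kernel row per section.
        for start in range(0, len(kernels), sections):
            pseudo_reads += 1  # A_i is read onto the RBLs, then switches open
            for j in range(start, min(start + sections, len(kernels))):
                # In hardware these evaluations happen concurrently,
                # one per section.
                outputs[(i, j)] = xnor_popcount(a, kernels[j])
    return outputs, pseudo_reads

acts = [[1, 0, 1, 1], [0, 0, 1, 0]]
kers = [[1, 1, 1, 1], [0, 0, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1]]
out_plain, reads_plain = run_sectioned(acts, kers, sections=1)
out_sect, reads_sect = run_sectioned(acts, kers, sections=4)
print(reads_plain, reads_sect, out_plain == out_sect)
```

The results are identical in both cases; only the number of bitline pre-charge and pseudo-read steps changes, which is where the energy saving described above comes from.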
Let us now discuss how binary convolutions can be obtained
for large kernels using the distributive property of popcount. If
the kernel size is larger than the memory word length, which is
often the case in deeper state-of-the-art CNN layers, a single
kernel occupies multiple rows in the same or different sub-
banks. In-memory binary convolution is performed for each
of these kernel rows separately, and the partial popcounts
obtained from each operation are added to generate the final
popcount.
\[ \mathrm{popcount}(N_1 + N_2 + \dots) = \mathrm{popcount}(N_1) + \mathrm{popcount}(N_2) + \dots \]
Once the final popcount is obtained, the output of the binary
convolution operation is ‘1’ if the final popcount (number of
1’s) is greater than half the kernel size, and ‘0’ otherwise.
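As an illustration, the accumulation of partial popcounts and the final threshold can be sketched in Python. The row width, the example vectors and the function names are hypothetical. The sketch also checks the standard identity that, for bits mapped to -1/+1, the dot product equals twice the popcount minus the kernel size, so the threshold at half the kernel size is the sign of that dot product.

```python
def xnor_popcount(a_bits, k_bits):
    """Number of positions where the two bit vectors agree."""
    return sum(1 for a, k in zip(a_bits, k_bits) if a == k)

def binary_conv_output(a_bits, k_bits, row_width):
    """Split a long kernel into rows, add partial popcounts, then threshold.

    Returns (output bit, final popcount).
    """
    total = 0
    for start in range(0, len(k_bits), row_width):
        stop = start + row_width
        total += xnor_popcount(a_bits[start:stop], k_bits[start:stop])
    return (1 if total > len(k_bits) / 2 else 0), total

def signed_dot(a_bits, k_bits):
    """Dot product after mapping each bit b to 2*b - 1, i.e. to -1 or +1."""
    return sum((2 * a - 1) * (2 * k - 1) for a, k in zip(a_bits, k_bits))

a = [1, 0, 1, 1, 0, 0, 1, 0]
k = [1, 1, 1, 0, 0, 1, 1, 0]
out, pop = binary_conv_output(a, k, row_width=4)  # kernel spans two rows
print(out, pop, signed_dot(a, k) == 2 * pop - len(k))
```

Splitting the same vectors with a different `row_width` leaves the final popcount unchanged, which is the property the partial-sum scheme relies on.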
D. Results
The sectioned-SRAM array assuming a section size of 32
rows and 64 columns was simulated in HSPICE using the
45-nm predictive transistor models (PTM) [27]. As described
in the previous section, the final voltage at SL denotes the
popcount output of the binary convolution. The SL voltage
is sensed using the ADC described in the previous section.
Again, the 45-nm PTM models were used to simulate the SA
and the ADC logic block. Using the Dual RWL along with a
dual-stage ADC, the ADC output is relaxed to only 5bits. The
most-significant bits (2bits) are generated in the first-stage of
the ADC (sub-classes SC1-4) using multiple references, while
the lower bits (3bits) are generated in the second-stage by the
integrating ADC. We observe the effects of CMOS process
variation on the ADC output using Monte Carlo simulations,
in presence of 30mV sigma threshold voltage variation. Fig.
7 plots the distribution of the second-stage ADC output for
various popcount cases. Note that a similar trend repeats for
higher popcount cases with modulo-8, since only the lower
3bits of the output are generated in the second-stage. The
ADC output is fairly accurate with a small overlap with the
neighboring counts. The small inaccuracy is attributed to the
transistor threshold voltage variations in the memory array
and in the SAs used in the ADC. Moreover, the charge being
pumped in/out of SL decreases with each cycle, due to charge-
sharing, thereby inducing errors for higher counts. The inset
shows a best-fitting normal distribution for the variations in
the ADC output. The average standard deviation of the counts
was found to be ∼0.4359 counts. Total energy consumed
per operation was estimated to be ∼0.767pJ and ∼1.914pJ,
with and without sectioned-SRAM (4 sections per bank),
respectively. The energy was averaged over various popcount
cases. Here, by one operation, we mean XNOR + popcount
of a 64bit input activation and a 64bit kernel, both of which
are stored in the SRAM. The energy consumption includes
the pre-charge energy in the pseudo-read step and the ADC
energy. The latency of one operation was ∼45ns. This is due to
the low-overhead integrating ADC used, which serially counts
to estimate the popcount.
Fig. 7. Monte-Carlo simulations. The figure plots the histogram of the second-
stage output of the ADC, for various popcount cases, in presence of process
variations. Inset: Each histogram is fitted with a Gaussian distribution. The
average standard deviation of the counts is ∼0.4359. The trend repeats for
higher popcount cases with modulo-8, since only the lower 3bits of the output
are generated in the second-stage.
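The 5-bit ADC output and the benefit of sectioning follow from simple arithmetic on the numbers reported above. The worked lines below only restate values from the text (N = 64, four sub-classes, 4 sections per bank); the rounded energy ratio is computed here for illustration and is not a figure quoted by the authors.

```latex
\begin{align*}
\text{levels per RWL phase} &= \frac{N}{2} = 32
  = \underbrace{4}_{\text{sub-classes}} \times \underbrace{8}_{\text{counts}}
  = 2^{2} \times 2^{3}
  \;\Rightarrow\; 2 + 3 = 5 \text{ bits}, \\
\frac{E_{\text{without sectioning}}}{E_{\text{with sectioning}}}
  &\approx \frac{1.914\ \text{pJ}}{0.767\ \text{pJ}} \approx 2.5 .
\end{align*}
```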
III. IN-MEMORY BINARY CONVOLUTION − PROPOSAL-B
In the previous section, we described an energy-efficient
implementation of performing binary convolutions within the
SRAM array. However, the low-overhead ADC used to deter-
mine the popcount induces errors in the convolution output,
which may impact the system accuracy, as we will show later.
The primary cause of the inaccuracy is the generation and
detection of an analog voltage, which is susceptible to noise,
offset etc. Thus, in this section, we propose yet another imple-
mentation of enabling binary convolutions in standard SRAM
arrays, by modifying the peripheral circuitry. This approach is
robust since the popcount is computed using digital logic gates
(full-adders), unlike Proposal-A which uses analog voltages.
Although this robustness comes at a cost of energy-efficiency
and throughput as compared to the previous proposal based on
charge-sharing, our simulations show that this implementation
is still better than the typical von-Neumann based approach as
it leverages in-memory computing for XNORs and pop-count
operations.
A. Bitwise XNORs
Bitwise Boolean operations within SRAM arrays have re-
cently been demonstrated in [15], [16], [28]. The idea is to
enable two RWLs together during a read operation. Let us
consider words A and B stored in two rows of the mem-
ory array. Note that we can simultaneously enable the two
corresponding RWLs without worrying about read disturbs,
since the bit-cell has decoupled read-write paths (shown in Fig.
8(a)). The RBL/RBLB are pre-charged to VDD. For the case
AB = ‘00’ (‘11’), RBL (RBLB) discharges to 0V, but RBLB
(RBL) remains in the precharged state. However, for cases
‘10’ and ‘01’, both RBL and RBLB discharge simultaneously.
The four cases are summarized in Fig. 8(b). Now, in order
to sense bit-wise XNOR from the RBL/RBLB voltages, we
use two asymmetric SAs (see Fig. 8(c) [15]) which compute
the bitwise NAND/NOR in parallel. Asymmetric SAs work
by sizing one of the transistors MBL/MBLB bigger than the
other. In Fig. 8(c), if the transistor MBL is sized bigger than
MBLB, its current carrying capability increases. Thus, for
cases ‘01’ and ‘10’, where both RBL and RBLB discharge
simultaneously, the SAout node discharges faster, and the
cross-coupled inverter pair of the SA stabilizes with
SAout=‘0’. For the case ‘11’ (‘00’), RBL (RBLB) starts to
discharge while RBLB (RBL) stays at VDD, making
SAout=‘1’ (‘0’). SAout therefore generates the AND of the two
bits (and SAoutb the NAND), so we call this sense amplifier
SANAND. Similarly, by sizing MBLB bigger than MBL,
OR/NOR gates can be obtained, and we call that sense
amplifier SANOR. Next, by ORing the NOR and AND outputs
obtained from SANOR and SANAND respectively, the bitwise
XNOR operation is realized. A detailed description of the
bitwise Boolean XNOR used in this work can be found in [15].
Fig. 8. (a) A 10T-SRAM bitcell schematic is repeated here for convenience. (b) Timing diagram used for in-memory computing with 10T-SRAM bitcells.
(c) Circuit schematic of the asymmetric differential sense amplifier. [15]
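The logic of this sensing scheme can be illustrated with a short behavioral sketch in Python. It models only the Boolean outputs of the two sense amplifiers, not the bitline voltages or timing. The function name and the width parameter are illustrative choices and do not come from [15].

```python
def xnor_via_asymmetric_sas(a: int, b: int, width: int = 64) -> int:
    """Behavioral model: XNOR of two stored words from AND and NOR outputs.

    a, b  : the two words stored in the simultaneously enabled rows
    width : number of columns (hypothetical parameter for this sketch)
    """
    mask = (1 << width) - 1
    sa_and = a & b              # AND output of the SANAND sense amplifier
    sa_nor = ~(a | b) & mask    # NOR output of the SANOR sense amplifier
    # ORing the AND and NOR outputs gives 1 exactly where the bits agree.
    return (sa_and | sa_nor) & mask


if __name__ == "__main__":
    # 4-bit example: the words agree in bit positions 3 and 0.
    print(format(xnor_via_asymmetric_sas(0b1100, 0b1010, width=4), "04b"))
```

The result matches the direct definition `~(a ^ b)` restricted to the word width, which is a convenient check when modeling the array in software.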
B. Popcount
In order to utilize the above mentioned approach for en-
abling binary convolutions, we propose to add a bit-tree adder
after the asymmetric-SA stage to generate the popcount, as
shown in Fig. 9. By enabling RWLs corresponding to rows
storing activation (A1) and kernel (K1), the asymmetric-SAs
generate the XNORed vector. The output XNORed vector is
passed to the bit-tree adder. It consists of multiple full-adder
(FA) blocks connected in a tree manner. The bit-tree adder
sums up all the bits of the output XNORed vector to generate
the popcount. The first layer of the bit-tree adder consists
of single FA blocks, each of which is capable of adding
three consecutive bits to generate a 2-bit output. In the next
layer, 2-bit adders are used, which are constructed using two
stacked FA blocks. The second layer generates 3-bit output. In
subsequent layers, multiple FA blocks are stacked to construct
multi-bit adders. Finally in the log(N) layer, where N is the
number of columns in the sub-array, the popcount output is
generated, and is read out from the memory.
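A minimal software sketch of such a bit-tree adder is given below. The text does not specify the exact wiring of the tree, so the sketch assumes zero-padding of the input to a multiple of three, ripple-carry adders built from stacked FA blocks, and pairwise merging of partial sums in each layer. All function names are hypothetical.

```python
def full_adder(a: int, b: int, c: int) -> tuple[int, int]:
    """One FA block: returns (sum, carry) for three input bits."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def ripple_add(x: list[int], y: list[int]) -> list[int]:
    """Add two little-endian bit lists with stacked FA blocks."""
    n = max(len(x), len(y))
    x = x + [0] * (n - len(x))
    y = y + [0] * (n - len(y))
    out, carry = [], 0
    for xb, yb in zip(x, y):
        s, carry = full_adder(xb, yb, carry)
        out.append(s)
    out.append(carry)  # each layer's result is one bit wider
    return out


def bit_tree_popcount(bits: list[int]) -> int:
    """Popcount of an XNORed vector using a tree of adders."""
    # First layer: one FA per three consecutive bits gives 2-bit partial sums.
    padded = list(bits) + [0] * (-len(bits) % 3)  # assumption: zero-padding
    partial = []
    for i in range(0, len(padded), 3):
        s, c = full_adder(*padded[i:i + 3])
        partial.append([s, c])
    if not partial:
        return 0
    # Later layers: merge partial sums pairwise until one value remains.
    while len(partial) > 1:
        merged = [ripple_add(partial[i], partial[i + 1])
                  for i in range(0, len(partial) - 1, 2)]
        if len(partial) % 2:
            merged.append(partial[-1])  # odd one out passes through
        partial = merged
    return sum(bit << k for k, bit in enumerate(partial[0]))
```

Because every adder keeps its carry-out, the tree returns the exact number of ones in the input vector.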
To incorporate convolutions with large kernel sizes, the
partial popcount generated from the bit-tree adders can be
summed up over multiple cycles, to generate the final popcount.
Note that the generated popcount is exact, as it is
computed using conventional digital logic gates. Also note
that the sectioned-SRAM concept described in the previous
section is not applicable for this proposal.
Fig. 9. Modified peripheral circuitry of the SRAM array to enable binary
convolution operations. It consists of two asymmetric SAs - SANOR and
SANAND which pass the XNORed data vector to a bit-tree adder. The adder
has log(N) layers, where N is the number of inputs to the adder. It sums the
input bits to generate the popcount.
Fig. 10. (a) Modified von-Neumann architecture based on Xcel-RAM memory banks and enhanced instruction set architecture (ISA) of the processor. (b)
Snippet of assembly code for performing a binary convolution operation using conventional instructions and custom instructions.
C. Results
A 128 × 64-bit SRAM array along with the asymmetric
SAs − SANOR and SANAND were simulated in HSPICE
using the 45-nm predictive transistor models (PTM) [27]. As
described above, two RWLs are enabled simultaneously, and
depending on the data stored in each of the bits, SANOR and
SANAND generate bitwise NOR/OR and NAND/AND, re-
spectively. Readers are referred to [15] for more circuit details
and simulations. The energy consumption and latency of the
bitwise XNOR operation were estimated to be 29.67fJ/bit and
1ns, respectively. The energy consumption includes the
pre-charge energy and the energy consumed in asymmetric-SAs.
The bit-tree adder was modeled in Verilog, and synthesized
using Synopsys Design Compiler to 45-nm tech node. The
inputs to the bit-tree adder block are 64 wires, which represent
the bitwise XNORed data generated from the SA stage. The
output is a 6-bit popcount. The total power and the critical-path
delay of the bit-tree adder in performing a 64-bit popcount
were estimated to be 0.26mW and 0.3ns, respectively.
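As a rough back-of-the-envelope check, the reported figures can be combined into a per-operation estimate for one 64-bit XNOR+popcount. The sketch below is not part of the reported results. It assumes that the adder energy can be approximated as total power times critical-path delay, and that the adder starts only after the XNOR completes (no pipelining). Both assumptions are ours, as are the function names.

```python
# Figures reported in the text (45-nm simulations).
XNOR_ENERGY_FJ_PER_BIT = 29.67  # bitwise XNOR energy per bit
XNOR_LATENCY_NS = 1.0           # bitwise XNOR latency
ADDER_POWER_MW = 0.26           # bit-tree adder total power
ADDER_DELAY_NS = 0.3            # bit-tree adder critical-path delay
WORD_BITS = 64                  # width of the XNORed vector


def xnor_energy_fj(bits: int = WORD_BITS) -> float:
    """Energy of one bitwise XNOR over `bits` columns, in fJ."""
    return XNOR_ENERGY_FJ_PER_BIT * bits


def adder_energy_fj() -> float:
    """ASSUMPTION: energy ~ power x critical-path delay.

    1 mW x 1 ns = 1 pJ = 1000 fJ.
    """
    return ADDER_POWER_MW * ADDER_DELAY_NS * 1000.0


def serial_latency_ns() -> float:
    """ASSUMPTION: XNOR and popcount run back to back, unpipelined."""
    return XNOR_LATENCY_NS + ADDER_DELAY_NS


if __name__ == "__main__":
    print(f"XNOR  : {xnor_energy_fj():.2f} fJ")
    print(f"adder : {adder_energy_fj():.2f} fJ")
    print(f"delay : {serial_latency_ns():.2f} ns")
```

Under these assumptions the 64-bit XNOR accounts for roughly 1.9 pJ and the adder for roughly 78 fJ per operation, so the bitline computation dominates the estimate.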
IV. SYSTEM-LEVEL EVALUATION FRAMEWORK FOR BNN
In this section, we describe the framework developed to
evaluate the benefits of our proposals at a system-level, taking
an example of a deep binary neural network. We use a
modified von-Neumann based system architecture, where the
SRAM banks are replaced with our proposed Xcel-RAM
banks (Proposal-A/Proposal-B) with embedded convolution
compute capabilities. By utilizing these in-memory convolu-
tions, we demonstrate the benefits in the overall system energy
consumption and latency per inference.
A. Simulation Methodology
The modified von-Neumann processing architecture is
shown in Fig. 10(a). It consists of a processor, an Xcel-RAM
memory-block and an instruction-memory, connected by a sys-
tem bus. The Xcel-RAM block consists of multiple subarrays
that are arranged in a typical banked structure. We use the
CACTI tool [29] to model a 64KB Xcel-RAM bank. The
circuit numbers for a subarray obtained from HSPICE with the
45nm PTM models [27] were put in CACTI to obtain the per-
access energy and latency of memory read/write operations as
well as the binary convolution operation. These include the energy
consumed in H-trees, WL decoders, BL drivers, SAs, muxes
etc. Next, a cycle-accurate RTL model was developed for
Xcel-RAM banks, which was integrated with Intel’s programmable
Nios-II processor [30], with instruction set (ISA) extensions to
leverage the Xcel-RAM compute capabilities (see Fig. 10(b)).
The system bus follows the Avalon memory-mapped protocol,
with enhanced bus architecture to support passing multiple
addresses at a time. Note that this is not a large overhead since
in-memory instructions do not pass the data operands, and
thus the data-channel is used to pass extra memory addresses
over the bus [31]. Note that although we show a typical
von-Neumann based system, Xcel-RAM banks can be interfaced
with general-purpose graphics processing unit (GP-GPU)
based systems as well, to leverage data parallelism along with
in-memory computing. Our aim here was to show the benefits
of replacing conventional SRAM banks with compute capable
Xcel-RAM banks.
TABLE I
BENCHMARK BINARY NEURAL NETWORK [11] USED FOR CLASSIFYING
CIFAR10 DATASET.
Fig. 11. Layer-wise (a) energy consumption and (b) latency, for running the CIFAR-10 image classification benchmark on the proposed designs, and the
baseline.
The binary neural network (BNN) proposed in [11] uses
binary bipolar values (±1) for both weights and activations.
Note that in our memory, +1 is stored as logic HIGH
bit, while −1 is stored as logic LOW bit. We trained a BNN
using the algorithm proposed in [11] on the PyTorch platform
[32] using the GitHub repository [33] of the same work. The
neural network architecture is given in Table I. The network
was evaluated on CIFAR-10 [34]. All layers were binarized,
except Conv1 and FC3 layers. It was observed that ∼ 99.4%
of total computations occur in the binarized layers - Conv2-6
and FC1-2, all of which can utilize the Xcel-RAM convolution
capabilities (see Table I). In other words, ∼ 99.4% of total
computations per-inference can be mapped using custom
Xcel-RAM instructions, thereby giving us significant improvements
in energy and throughput. Each of these layers was run on
the modified von-Neumann architecture described above. We
assume that the binarized kernels are stored in an off-chip
memory, and the kernels for a particular layer are loaded
into the SRAM before processing that layer. Typical values
of DRAM access energy and latency were taken from litera-
ture [35]. The software was modified by replacing repetitive
convolution operations with our custom instruction macros.
In every layer, the convolutions are split into multiple 64-
bit XNOR+popcount operations, which are then accumulated
to compute the final output. The final output is stored back
into the SRAM, which would be the input activations for the
succeeding layer.
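The splitting of a convolution into 64-bit XNOR+popcount operations can be sketched in software as follows. The sketch uses the storage convention stated above (+1 as logic HIGH, −1 as logic LOW). The final conversion from accumulated popcount to a bipolar dot product, 2 × popcount − N, is the standard identity for ±1 vectors and is not spelled out in the text. The function names are hypothetical.

```python
WORD_BITS = 64  # bits handled by one XNOR+popcount operation


def encode(values: list[int]) -> list[int]:
    """Map bipolar values to stored bits: +1 -> 1 (HIGH), -1 -> 0 (LOW)."""
    return [1 if v > 0 else 0 for v in values]


def binary_dot(acts: list[int], weights: list[int],
               word_bits: int = WORD_BITS) -> int:
    """Dot product of two +/-1 vectors via chunked XNOR + popcount."""
    if len(acts) != len(weights):
        raise ValueError("activation and weight vectors must match in length")
    a_bits, w_bits = encode(acts), encode(weights)
    total_popcount = 0
    # Each iteration stands in for one in-memory XNOR+popcount operation.
    for start in range(0, len(a_bits), word_bits):
        a_chunk = a_bits[start:start + word_bits]
        w_chunk = w_bits[start:start + word_bits]
        # XNOR is 1 where the bits agree; popcount counts those positions.
        total_popcount += sum(1 for x, y in zip(a_chunk, w_chunk) if x == y)
    n = len(a_bits)
    # Agreements contribute +1 and disagreements -1 to the dot product.
    return 2 * total_popcount - n


if __name__ == "__main__":
    acts = [1, -1, 1, 1]
    weights = [1, 1, -1, 1]
    print(binary_dot(acts, weights))  # two agreements, two disagreements
```

A vector longer than 64 elements simply takes several iterations, with the partial popcounts accumulated before the final conversion.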
As a baseline, we use a similar system architecture, but with
standard SRAM banks with only read/write capability, instead
of Xcel-RAM banks. The convolution operation is performed
in software through conventional instructions. A snippet of the
assembly code for convolution in the baseline and Xcel-RAM
based designs is shown in Fig. 10(b).
B. Results and Discussion
The full-precision accuracy of the network was 91.703%.
The accuracy of the binary neural network was observed to
be 89.294%, an expected drop due to binarization. We then
evaluate the impact of inaccuracies in the ADC for Proposal-
A (due to process variations) on the classification accuracy
using our simulation framework. At every binarized layer, each
element of the output map is a sum of N binary XNORs,
where N = k² × I, k is the filter height, and I is the
number of input channels. Our proposed methodology can
perform 64 binary operations at once, in two steps of 32 bits
each. Hence, the number of popcounts done per element of
an output map is M = ceil(N/32). We add the popcount
error, obtained from circuit simulations, to the output during
inference and obtain an accuracy of 88.710%, a decrease
of 0.584% from the ideal BNN accuracy of 89.294%. On the
other hand, Proposal-B obtains ideal BNN accuracy because
the computations are done using a digital adder-tree.
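The count of popcounts per output element follows directly from the expressions above, as the short sketch below shows. The layer dimensions in the example (a 3 × 3 filter with 128 input channels) are hypothetical values chosen for illustration and are not taken from Table I. The function name is also our own.

```python
import math


def popcounts_per_output(k: int, in_channels: int,
                         bits_per_popcount: int = 32) -> tuple[int, int]:
    """Return (N, M) for one element of an output map.

    N = k^2 x I   binary XNORs summed per output element
    M = ceil(N / bits_per_popcount)   popcounts needed per output element
    """
    n = k * k * in_channels
    m = math.ceil(n / bits_per_popcount)
    return n, m


if __name__ == "__main__":
    # Hypothetical layer: 3x3 filter, 128 input channels.
    n, m = popcounts_per_output(k=3, in_channels=128)
    print(n, m)  # 1152 XNORs, 36 popcounts
```

Since each popcount in Proposal-A carries its own ADC error, M indicates how many error terms accumulate into a single output element.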
Fig. 11 shows the layer-wise energy consumption and
latency for Proposal-A, Proposal-B, and the baseline. Note
that we focus only on layers Conv2-6 and FC1-2, as they
constitute the majority of the total computations. It can be
observed that layers Conv2,4,6 are the most compute intensive
layers, due to larger kernels. Overall, per-inference, 6.1× and
2.3× improvements were obtained in energy consumption,
for Proposal-A and Proposal-B, respectively, compared to the
baseline. In terms of latency, 15.8× and 8.1× improvements
were obtained per-inference, for Proposal-A and Proposal-B,
respectively. These improvements can be attributed to the
fact that the most compute intensive operations involved in
the BNN inference, bitwise-XNOR followed by popcount,
are performed efficiently within the memory, thereby saving
the majority of unnecessary memory accesses and computations.
Moreover, the energy and latency benefits of Proposal-A arise
from the low-overhead ADC and the sectioned SRAM arrays,
which enable multiple operations in a single memory access.
In Proposal-B, although the sectioning is not applicable, the
energy and latency benefits arise from the bit-wise XNOR
computations on the bitline using asymmetric SAs and the
digital bit-tree adder to generate the result in the memory array
itself.
V. CONCLUSION
Enhanced memory blocks having built-in compute function-
ality can operate as on-demand accelerators for machine learn-
ing computations, while simultaneously operating as usual
memory read-write units for general-purpose workloads. In
this work, we demonstrated two novel techniques to enable
binary convolutions within standard SRAM memory arrays.
In the first proposal, we use charge-sharing on the inherent
parasitic capacitances present in the 10T-SRAM structure to
embed vector XNOR operations. Further, we use a dual-
read wordline along with a dual-stage ADC, to handle the
inaccuracies in the low precision, low-overhead ADC. A
key highlight of this proposal is the sectioned-SRAM, which
enables multi-row convolutions in parallel, thereby improving
the overall system performance and energy-efficiency. The
second proposal uses asymmetric SAs and a bit-tree adder
in the memory peripherals to perform bit-wise XNOR com-
putations and popcount in-memory. A complete framework
was developed to evaluate a benchmark application (CIFAR-
10) using our proposed memory arrays. For a system with
our proposed Xcel-RAM banks, 6.1× and 2.3× improvements
were obtained in energy consumption, and 15.8× and 8.1×
improvements were obtained in the latency for the respective
proposals, compared to a conventional SRAM based system.
ACKNOWLEDGEMENTS
The research was funded in part by C-BRIC, one of
six centers in JUMP, a Semiconductor Research Corporation
(SRC) program sponsored by DARPA, the National Science
Foundation, Intel Corporation and the Vannevar Bush Faculty
Fellowship.
REFERENCES
[1] Y. Bengio et al., “Learning deep architectures for AI,” Foundations and
Trends® in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.
[2] N. Jones, “The learning machines,” Nature, vol. 505, no. 7482, p. 146,
2014.
[3] D. Silver et al., “Mastering the game of go with deep neural networks
and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[4] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in 2016 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR). IEEE, jun 2016.
[5] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den
Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanc-
tot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever,
T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis,
“Mastering the game of go with deep neural networks and tree search,”
Nature, vol. 529, no. 7587, pp. 484–489, jan 2016.
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet:
A Large-Scale Hierarchical Image Database,” in CVPR09, 2009.
[7] A. Krizhevsky et al., “Imagenet classification with deep convolutional
neural networks,” in Advances in neural information processing systems,
2012, pp. 1097–1105.
[8] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,
V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,”
in 2015 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR). IEEE, jun 2015.
[9] M. L. Schneider, C. A. Donnelly, S. E. Russek, B. Baek, M. R. Pufall,
P. F. Hopkins, P. D. Dresselhaus, S. P. Benz, and W. H. Rippard, “Ul-
tralow power artificial synapses using nanotextured magnetic josephson
junctions,” Science advances, vol. 4, no. 1, p. e1701329, 2018.
[10] J. Backus, “Can programming be liberated from the von neumann style?:
A functional style and its algebra of programs,” Commun. ACM, vol. 21,
no. 8, pp. 613–641, Aug. 1978.
[11] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio,
“Binarized neural networks: Training deep neural networks with
weights and activations constrained to +1 or -1,” arXiv preprint
arXiv:1602.02830, 2016.
[12] G. Srinivasan, A. Sengupta, and K. Roy, “Magnetic tunnel junction en-
abled all-spin stochastic spiking neural network,” in Design, Automation
Test in Europe Conference Exhibition (DATE), 2017, March 2017, pp.
530–535.
[13] ——, “Magnetic tunnel junction based long-term short-term stochastic
synapse for a spiking neural network with on-chip STDP learning,”
Scientific Reports, vol. 6, no. 1, jul 2016.
[14] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “Xnor-net:
Imagenet classification using binary convolutional neural networks,” in
European Conference on Computer Vision. Springer, 2016, pp. 525–
542.
[15] A. Agrawal, A. Jaiswal, C. Lee, and K. Roy, “X-SRAM: Enabling in-
memory boolean computations in CMOS static random access memo-
ries,” IEEE Transactions on Circuits and Systems I: Regular Papers, pp.
1–14, 2018.
[16] Q. Dong, S. Jeloka, M. Saligane, Y. Kim, M. Kawaminami, A. Harada,
S. Miyoshi, D. Blaauw, and D. Sylvester, “A 0.3v VDDmin 4+2t SRAM
for searching and in-memory computing using 55nm DDC technology,”
in 2017 Symposium on VLSI Circuits. IEEE, jun 2017.
[17] A. Jaiswal, I. Chakraborty, A. Agrawal, and K. Roy, “8t SRAM cell
as a multi-bit dot product engine for beyond von-neumann computing,”
arXiv preprint arXiv:1802.08601, 2018.
[18] Y. Chen et al., “Dadiannao: A machine-learning supercomputer,” in
Proceedings of the 47th Annual IEEE/ACM International Symposium
on Microarchitecture. IEEE Computer Society, 2014, pp. 609–622.
[19] Y.-H. Chen et al., “Eyeriss: An energy-efficient reconfigurable accelera-
tor for deep convolutional neural networks,” IEEE Journal of Solid-State
Circuits, vol. 52, no. 1, pp. 127–138, 2017.
[20] A. Agrawal, A. Ankit, and K. Roy, “SPARE: Spiking neural network
acceleration using rom-embedded rams as in-memory-computation prim-
itives,” IEEE Transactions on Computers, pp. 1–1, 2018.
[21] A. Ankit et al., “RESPARC: A reconfigurable and energy-efficient ar-
chitecture with memristive crossbars for deep spiking neural networks,”
in Proceedings of the 54th Annual Design Automation Conference 2017
on - DAC 17. ACM Press, 2017.
[22] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Stra-
chan, M. Hu, R. S. Williams, and V. Srikumar, “ISAAC: A convolutional
neural network accelerator with in-situ analog arithmetic in crossbars,”
in 2016 ACM/IEEE 43rd Annual International Symposium on Computer
Architecture (ISCA). IEEE, jun 2016.
[23] F. Alibart, L. Gao, B. D. Hoskins, and D. B. Strukov, “High precision
tuning of state for memristive devices by adaptable variation-tolerant
algorithm,” Nanotechnology, vol. 23, no. 7, p. 075201, jan 2012.
[24] A. Chen and M.-R. Lin, “Variability of resistive switching memories
and its impact on crossbar array performance,” in Reliability Physics
Symposium (IRPS), 2011 IEEE International. IEEE, 2011, pp. MY–7.
[25] A. Biswas and A. P. Chandrakasan, “Conv-ram: An energy-efficient
SRAM with embedded convolution computation for low-power
cnn-based machine learning applications,” in Solid-State Circuits
Conference-(ISSCC), 2018 IEEE International. IEEE, 2018, pp. 488–
490.
[26] B. Razavi, Analog-to-Digital Converter Architectures. Wiley-IEEE Press,
1995, pp. 272–.
[27] “Predictive technology models,” [Online] http://ptm.asu.edu/, June 2016.
[28] S. Jeloka, N. B. Akesh, D. Sylvester, and D. Blaauw, “A 28 nm con-
figurable memory (tcam/bcam/sram) using push-rule 6t bit cell enabling
logic-in-memory,” IEEE Journal of Solid-State Circuits, vol. 51, no. 4,
pp. 1009–1021, April 2016.
[29] “Cacti 6.0: A tool to understand large caches,” [Online]
http://www.hpl.hp.com/research/cacti/.
[30] “Nios II processor overview,” in Embedded SoPC Design with Nios II
Processor and VHDL Examples. John Wiley & Sons, Inc., sep 2011,
pp. 179–188.
[31] S. Jain, A. Ranjan, K. Roy, and A. Raghunathan, “Computing in memory
with spin-transfer torque magnetic ram,” IEEE Transactions on Very
Large Scale Integration (VLSI) Systems, vol. 26, no. 3, pp. 470–483,
March 2018.
[32] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin,
A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in
pytorch,” in NIPS-W, 2017.
[33] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio,
“Binarynet.pytorch,” https://github.com/itayhubara/BinaryNet.pytorch,
2017.
[34] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from
tiny images,” Citeseer, Tech. Rep., 2009.
[35] N. Chatterjee, M. O’Connor, D. Lee, D. R. Johnson, S. W. Keckler,
M. Rhu, and W. J. Dally, “Architecting an energy-efficient DRAM
system for GPUs,” in 2017 IEEE International Symposium on High
Performance Computer Architecture (HPCA). IEEE, feb 2017.
