Deep in-memory computing by Kang, Mingu
© 2017 Mingu Kang
DEEP IN-MEMORY COMPUTING
BY
MINGU KANG
DISSERTATION
Submitted in partial fulﬁllment of the requirements
for the degree of Doctor of Philosophy in Electrical and Computer Engineering
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2017
Urbana, Illinois
Doctoral Committee:
Professor Naresh R. Shanbhag, Chair
Professor Rob A. Rutenbar
Associate Professor Pavan Kumar Hanumolu
Associate Professor Naveen Verma, Princeton University
ABSTRACT
There is much interest in embedding data analytics into sensor-rich platforms such as wear-
ables, biomedical devices, autonomous vehicles, robots, and Internet-of-Things to provide
these with decision-making capabilities. Such platforms often need to implement machine
learning (ML) algorithms under stringent energy constraints with battery-powered electronics.
In particular, energy consumption in the memory subsystem dominates the energy budget of
such systems. In addition, memory access latency is a major bottleneck for overall system
throughput. To address these issues in memory-intensive inference applications, this
dissertation proposes the deep in-memory accelerator (DIMA), which deeply embeds computation
into the memory array, employing two key principles: (1) accessing and processing multiple rows
of the memory array at a time, and (2) embedding pitch-matched low-swing analog processing
at the periphery of the bitcell array. The signal-to-noise ratio (SNR) is budgeted by employing
low-swing operations in both memory read and processing to exploit the application-level
error immunity for aggressive energy efficiency.
This dissertation ﬁrst describes the system rationale underlying the DIMA's processing
stages by identifying the common functional ﬂow across a diverse set of inference algorithms.
Based on the analysis, this dissertation presents a multi-functional DIMA to support four
algorithms: support vector machine (SVM), template matching (TM), k-nearest neighbor
(k-NN), and matched ﬁlter. The circuit and architectural level design techniques and guide-
lines are provided to address the challenges in achieving multi-functionality. A prototype
integrated circuit (IC) of a multi-functional DIMA was fabricated with a 16 KB SRAM array
in a 65 nm CMOS process. Measurement results show up to 5.6× and 5.8× energy and delay
reductions leading to 31× energy delay product (EDP) reduction with negligible (≤1%) ac-
curacy degradation as compared to the conventional 8-b ﬁxed-point digital implementation
optimally designed for each algorithm.
DIMA has also been applied to more complex algorithms: (1) convolutional neural
network (CNN), (2) sparse distributed memory (SDM), and (3) random forest (RF).
System-level simulations of CNN using circuit behavioral models in a 45 nm SOI CMOS
demonstrate that high probability (>0.99) of handwritten digit recognition can be achieved
using the MNIST database, along with a 24.5× reduced EDP, a 5.0× reduced energy, and
a 4.9× higher throughput as compared to the conventional system. The DIMA-based SDM
architecture also achieves up to 25× and 12× delay and energy reductions, respectively, over
conventional SDM with negligible accuracy degradation (within 0.4%) for 16×16 binary-
pixel image classiﬁcation. A DIMA-based RF was realized as a prototype IC with a 16 KB
SRAM array in a 65 nm process. To the best of our knowledge, this is the ﬁrst IC realization
of an RF algorithm. The measurement results show that the prototype achieves a 6.8× lower
EDP compared to a conventional design at the same accuracy (94%) for an eight-class traﬃc
sign recognition problem.
The multi-functional DIMA and extension to other algorithms naturally motivated us to
consider a programmable DIMA instruction set architecture (ISA), namely MATI. This dis-
sertation explores a synergistic combination of the instruction set, architecture and circuit
design to achieve programmability without losing DIMA's energy and throughput benefits.
Employing silicon-validated energy, delay, and behavioral models of deep in-memory
components, we demonstrate that MATI is able to realize nine ML benchmarks while
incurring negligible overhead in energy (< 0.1%), area (4.5%), and throughput over a
fixed four-function DIMA. In this process, MATI simultaneously achieves enhancements
in both energy (2.5× to 5.5×) and throughput (1.4× to 3.4×) for an overall EDP
improvement of up to 12.6× over fixed-function digital architectures.
To my wife, children, and my parents for their patience and support.
ACKNOWLEDGMENTS
My deepest thanks go to my wife and children, Steven and Brandon, who have gone through
the most hectic period in our lives together. My parents are always proud of my academic
career, which was great encouragement to me. I am exceptionally grateful to my adviser,
Professor Naresh Shanbhag for his extensive amount of time and eﬀort to train me as an inde-
pendent researcher. He demonstrated doing his best every single moment, uncompromising
high standards, and ceaseless passion for teaching. Especially, I made a mistake in my ﬁrst
prototype IC and wasted signiﬁcant time and research funds. But he encouraged rather than
blamed me and allowed one more opportunity to make up for my mistake. Thanks to his
trust in me, I was able to complete this dissertation and build the mixed-signal IC tape-out
ﬂow for machine learning in our group. I appreciate Professor Pavan Hanumolu, Professor
Naveen Verma, and Professor Rob Rutenbar for their insightful suggestions and for agree-
ing to be on my committee. Their suggestions during my prelim have greatly aﬀected my
research direction and helped to improve my Ph.D. research. I would like to thank Ameya
Patil and Yongjune Kim for theoretical analysis of deep in-memory computation in Chapter
2, and Sujan Gonugondla for providing simulation models for the convolutional neural net-
work in Chapter 4. It was fantastic to have the opportunity to work with Professor Vikram
Adve and his student, Prakalp Srivastava for studying the programmable deep in-memory
architecture in Chapter 5. I would also like to sincerely thank Min-sun Keel and Wooseok
Choi for their generous help and patient guidance for analog and digital IC tape-out ﬂows. I
am also grateful to my research group members, Sungmin Lim, Yingyan Lin, Charbel Sakr,
Sujan Gonugondla, Ameya Patil, and Dr. Yongjune Kim; my research group alumni, Dr.
Sai Zhang and Dr. Eric Kim for their valuable advice. I gratefully acknowledge Systems on
Nanoscale Information fabriCs (SONIC), one of the six SRC STARnet Centers, sponsored
by SRC and DARPA. I would also like to acknowledge constructive discussions with Sean
Eilert, Ken Curewitz, Professor Naveen Verma, Professor Boris Murmann, and Professor
Pavan Hanumolu. Finally, this journey would not have been possible without the members
of Korean ECE marathon team "Before sunrise", Minji Kim, Hojeong Yu, Key-whan Chung,
and Sungmin Lim. I was able to maintain my physical and spiritual stamina via training
with them every Saturday.
CONTENTS
Chapter 1 INTRODUCTION
1.1 Related Work
1.2 Dissertation Contributions and Organization
Chapter 2 DEEP IN-MEMORY ARCHITECTURE (DIMA)
2.1 DIMA Overview
2.2 DIMA Design Principles
2.3 Functional Read (FR)
2.4 Bitline Processing (BLP) and Cross BLP (CBLP)
2.5 Models for Circuit Non-Ideal Behavior, Energy, and Delay
2.6 Conclusion
Chapter 3 DIMA PROTOTYPE INTEGRATED CIRCUITS
3.1 Multi-Functional DIMA Architecture
3.2 Multi-Functional DIMA Operations
3.3 Measured Results of Multi-Functional DIMA IC
3.4 Random Forest (RF) DIMA IC
3.5 Conclusion
Chapter 4 MAPPING INFERENCE ALGORITHMS TO DIMA
4.1 Convolutional Neural Network (CNN)
4.2 Sparse Distributed Memory (SDM)
4.3 Conclusion
Chapter 5 MATI: DIMA INSTRUCTION SET ARCHITECTURE
5.1 Background
5.2 MATI: A Programmable DIMA
5.3 Validation Methodology
5.4 Evaluation Results
5.5 Conclusion
Chapter 6 CONCLUSIONS AND FUTURE WORK
6.1 Dissertation Contributions
6.2 Future Work
REFERENCES
Chapter 1
INTRODUCTION
Current and emerging applications increasingly rely on the ability to extract patterns from
large data sets in order to support inference and decision making. These applications rely
heavily on machine learning (ML). Though ML algorithms have begun to exceed human
performance in cognitive and decision-making tasks [1, 2], they tend to be computationally
complex and require processing large data volumes. These tasks have typically been processed
in the cloud because of their heavy computational complexity, as shown in Fig. 1.1(a) [3].
However, this paradigm requires transferring a large volume of data to data centers and
returning the extracted information from the cloud to the device, which consumes 9× more
energy for the transfer than for the processing itself [3]. Therefore, there is increasing interest in embedding data
analytics into sensor-rich platforms to provide these decision-making capabilities locally as
shown in Fig. 1.1(b). Such platforms often need to implement ML algorithms under severe
resource constraints. Primary metrics for the design of such intelligent systems are: (1)
energy eﬃciency, (2) decision latency and throughput, and (3) decision(-making) accuracy.
Energy eﬃciency is critical for embedded battery-powered and autonomous platforms. As
a result, a number of integrated circuit (IC) implementations of ML kernels and algorithms
have appeared recently [413] to address the problems of designing energy eﬃcient ML
systems in silicon.
Figure 1.1: Machine learning (ML) platforms: (a) in the cloud, and (b) in silicon (figure
courtesy [14–18]).
Figure 1.2: Energy breakdown: (a) pattern matching using Manhattan distance, and (b)
cross correlation, where the dotted part is from the digital computational block (including
the clock tree network) and the others are from memory. The peripheral circuitry (Peri)
includes the energy from the sense amplifier (SA), decoder, 4:1 column mux, and WL driver.
ML algorithms are computationally intensive and require processing of large data volumes.
Therefore, the energy consumption of ML hardware comprises the energy costs of memory
accesses and arithmetic operations. Of these, memory accesses tend to dominate as each
access is expensive, e.g., 20–100 pJ per access of a 16-b word from a 32 kB to 1 MB SRAM
versus 1 pJ per multiplication in a 45 nm process [19]. This observation was confirmed with
two simple inference tasks: pattern matching using (1) Manhattan distance and (2) cross
correlation to find the image closest to an input query image among 64 candidate images.
An SRAM with a 512×256 bitcell array and synthesized digital logic with 8-b input/output
precision was employed for post-layout simulations in a 65 nm process. The energy breakdown
(Fig. 1.2) demonstrates that memory energy dominates, taking up to 90% of the total system
energy. In addition, recent implementations of deep neural networks (DNN) [4, 20] also
report that memory accesses account for the largest portion (between 35% and 45%) of the
total energy cost.
The data access costs are reported in the context of the von Neumann architecture, which
separates memory from the processor (Fig. 1.3(a)). In memory-intensive applications, this
separation severely increases memory access energy and limits throughput, which is referred
to as the von Neumann bottleneck [21]. Hence, there is an imperative need to rethink the
processor and memory designs for memory-intensive inference applications.
Figure 1.3: Architectures for inference: (a) conventional, and (b) deep in-memory
architecture (DIMA), where P is an input pattern and D is the stored database.
1.1 Related Work
Several research eﬀorts have tried to minimize the data access cost through architectural
optimization. Processor-in-memory (PIM) architectures such as Smart Memory [22–24] and
Intelligent RAM [25] locate frequently used logic (e.g., pointer logic [24] or MAC [25]) close
to memory using a wide crossbar. However, physical proximity does not reduce memory
read and processing costs themselves.
An eﬀective approach to reduce memory access energy and enhance throughput for ML
algorithms is to reuse the data once it is read from memory. DianNao [20] ﬁrst identiﬁed
and exploited an opportunity of massive data reuse across the fetched tile of input and
output feature maps in a convolutional neural network (CNN) achieving 21× energy savings.
Eyeriss [4, 26] extended the data reuse opportunities at multiple levels (convolutional, ﬁlter
and input feature map reuse) to achieve up to 2.5× energy savings for AlexNet.
Low-power circuit techniques have also been explored to achieve energy eﬃcient processing
for ML algorithms. ENVISION [12] implemented the CNN with a dynamic voltage-accuracy-
frequency scaling technique given a bit-precision requirement. A speech recognizer with a
deep neural network (DNN) [10] was also introduced with a voice-activated power gating
technique for energy efficiency during stand-by mode. A sparse DNN engine [11] applied the
RAZOR technique [27] to allow minimum supply voltage by tolerating timing errors. These
approaches achieve signiﬁcant energy eﬃciency, but without exploiting the opportunities
aﬀorded by analog processing.
Low-voltage SRAM techniques have been proposed [28,29] to reduce the energy of memory
read accesses. These techniques involve operating the bitcell array (BCA) at voltages in
the range of a few hundred mVs, which reduces the throughput signiﬁcantly into the kHz
regime. The low-voltage operations degrade SRAM's read and write static margins causing
catastrophic failure of inference applications when MSB errors occur. Therefore, SRAM
was tailored for inference algorithms to address this issue [30, 31]. Here, selective bitline
(BL) negative boosting is employed to improve write-ability and protect MSBs during the
read operation by selective error correcting code (ECC). However, these techniques suﬀer
from large BL toggling energy and dropping out of LSBs to accommodate ECC check bits,
respectively. In [13], a ﬁlter approximation technique was employed and accelerated by
7T SRAM to fetch convolution ﬁlter coeﬃcients eﬃciently by enabling two read modes:
row-access and column-access, but at the cost of degraded storage density by employing an
additional transistor in the bitcell.
To sum up, previous approaches have addressed the energy cost by co-locating the pro-
cessor and memory, minimizing data accesses (via data reuse) or employing low-power dig-
ital techniques for processing and low-voltage memories. In contrast, associative memo-
ries [32,33] embed simple logic operations into the BCA to determine a data vector with the
minimum Hamming or Euclidean distances from a reference data vector. This is done at the
expense of storage density due to the logic circuits added to a bitcell. Kerneltron [34] also
embeds computation (bit-wise multiplication) into the BCA to process and read simultane-
ously in the charge domain. However, this requires the use of charge injection devices in the
BCA and a massive array of ADCs to interface analog and digital processing. Moreover, the
need for special devices makes it incompatible with mainstream memory topologies such as
SRAM or DRAM.
1.2 Dissertation Contributions and Organization
This dissertation proposes deep in-memory architecture (DIMA) [35], which eliminates the
separation between memory and processor to minimize the cost of not only memory read but
also processing as shown in Fig. 1.3(b). This is achieved by unconventional ways of accessing
data from memory and deeply embedding mixed-signal circuitry into the memory. DIMA is
characterized by the following:
• multi-row low-swing (e.g., < 30 mV/LSB) memory read: multiple rows of the BCA are
accessed simultaneously via pulse width or amplitude modulated (PWAM) WL signals,
referred to as multi-row functional read (FR)
• low-swing data processing: pitch-matched mixed-signal circuitry is embedded at the
periphery of the BCA for further processing of BL information
• preservation of the standard BCA structure: storage density and conventional
read/write functionality are thereby maintained without incurring delay and energy penalties
This dissertation demonstrates the versatility of DIMA by achieving up to 24.5× reduction
in energy-delay product (EDP) for various algorithms such as template matching (TM) [35],
CNN [36], and sparse distributed memory (SDM) [37, 38] in simulation. The DIMA
concept has also been verified with two prototype ICs demonstrating up to 31× reduction
in EDP. This dissertation also introduces energy, delay, and mixed-signal circuit behavioral
models, which are employed to predict system-level performance as well as energy and
delay trends as a function of major design parameters. Finally, the DIMA platform is
extended to a programmable instruction set architecture (ISA) to support various ML
algorithms and a user-friendly programming interface.
While DIMA has strong potential for energy and throughput beneﬁts, the following design
challenges need to be addressed without compromising the accuracy:
• Algorithm: the common functional flow across a diverse set of ML algorithms needs
to be identified and then mapped onto DIMA's sequential processing stages to cover a
wide variety of ML algorithms.
• Architecture: the BCA has to be kept intact, thereby preserving the storage density
of standard SRAM. The read/write functionality also needs to be preserved without
incurring delay and energy penalties.
• Circuit: multiple functions need to be enabled with reconfigurable mixed-signal
circuitry complying with the stringent row and column pitch-matching requirements
imposed by the BCA.
• Modeling: accurate statistical modeling of the non-ideal analog circuit behaviors is
required to study the impact of non-idealities on application-level accuracy. The
accuracy needs to be maintained even with diverse noise sources from low-SNR
processing.
• Programmability: the instruction set needs to be designed considering analog-driven
DIMA operations. In addition, the throughput and accuracy losses from introducing
programmability should be minimized.
These challenges are addressed in the following chapters:
Chapter 2 introduces DIMA, its unique features and processing stages. The system-
level rationale is also provided to demonstrate DIMA's robustness in the presence of noise.
Furthermore, design techniques and guidelines are provided to address the circuit and archi-
tectural levels' implementation challenges. In addition, this chapter provides energy, delay
and behavioral models with the key design parameters. Then, the models are employed to
predict the application level's accuracy.
Chapter 3 presents two DIMA prototype ICs in a 65 nm CMOS process: (1) multi-
functional DIMA that supports four algorithms: support vector machine (SVM), template
matching (TM), k-nearest neighbor (k-NN), and matched ﬁlter (MF), and (2) random forest
(RF). This chapter also describes design techniques to achieve the re-conﬁgurability with
analog circuitry. Measurement results of those prototype ICs show up to 31× and 6.8× EDP
reductions with negligible (≤1%) accuracy degradation, respectively.
Chapter 4 applies DIMA to more complex algorithms: (1) CNN, and (2) SDM to show
the versatility of DIMA. In particular, the algorithms are optimized to maximize DIMA's
benefits using two techniques: error-aware retraining for CNN and hierarchical decision in
SDM.
Chapter 5 extends DIMA to a programmable instruction set architecture (ISA), namely
MATI. The instruction set is designed to align well with the common functional flow of
the nine ML benchmarks employed in MATI. Simulation results show that MATI achieves
an EDP improvement of up to 12.6× over fixed-function digital architectures. In addition,
MATI's programmability overhead is negligible in energy (< 0.1%), area (4.5%), and
throughput over the multi-functional DIMA.
Chapter 6 concludes this dissertation and provides future research direction.
Chapter 2
DEEP IN-MEMORY ARCHITECTURE (DIMA)
This chapter provides an overview of DIMA, where a common functional ﬂow across
a diverse set of inference algorithms is identiﬁed and mapped to the sequential ﬂow of
four processing stages on DIMA. DIMA's underlying design principles are also explained to
demonstrate DIMA's robustness despite its low-swing operations. Furthermore, this chapter
provides practical design guidelines and techniques for the circuit and architectural
implementations. Finally, energy, delay, and behavioral models of the analog circuitry are
provided in Section 2.5.
2.1 DIMA Overview
Figure 2.1(a) describes the typical functional flow of ML algorithms, which compute the
vector distance (VD) between the N-dimensional vectors D (stored data) and P (input
pattern), followed by a thresholding function f( ) to generate the decision y. The VD is
obtained by computing the element-wise scalar distances (SDs) and then aggregating them.
Figure 2.1(b) lists the required operations for each algorithm (e.g., VD: dot product, and
f( ): sign in SVM). This common algorithmic flow is mapped to the four sub-blocks of DIMA
corresponding to the following processing stages: (1) multi-row functional read (FR) for
fetching data D, (2) BL processing (BLP) for SD computations, (3) cross BL processing
(CBLP) for the aggregation of SD results, and (4) ADC and residual digital logic (RDL) for
realizing the thresholding decision function f( ).
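As an illustrative sketch of this common flow (not the dissertation's implementation), the Python snippet below computes the element-wise SD, aggregates it into a VD, and applies f( ) for two of the algorithms in Fig. 2.1(b). The function names, the 1/N scaling taken from Fig. 2.1(a), the omission of an SVM bias term, and the example data are all assumptions made for illustration.

```python
from typing import Callable, Sequence


def multiply(d: float, p: float) -> float:
    """Scalar distance (SD) for a dot product (SVM, MF)."""
    return d * p


def abs_diff(d: float, p: float) -> float:
    """Scalar distance (SD) for a Manhattan distance (TM, k-NN)."""
    return abs(d - p)


def vector_distance(d: Sequence[float], p: Sequence[float],
                    sd: Callable[[float, float], float]) -> float:
    """Aggregate the element-wise SDs into a vector distance (VD)."""
    assert len(d) == len(p)
    return sum(sd(di, pi) for di, pi in zip(d, p)) / len(d)  # 1/N scaling


def svm_decide(weights: Sequence[float], p: Sequence[float]) -> int:
    """f( ) = sign of the dot product (bias term omitted for brevity)."""
    return 1 if vector_distance(weights, p, multiply) >= 0 else -1


def tm_decide(templates: Sequence[Sequence[float]], p: Sequence[float]) -> int:
    """f( ) = min: index of the stored template closest to P."""
    vds = [vector_distance(d, p, abs_diff) for d in templates]
    return min(range(len(vds)), key=vds.__getitem__)


# Hypothetical 4-element data, chosen only to exercise the flow.
templates = [[0, 0, 0, 0], [5, 9, 3, 7], [15, 15, 15, 15]]
query = [4, 9, 2, 8]
print(tm_decide(templates, query))        # 1 (second template is closest)
print(svm_decide([1, -1, 1, -1], query))  # -1 (dot product is negative)
```

Only the SD function and the final decision change between algorithms, which is why a single sequence of processing stages can serve all four.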
The conventional digital architecture (Fig. 2.2(a)) and DIMA [35–37, 39] (Fig. 2.2(b)) both
employ identical BCAs to store D and an input buffer to store the streamed P. A key difference
is that, in DIMA, Ncol analog SD computations are embedded next to the BLs via BLPs,
while the digital architecture needs to fetch the data out of the memory before processing.
Figure 2.1: Functional flow of inference algorithms: (a) functional diagram, and (b)
operations for each algorithm.
(b)
Algorithm   Vector distance      Scalar distance       f( )
SVM         Dot product          Multiplication        sign
TM          Manhattan distance   Absolute difference   min
k-NN        Manhattan distance   Absolute difference   majority vote
MF          Dot product          Multiplication        max
Figure 2.2: Inference architecture (red marked drivers: wordline (WL) drivers turned on at
a time): (a) conventional system, and (b) deep in-memory accelerator (DIMA).
Table 2.1: DIMA vs. conventional architecture (with an Ncol × Nrow bitcell array).
Attribute                  Conventional        DIMA
Data storage pattern       row major           column major
Column mux ratio           L:1                 none (bypassed)
Fetched words per access   Ncol/(LB)           Ncol
BL swing/LSB               250 – 300 mV        5 – 30 mV
# of rows per access       1                   B
WL driver                  fixed pulse width   pulse width/amplitude modulated
In this way, DIMA can bypass the column muxing requirements imposed on conventional
SRAMs. Furthermore, DIMA can directly aggregate the outputs of BLPs by CBLP to
generate a VD. Thus, the ﬁnal output of the analog section in DIMA is a VD instead of data
bits as in the case of a digital architecture. These diﬀerences are summarized in Table 2.1
and described as follows:
• Storage pattern: DIMA stores the B bits of D in a column-major format vs. the row-major
format used in the digital architecture (Fig. 2.3(a)).
• Read access: per BL precharge (read cycle), DIMA reads a function of B rows (a
word-row) vs. a single row in the conventional architecture. This process, referred to
as multi-row functional read (FR), generates a BL voltage drop ∆VBL proportional to
a weighted sum of the B bits per column [35] by using pulse-width modulated (PWM)
(Fig. 2.3(b)) or pulse amplitude modulated (PAM) WL signals [40, 41]. Thus, DIMA
needs far fewer precharge cycles to read the same number of bits, which leads to
both energy and throughput gains. However, in exchange, DIMA relaxes the fidelity
of its reads as long as these fall within the error tolerance of the ML algorithm.
• Column muxing: unlike standard SRAMs, which require an L:1 column mux ratio
(typically L = 4 to 32) to accommodate a large-area sense amplifier (SA) as shown in
Fig. 2.2(a), DIMA bypasses it via SD computations in BLPs whose horizontal dimension
is matched to the column pitch of the BCA. Column muxing limits the number
of bits per access to Ncol/L in standard SRAM compared to NcolB in FR.
• Data reduction and decision: while the conventional architecture computes in a digital
processor, DIMA implements SD and VD via BLP and CBLP right next to the BCA
using charge-based analog circuits. An ADC is used to digitize the analog CBLP
output and pass it on to the RDL to compute f( ) in Fig. 2.2(b). This ADC operates
once per 128–256 SD computations.
• Accuracy vs. energy: DIMA computations necessarily have a lower signal-to-noise
ratio (SNR) than the computations in the digital architecture. This loss in SNR arises
from the spatial transistor threshold voltage variations in the BCA, which affect the
FR process, and from the severe area constraints on the BLP and CBLP. However, the
SNR can be tuned to a level required by the ML algorithm by adjusting the BL swing
∆VBL.
In summary, DIMA's speed-up and energy advantages over a digital architecture arise
from its ability to read NcolB bits per access by FR, from bypassing the column mux, and
from low-swing analog processing. The detailed operations of the circuits and architecture
in the FR, BLP, and CBLP stages are described in the following sections.
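The read-access and column-muxing differences above can be sketched numerically. The snippet below is an idealized model, not the circuit described in this dissertation: it assumes binary-weighted WL pulse widths, ignores BL/BLB polarity and all non-idealities, and uses hypothetical values for the swing per LSB (chosen inside the 5 – 30 mV range of Table 2.1) and for the array dimensions.

```python
from typing import Sequence, Tuple


def functional_read(bits_msb_first: Sequence[int], dv_lsb_mv: float = 10.0) -> float:
    """Ideal BL voltage drop (mV) of one column after a functional read.

    Assumption: the row holding bit i is enabled for a pulse width
    proportional to 2**i, so the drop is proportional to the word value.
    """
    b = len(bits_msb_first)
    word = sum(bit << (b - 1 - i) for i, bit in enumerate(bits_msb_first))
    return dv_lsb_mv * word


def bits_per_access(n_col: int, b: int, l: int) -> Tuple[int, int]:
    """Bits fetched per precharge: Ncol/L (conventional) vs. Ncol*B (FR)."""
    return n_col // l, n_col * b


print(functional_read([1, 0, 1, 1]))          # word value 11 -> 110.0 mV
print(bits_per_access(n_col=256, b=4, l=4))   # (64, 1024)
```

With these assumed parameters, one FR precharge covers as many bits as 16 conventional read cycles, which is the factor L·B implied by the column-muxing bullet.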
2.2 DIMA Design Principles
This section explains the DIMA design principles based on SNR budgeting. ML algorithms
have inherent error resiliency for the following reasons: (1) the thresholding operation into
a fixed number of classes provides an algorithmic noise margin, as small errors do not change
the classification results, and (2) the aggregation of a large number of elements, widely
used for dimensionality reduction in ML algorithms, makes the system insensitive to
component noise.
Figure 2.3: Comparison of conventional read and DIMA multi-row functional read (FR)
operations: (a) fetched data, where B = 4 and L = 4 are assumed and the red marked
bitcells are read simultaneously, and (b) bitline swing (∆VBL).
Figure 2.4: Functional flow of inference system: (a) conventional system, and (b) DIMA
system.
The inherent noise immunity can be exploited to achieve aggressive energy and throughput
benefits for hardware implementations. Traditionally, supply voltage scaling has been widely
employed to achieve energy efficiency, but at the cost of significantly degraded throughput
(e.g., degrading quadratically with voltage scaling [42]). Moreover, errors tend to occur
at the MSBs in digital processors due to the carry-ripple behavior of arithmetic operations [43].
Figure 2.5: Delayed vs. early decision scenarios: (a) simplified functional flow with
x and x̂ ∈ {1, −1}, and (b) bit error probability pe vs. SNR.
Another potential approach to exploit the error immunity of ML is to reduce ∆VBL to
save memory read energy. However, the conventional architecture with separate memory
and processor does not allow sufficient ∆VBL scaling. This is because ∆VBL needs to
be converted to full swing by the SA before it can be processed in digital logic, and the SA
acts as a bit-wise early decision (Fig. 2.4(a)). If early decision errors occur at the MSBs due
to the noise η1,2,...,N under ∆VBL scaling, the error magnitude can increase beyond the
application's error immunity, leading to a catastrophic failure of the inference system (e.g.,
peak signal-to-noise ratio (PSNR) < 10 dB with MSB errors [30, 31]). Being aware of this
issue, several low-power SRAM techniques [30, 31] were proposed for inference applications
with selective MSB protection, achieving limited (35%) energy savings in the memory, but
not in the processor.
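To make the MSB sensitivity concrete, the following minimal sketch computes how much an unsigned word changes when a single bit is misread. The helper name `flip_error` and the 8-b example value are assumptions for illustration only.

```python
def flip_error(word: int, bit: int) -> int:
    """Absolute change in an unsigned word when the given bit position flips."""
    return abs((word ^ (1 << bit)) - word)


word = 0b10110101  # arbitrary 8-b example value
for bit in (0, 7):
    print(bit, flip_error(word, bit))  # LSB flip -> 1, MSB flip -> 128
```

A single early-decision error at the MSB of an 8-b word therefore perturbs the value 128 times more than the same error at the LSB.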
To resolve these issues, DIMA eliminates the separation between the low-swing memory
and the high-swing processor, and instead budgets SNR by employing aggressively low-swing
operations jointly in both memory and processor (Fig. 2.4(b)). This is enabled by the FR,
which implicitly performs D/A conversion, and the subsequent mixed-signal BLP and CBLP
processing. Despite the degraded SNR, the system-level robustness of DIMA is maintained
by the following three principles: (1) delayed decision, (2) non-uniform bit protection, and (3)
aggregation.
2.2.1 Delayed Decision
Due to the absence of SAs between memory and processor, DIMA does not require an early
decision right after the BL discharge, where noise ηa1 is added. Instead, the hard decision
(thresholding) happens only at the end of the classification process (Fig. 2.4(b)), which is
referred to as delayed decision. In this section, early and delayed decisions are compared
using a simple example of binary bit transmission (x ∈ {1, −1}) in Fig. 2.5(a), where a noise
distribution η1,2 ∼ N(0, σ_n^2), a threshold level of 0, and equal priors
P(X = 1) = P(X = −1) = 0.5 are assumed. The bit error probabilities pe, assuming
independent and identically distributed noise sources, can be derived as follows:
pe =

2Q( 1
σn
)[1−Q( 1
σn
)] for early decision
Q( 1√
2σn
) for delayed decision
(2.1)
Figure 2.5(b) shows pe behavior with respect to the SNR based on (2.1). It is clearly shown
that the delayed decision is more robust in a low-SNR regime as the early decision tends to
amplify noise contribution in the regime. In this sense, the delayed decision improves the
robustness of DIMA, where the SNR is tightly budgeted.
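The comparison in (2.1) can be reproduced numerically. The following is a minimal sketch, assuming a unit-amplitude signal and defining the per-stage SNR as 1/σn² (an illustrative definition, not taken from the text); the function names are hypothetical.

```python
import math


def q_function(x: float) -> float:
    """Gaussian tail probability Q(x) = P(Z > x) for Z ~ N(0, 1)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def pe_early(sigma_n: float) -> float:
    """Bit error probability of (2.1) with a hard decision after each noisy stage.

    The final bit is wrong when exactly one of the two decisions flips it.
    """
    q = q_function(1.0 / sigma_n)
    return 2.0 * q * (1.0 - q)


def pe_delayed(sigma_n: float) -> float:
    """Bit error probability of (2.1) with one decision after both noise sources.

    The two independent noise terms add, so the noise variance doubles.
    """
    return q_function(1.0 / (math.sqrt(2.0) * sigma_n))


def snr_db(sigma_n: float) -> float:
    """Per-stage SNR in dB for a unit-amplitude signal (assumed definition)."""
    return 10.0 * math.log10(1.0 / sigma_n ** 2)


if __name__ == "__main__":
    for sigma_n in (2.0, 1.0, 0.5):
        print(f"SNR = {snr_db(sigma_n):6.2f} dB: "
              f"early = {pe_early(sigma_n):.4f}, delayed = {pe_delayed(sigma_n):.4f}")
```

Under this simple model the delayed decision has the lower error probability for σn = 2 and σn = 1, while the early decision is lower at σn = 0.5, which is consistent with the advantage being confined to the low-SNR regime.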
2.2.2 Non-Uniform Bit Protection
The FR with PWAM (Fig. 2.2(b)) allows us to assign voltage swing unequally based on the
significance of information (bit positions), namely non-uniform bit protection. When the total swing ∆VBL is budgeted for multi-row reading of B bits, the swing ∆Vn assigned to the n-th bit position is as follows:

$$\Delta V_n = \begin{cases} \Delta V_{BL}/B & \text{for uniform bit protection} \\ \Delta V_{BL}\,2^{n-1}/(2^B - 1) & \text{for non-uniform bit protection} \end{cases} \tag{2.2}$$
For example, the non-uniform bit protection gives roughly 2.1× more swing for MSB and
3.8× less swing for LSB compared to uniform bit protection when B = 4. In this way,
the limited resource (voltage swing) is more eﬃciently budgeted in the DIMA to reduce the
errors in more signiﬁcant bit positions and thus improve application level robustness.
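The B = 4 figures quoted above follow directly from (2.2). The sketch below evaluates both allocations; the function name and the 300 mV total swing are illustrative assumptions.

```python
def swing_per_bit(delta_v_bl: float, num_bits: int, non_uniform: bool) -> list:
    """Swing assigned to bit positions n = 1 (LSB) ... B (MSB), following (2.2)."""
    if non_uniform:
        total_weight = 2 ** num_bits - 1
        return [delta_v_bl * 2 ** (n - 1) / total_weight
                for n in range(1, num_bits + 1)]
    return [delta_v_bl / num_bits] * num_bits


# Example with an assumed total swing of 300 mV and B = 4.
uniform = swing_per_bit(0.3, 4, non_uniform=False)
weighted = swing_per_bit(0.3, 4, non_uniform=True)
msb_gain = weighted[-1] / uniform[-1]   # extra swing given to the MSB
lsb_loss = uniform[0] / weighted[0]     # swing reduction taken from the LSB
```

For B = 4 the two ratios evaluate to 32/15 and 15/4, i.e., the roughly 2.1× and 3.8× quoted in the text, and both allocations sum to the same total swing.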
2.2.3 Aggregation
The charge-sharing process in the CBLP efficiently averages out noise contributions, reducing the standard deviation of the output by a factor of $\sqrt{N}$ after aggregating N independent zero-mean random noise sources in Fig. 2.4(b). Specifically, the DIMA's CBLP is effective in averaging out the noise in the SD computations as the error statistics of noise in the analog circuitry closely follow independent zero-mean behavior (e.g., spatial threshold variation). Moreover, DIMA provides the averaging opportunity for both ηa1 and ηa2 before the thresholding due to the delayed decision, whereas the conventional digital architecture goes through the early decision before having the aggregation opportunity.
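The √N reduction can be checked with a small Monte Carlo experiment. This is a sketch under the stated assumption of independent zero-mean Gaussian noise with equal variance on every column; the function name and the trial count are illustrative.

```python
import random
import statistics


def aggregated_noise_std(num_sources: int, sigma: float, trials: int,
                         seed: int = 0) -> float:
    """Estimate the standard deviation of the average of `num_sources` noise samples."""
    rng = random.Random(seed)
    averages = []
    for _ in range(trials):
        total = sum(rng.gauss(0.0, sigma) for _ in range(num_sources))
        averages.append(total / num_sources)   # charge sharing modeled as a plain average
    return statistics.pstdev(averages)


# With N = 64 sources of unit standard deviation, the estimate should be close to 1/8.
estimate = aggregated_noise_std(num_sources=64, sigma=1.0, trials=2000)
```

The estimate fluctuates with the random seed, so it is compared against σ/√N with a tolerance rather than exactly.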
2.2.4 Measured Result
This section demonstrates the effectiveness of the above-mentioned principles based on measurement results of a prototype IC in a 65 nm CMOS process, which is described in more detail in Chapter 3. Figure 2.6(a) shows the bit error rate (BER) measured by fetching 16 KB via the conventional SRAM read and the energy trend when scaling ∆VBL. The measured BER is injected into system simulations for face detection using SVM to evaluate the impact on the application-level accuracy in Fig. 2.6(b), where ∆VBL,avg is the average BL voltage swing assigned per bit for a fair comparison of the two systems. The DIMA's FR fetches B bits with a single BL swing ∆VBL whereas conventional memory reads a single bit, leading to the following definition of ∆VBL,avg:

$$\Delta V_{BL,avg} = \begin{cases} \Delta V_{BL} & \text{for conventional memory} \\ \Delta V_{BL}/B & \text{for DIMA} \end{cases}$$

Figure 2.6: Bit error rate (BER) vs. ∆VBL vs. application accuracy (face detection with SVM) in conventional system: (a) BER of normal SRAM read operation measured from prototype IC, plotted against ∆VBL [mV], and (b) probability of detection (Pdet) vs. ∆VBL,avg [mV], obtained by simulating a conventional system with the measured BER and compared with the measured DIMA accuracy.
The setup for SVM including data set and image size is described in Chapter 3. It is
shown that 0.7% BER at ∆VBL = 230mV can cause 4% degradation in detection accuracy
of the conventional system. On the other hand, DIMA's measured accuracy shows much
more robustness with even lower ∆VBL,avg due to the above mentioned three principles.
Moreover, the effective ∆VBL of the conventional architecture is L times larger than the values shown in Fig. 2.6(b), as L BLs need to be discharged to access only one of them through column muxing. Thus, it can be concluded that the DIMA allows more aggressive ∆VBL scaling than the conventional system, leading to significant energy savings.
The following section focuses on the realization of the above-described system rationale, and on the implementation challenges and solutions.
2.3 Functional Read (FR)
This section explains the FR stage in detail, which performs data access and simple SD
computations. Two design techniques: (1) sub-ranged read, and (2) replica bitcell are in-
troduced to enhance the linearity of FR and achieve eﬃcient data writing, respectively. In
addition, design principles for key parameters such as pulse width T0 and amplitude VWL are
presented to minimize the impact of non-ideal behavior from low-swing analog processing.
Figure 2.7: Multi-row functional read (FR) using pulse width modulated access pulses [35]: (a) column structure and bitcell, and (b) waveforms during a 4-bit word (D = 0000b) read-out, with pulse widths T0, T1 = 2T0, T2 = 4T0, and T3 = 8T0 producing BL drops of ∆Vlsb, 2∆Vlsb, 4∆Vlsb, and 8∆Vlsb (WL pulses are sequentially applied for visibility, but can be overlapped).
2.3.1 FR Operation
The FR stage generates the bitline voltage drop ∆VBL(D) proportional to the weighted sum $D = \sum_{i=0}^{B-1} 2^i d_i$ of column-major stored data {d0, d1, ..., dB−1} (see Fig. 2.7(a)). The voltage drop ∆VBL(D) can be generated via a simultaneous application of PWAM access pulses to multiple rows per precharge cycle. This is in contrast to the use of single-row fixed width and amplitude pulses per precharge cycle in conventional SRAM read.
Consider FR of B rows using PWM [35] with binary-weighted pulse widths $T_i \propto 2^i$ ($i \in [0, B-1]$) of VWL(i) as shown in Fig. 2.7(b) [35]. The charge ∆Qi(di) drawn from the BL capacitance CBL is given by:

$$\Delta Q_i(d_i) = \bar{d}_i T_i I(T_i) \tag{2.3}$$

where I(t) is the current drawn by the i-th bitcell and $\bar{d}_i$ is the complement of $d_i$. This current can be approximated by a Taylor series as:

$$I(t) = \frac{V_{PRE}}{R_i} e^{-\frac{t}{R_i C_{BL}}} \approx \frac{V_{PRE}}{R_i}\left(1 - \frac{t}{R_i C_{BL}}\right) \approx \frac{V_{PRE}}{R_i} \tag{2.4}$$

provided $t \ll R_i C_{BL}$. Substituting t = Ti into (2.4) and the resulting expression for I(Ti) into (2.3), we obtain:

$$\Delta Q_i(d_i) = \bar{d}_i T_i \frac{V_{PRE}}{R_i} \tag{2.5}$$

where $T_i \ll R_i C_{BL}$. Therefore, the expression for the total BL voltage drop ∆VBL(D) can be obtained as follows:

$$\Delta V_{BL}(D) = \frac{\sum_{i=0}^{B-1} \Delta Q_i}{C_{BL}} = \frac{V_{PRE}}{C_{BL}} \sum_{i=0}^{B-1} \frac{\bar{d}_i T_i}{R_i} \tag{2.6}$$

As the pulse widths are binary weighted, $T_i = 2^i T_0$, where T0 is the LSB pulse width, and if Ri = RBL, i.e., the discharge paths of all the B bitcells in a column have identical resistances, then

$$\Delta V_{BL}(D) = \frac{V_{PRE}}{R_{BL} C_{BL}} T_0 \sum_{i=0}^{B-1} 2^i \bar{d}_i = \Delta V_{lsb} \sum_{i=0}^{B-1} 2^i \bar{d}_i = \Delta V_{lsb} \bar{D} \tag{2.7}$$

where $\Delta V_{lsb} = \frac{V_{PRE} T_0}{R_{BL} C_{BL}}$, and $\bar{D}$ is the decimal value of the one's complement of D. The expression in (2.7) is idealized as it assumes the following four conditions:
1. $T_i \ll R_i C_{BL}$
2. $T_i = 2^i T_0$
3. Ri = RBL (no variation across rows)
4. RBL is a constant over VBL.
In practice, these conditions will not be fully met, leading to a deviation, i.e., non-linearity, from (2.7), and to spatial variations from one group of B bits to another across the BCA. These non-idealities and techniques to alleviate them will be described in Section 2.3.3.
A similar expression to (2.7) for ∆VBLB(D) can be obtained by replacing $\bar{d}_i$ with $d_i$ in (2.7). Thus, the FR stage converts the stored digital data D into bitline voltage drops ∆VBL(D) and ∆VBLB(D), i.e., the FR stage is a digital-to-analog converter. Additionally, the FR stage can also realize simple SD functions such as the addition and subtraction of two B-bit words (D and P) stored in different rows but in the same column. For example, from (2.7), D + P is obtained by applying FR to rows containing D and P to obtain:

$$\Delta V_{BL}(D + P) = \Delta V_{lsb}(\bar{D} + \bar{P}) = \Delta V_{lsb} \sum_{i=0}^{B-1} 2^i (\bar{d}_i + \bar{p}_i) \tag{2.8}$$

Similarly, subtraction D − P can be realized by storing $\bar{P}$ (one's complement of P) in the same column as D. Subtraction will be discussed in Section 2.4 in more detail.
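The idealized FR model of (2.3) to (2.8) can be sketched in a few lines. The sketch assumes, as the one's-complement remark after (2.7) indicates, that a bitcell discharges BL only when it stores 0. The function names and the values of T0, RBL, and CBL are hypothetical, chosen only so that RBL·CBL is much larger than every pulse width.

```python
import math


def to_bits(value: int, num_bits: int) -> list:
    """Return the bits of `value`, LSB first (d_0, d_1, ..., d_{B-1})."""
    return [(value >> i) & 1 for i in range(num_bits)]


def functional_read_drop(bits, t0, r_bl, c_bl, v_pre=1.0, ideal=True):
    """BL voltage drop after a multi-row functional read of one stored word."""
    drop = 0.0
    for i, d in enumerate(bits):
        d_bar = 1 - d                              # only cells storing 0 discharge BL (assumption)
        t_i = (2 ** i) * t0                        # binary-weighted pulse width
        current = v_pre / r_bl                     # constant-current approximation of (2.4)
        if not ideal:
            # keep the exponential factor of (2.4) instead of dropping it
            current *= math.exp(-t_i / (r_bl * c_bl))
        drop += d_bar * t_i * current / c_bl       # charge of (2.3) divided by C_BL, as in (2.6)
    return drop


# Hypothetical values: R_BL * C_BL = 10 ns, so the LSB step is 25 mV at V_PRE = 1 V.
T0, R_BL, C_BL = 250e-12, 50e3, 200e-15
DV_LSB = 1.0 * T0 / (R_BL * C_BL)

d_bits = to_bits(5, 4)     # D = 0101b, one's complement = 10
p_bits = to_bits(12, 4)    # P = 1100b, one's complement = 3
drop_d = functional_read_drop(d_bits, T0, R_BL, C_BL)
drop_sum = drop_d + functional_read_drop(p_bits, T0, R_BL, C_BL)   # both words on one BL
drop_d_exp = functional_read_drop(d_bits, T0, R_BL, C_BL, ideal=False)
```

In the ideal model the drop for D alone is 10 LSB steps and the drop for D and P together is 13 LSB steps, matching (2.7) and (2.8). Keeping the exponential factor gives a smaller drop, which is the non-linearity that Condition 1 is meant to bound.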
2.3.2 Design Guidelines
The BL swing ∆VBL generated by the FR stage is subject to the impact of spatial transistor
threshold voltage variations caused by random dopant ﬂuctuations [44], voltage-dependence
of the discharge path (access and pull-down transistor in the bitcell) resistance RBL (see
(2.7)), and the ﬁnite transition (rise and fall) times of the PWM WL access pulses. These
non-idealities can be incorporated into (2.7) as follows:

$$\Delta V_{BL}(D) = \Delta V_{lsb} \sum_{i=0}^{B-1} \frac{2^i \bar{d}_i (1 + \gamma_i)}{1 + \rho_i(V_{BL}) + \delta_i} \tag{2.9}$$
where δi is a random variable describing the impact of spatial transistor threshold voltage variations on the discharge path resistance RBL, affecting Condition 3 (Ri = RBL) in Section 2.3.1; ρi(VBL) is a variable that captures the impact of the BL voltage dependence of RBL, which affects Condition 4 (RBL should be a constant); and γi is a deterministic variable that captures the impact of finite transition times on the pulse widths Ti, affecting Condition 2 ($T_i = 2^i T_0$), with the effect on the LSB pulse width T0 being most severe.
The presence of δi, ρi, and γi imposes certain design constraints in order to alleviate their impact so that (2.9) approaches the ideal expression in (2.7). For example, ρi can be reduced by ensuring that the access transistor in the discharge path does not transition from saturation into the triode region, thereby satisfying Condition 4. This can be achieved by lowering the WL access pulse amplitude VWL, which has the additional benefit that RBL is increased, thereby making it easier to satisfy the overarching Condition 1 ($T_i \ll R_{BL} C_{BL}$). Similarly, the impact of γi can be alleviated by ensuring that the design parameter T0 is lower bounded as T0 > Tmin so that the rise (Tr) and fall (Tf) times of VWL are a small fraction of T0, e.g., Tr + Tf < 0.5Tmin, and hence Condition 2 ($T_i = 2^i T_0$) can be met. Together with Condition 1, this implies that Tmin < T0 < Tmax. Lastly, δi can be alleviated by ensuring that VWL is sufficiently large so that variations in Ri are reduced, i.e., Condition 3 (Ri = RBL) is approximated well. This lower bound on VWL can be relaxed as DIMA's aggregation process in the CBLP compensates for the impact of δi. However, VWL does have an upper bound to avoid destructive read operation, e.g., VWL < 0.8VPRE. Hence, for the prototype IC, we chose VWL = 0.65VPRE. Note that it is possible to pre-distort the data stored in the BCA in order to alleviate the impact of deterministic errors ρi and γi.
The worst-case values of ρi and γi are estimated to be less than 41% and 37%, respectively, based on measured results of the multi-functional 65 nm CMOS prototype IC. Monte Carlo post-layout simulations of the BCA show that the impact of δi leads to a 12% variation (σ/µ) in ∆VBL(D) for typical values of VWL = 0.65 V, VPRE = 1 V, T0 = 250 ps, N = 128, B = 8, and Nrow = 512. Section 3.3.3 indicates that these non-idealities have a negligible impact on inference accuracy for the data sets being considered in this work.
2.3.3 Design Techniques
We present two design techniques to overcome the design constraints described in Section 2.3.2.
• Sub-ranged Read: Realizing a highly linear FR stage when B > 4 bits is challenging because the constraint $T_{min} = 2(T_r + T_f) < T_0 < T_{max} \ll 2^{1-B} R_{BL} C_{BL}$ is hard to meet. For example, T0 < 125 ps when B = 5 and T4 = 2 ns. This value of T0 is hard to achieve when driving a high WL capacitance (e.g., 200 fF) with a row pitch-matched wordline driver. The sub-ranged read technique solves this problem as described next.
In sub-ranged read [39], the B/2 MSBs representing data DM and the B/2 LSBs representing data DL are stored in adjacent columns of the BCA as shown in Fig. 2.8(a). For example, when B = 8, DM = 8d7 + 4d6 + 2d5 + d4 and DL = 8d3 + 4d2 + 2d1 + d0. Three switches φ1,2,3 are used in a specific sequence to charge share the BL voltages. An explicit tuning capacitor Ctune enables the realization of a predefined capacitance ratio (16 : 1 for B = 8) between the MSB and LSB BLs' capacitances CM and CL, respectively. This ratio is needed to weigh the voltage drop on the MSB BL (∆VBLM) by a factor of $2^{B/2}$ more than the voltage drop on the LSB BL (∆VBLL). The desired capacitance ratio is obtained by setting CM = CBL + CBLP + Ctune and CL = CBLP as shown in Fig. 2.8(b), and varying Ctune modifies CM so as to realize $C_M : C_L = 2^{B/2} : 1$.

Figure 2.8: Sub-ranged read with B = 8: (a) BL pair structure (two neighboring bitcell columns), and (b) equivalent capacitance model [39], where DM = 8d7 + 4d6 + 2d5 + d4 and DL = 8d3 + 4d2 + 2d1 + d0.
The sub-ranged read proceeds as follows:
1. the FR process is simultaneously applied to the MSB and LSB columns with φ1,2,3 = 0 (all open), thereby generating voltage drops ∆VBLM(DM) and ∆VBLL(DL), respectively.
2. the switch φ1 is closed so that the voltage ∆VBLM(DM) is developed across CM. Switch φ2 is pulsed to generate the voltage ∆VBLL(DL) on CL.
3. the switch φ3 is closed to generate the voltage drop at the final output

$$\Delta V_{BL}(D) = \frac{1}{C_M + C_L}\left[C_M \Delta V_{BLM}(D_M) + C_L \Delta V_{BLL}(D_L)\right] \tag{2.10}$$

$$= \frac{1}{2^{B/2} + 1}\left[2^{B/2} \Delta V_{BLM}(D_M) + \Delta V_{BLL}(D_L)\right] \tag{2.11}$$

Thus, for B = 8, the voltage drop ∆VBLM(DM) is weighted 16× more than ∆VBLL(DL) to obtain the voltage drop ∆VBL(D), which is proportional to 16DM + DL.

Figure 2.9: Replica BCA: (a) bitcell column (B = 4), and (b) timing diagram for replica BCA writing [39].
• FR Replica BCA: As described in Section 2.3.1, two operands D and P are required to
implement various SD computations. For example, in order to realize the difference D − P (see (2.8)), P is stored in the same column as D but in a different row. Typically, P is streamed-in data, e.g., a template in template matching or image pixels. Storing P in the same BCA as D will require repeated SRAM write operations, which incur large energy and delay costs as these require full BL swing. This problem can be solved via the use of a replica BCA as described next.
Figure 2.10: Pitch-matched layouts of BLP blocks relative to a SRAM bitcell: (a) analog comparator, and (b) a part (1/5) of charge redistribution-based multiplier.
The replica BCA (Fig. 2.9(a)) enables fast writes of P via separate write BL (WBL) and
WL (WWL) [39]. Thus, P can be written into the replica BCA column by providing data
in a bit-serial manner via the WBL (Fig. 2.9(b)) while disabling the cross-coupled inverter
feedback loop in the replica bitcell via WWL. During a subsequent FR, the replica BCA behaves as an extension of the regular BCA. The layout of the replica bitcell needs to be similar to that of the normal bitcell, apart from the additional WBL and WWL circuitry, so that both have the same discharge strength.
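The charge-sharing combination of the sub-ranged read in (2.10) and (2.11) can be sketched as follows. Only the capacitance ratio matters, so CL is normalized to 1; the function name and the 10 mV LSB step are illustrative assumptions.

```python
def sub_ranged_read(dv_msb: float, dv_lsb: float, num_bits: int = 8) -> float:
    """Combine the MSB and LSB bitline drops with C_M : C_L = 2^(B/2) : 1, as in (2.10)."""
    ratio = 2 ** (num_bits // 2)
    c_m, c_l = float(ratio), 1.0          # normalized capacitances, only the ratio matters
    return (c_m * dv_msb + c_l * dv_lsb) / (c_m + c_l)


# Example: D = 0xA7, so D_M = 10 and D_L = 7, with an assumed 10 mV step per LSB.
STEP = 0.01
d_m, d_l = 0xA7 >> 4, 0xA7 & 0xF
combined = sub_ranged_read(STEP * d_m, STEP * d_l)
expected = STEP * (16 * d_m + d_l) / 17.0   # proportional to 16 D_M + D_L
```

The combined drop equals the step size times (16DM + DL)/17, so each half-word needs only 4-bit pulse-width weighting while the output still represents the full 8-bit value.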
2.4 Bitline Processing (BLP) and Cross BLP (CBLP)
This section describes BLP and CBLP stages, which perform various SD computations
and aggregation, respectively. The BLP and CBLP operations rely on tightly pitch-matched
analog processing (see Fig. 2.10) resulting in many implementation challenges. Thus, design
principles for key parameters are presented based on the analysis of various noise sources
such as charge injection, thermal noise, and coupling noise.
2.4.1 BLP Operation
The Ncol BLP block in Fig. 2.2(a) accepts two operands: (1) its corresponding bitline
voltage drop ∆VBL(D) generated via the FR stage, and (2) a word P to generate an output
voltage VB(D,P ). The BLP block needs to be re-conﬁgurable in order to support multiple
SD computations required by various ML algorithms. Furthermore, the BLP block layout
needs to be column-pitch matched to the BCA. Thus, the BLP stage is a massively parallel
analog SIMD processor.
Next, we describe how SD functions such as absolute difference |D − P| [35] and multiplication DP [36] can be implemented in the BLP. These SD functions are required to compute commonly used VDs: Manhattan distance (MD) ($\sum_{i=1}^{N} |D_i - P_i|$) and dot product (DP) ($\sum_{i=1}^{N} D_i P_i$).
2.4.1.1 Absolute Diﬀerence
The absolute difference |D − P| can be written as [35]:

$$|D - P| = \max(D - P, P - D) \tag{2.12}$$

From (2.8), the bitline voltage drop corresponding to the difference D − P is obtained as:

$$\Delta V_{BL}(D - P) = \Delta V_{lsb}(\bar{D} + P) \tag{2.13}$$

The intrinsically differential structure of the SRAM bitcell enables one to evaluate max(VBL, VBLB), i.e., the maximum of the voltages on the BL and BLB, quite easily via the use of a local BL compare-select as shown in Fig. 2.11(a). Thus, from (2.12) and (2.13), we get

$$\begin{aligned} \max(V_{BL}, V_{BLB}) &= \max\left(V_{PRE} - \Delta V_{lsb}(P + \bar{D}),\; V_{PRE} - \Delta V_{lsb}(D + \bar{P})\right) \\ &= \max\left(V_{PRE} - \Delta V_{lsb}(P + 2^B - 1 - D),\; V_{PRE} - \Delta V_{lsb}(D + 2^B - 1 - P)\right) \\ &= V_{PRE} - (2^B - 1)\Delta V_{lsb} + \Delta V_{lsb} \max(D - P, P - D) \end{aligned} \tag{2.14}$$

Thus, applying FR simultaneously to the rows storing D and $\bar{P}$ results in the voltage drops on BL and BLB varying linearly with P − D and D − P, respectively. The local BL compare-select block (Fig. 2.11(a)) provides the maximum of VBL and VBLB, and hence the absolute difference |D − P|.

Figure 2.11: Bitline processing (BLP): (a) absolute difference, where D and P are stored in the same column [35], and (b) charge redistribution-based multiplication with B = 4 [36].
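The compare-select relation in (2.14) can be verified with a short behavioral sketch. It assumes the ideal FR model, with D and the one's complement of P stored in the same column; the function names and the 10 mV LSB step are hypothetical.

```python
def compare_select_output(d: int, p: int, num_bits: int, dv_lsb: float,
                          v_pre: float = 1.0) -> float:
    """Return max(V_BL, V_BLB) after FR of D and the one's complement of P, as in (2.14)."""
    full_scale = 2 ** num_bits - 1
    d_bar = full_scale - d                  # one's complement of D
    p_bar = full_scale - p                  # one's complement of P
    v_bl = v_pre - dv_lsb * (p + d_bar)     # BL voltage
    v_blb = v_pre - dv_lsb * (d + p_bar)    # BLB voltage
    return max(v_bl, v_blb)


def decode_absolute_difference(v_out: float, num_bits: int, dv_lsb: float,
                               v_pre: float = 1.0) -> float:
    """Invert the last line of (2.14) to recover |D - P| from the selected voltage."""
    baseline = v_pre - (2 ** num_bits - 1) * dv_lsb
    return (v_out - baseline) / dv_lsb


v_out = compare_select_output(d=9, p=3, num_bits=4, dv_lsb=0.01)
recovered = decode_absolute_difference(v_out, num_bits=4, dv_lsb=0.01)
```

For D = 9 and P = 3 the selected voltage decodes to 6, and swapping the operands gives the same voltage, as expected for an absolute difference.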
2.4.1.2 Multiplication
Figure 2.11(b) shows a charge redistribution-based mixed-signal multiplier with inputs ∆VBLB(D) (FR stage output) and an externally provided B-bit digital word P, whose bits pi control the φ2,i switches. The multiplier output voltage VB is given by:

$$V_B(D, P) = V_B(DP) = V_{PRE} - (0.5)^B P\, \Delta V_{BLB}(D) = V_{PRE} - (0.5)^B \Delta V_{lsb} D P \tag{2.15}$$

Thus, the voltage drop ∆VB(DP) = VPRE − VB(DP) ∝ DP represents the product of D and P. Note that the multiplier employs unit-size (25 fF) capacitors rather than binary-weighted ones as in [45], due to stringent column pitch-match constraints on the BLP.
The timing diagram in Fig. 2.11(b) describes the operation of the multiplier, which is also summarized below:
• first, the unit capacitors are charged to ∆VBLB(D) by pulsing the φdump switches.
• then, the switch φ2,i is pulsed only if pi = 0, thereby charging the capacitor corresponding to the i-th bit to VPRE. The capacitors corresponding to pi = 1 retain the voltage equal to ∆VBLB(D).
• finally, the switches φ3,i are pulsed sequentially starting from φ3,0, φ3,1, ..., φ3,B−1. This leads to charge sharing between adjacent capacitors.
Figure 2.12: Cross bitline processing (CBLP).
Note that when φ3,k (k = 0, ..., B − 1) is pulsed, the two charge-sharing capacitors settle to a voltage of:

$$V_{PRE} - (0.5)^{k+1}\left(2^k p_k + \dots + 2 p_1 + p_0\right)\Delta V_{BLB}(D) \tag{2.16}$$

thereby realizing (2.15) when k = B − 1.
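The sequential charge sharing behind (2.15) and (2.16) can be sketched behaviorally. The sketch assumes equal unit capacitors, ideal switches, and one extra capacitor held at VPRE that takes part in the first sharing step (suggested by the φ2,X switch in Fig. 2.11(b), but an assumption here); the function name is hypothetical.

```python
def charge_redistribution_multiply(dv_blb: float, p_bits, v_pre: float = 1.0) -> float:
    """Return V_B after pulsing phi_3,0 ... phi_3,B-1; p_bits is LSB first."""
    # After the phi_dump and phi_2 phases: capacitors with p_i = 1 sit dv_blb below V_PRE,
    # capacitors with p_i = 0 have been recharged to V_PRE.
    cap_voltages = [v_pre - dv_blb if p else v_pre for p in p_bits]
    shared = v_pre                       # extra capacitor at V_PRE (assumption)
    for v_cap in cap_voltages:
        # two equal capacitors settle to the average of their voltages, as in (2.16)
        shared = 0.5 * (shared + v_cap)
    return shared


# Example: P = 1011b = 11 with B = 4 and an assumed BLB drop of 160 mV.
p_word = [1, 1, 0, 1]                    # p_0, p_1, p_2, p_3
v_b = charge_redistribution_multiply(0.16, p_word)
```

Each sharing step halves the accumulated term and adds the next bit, so after B steps the drop below VPRE is (0.5)^B · P · ∆VBLB(D); for the example this is 11/16 of 160 mV.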
2.4.2 Cross Bitline Processing (CBLP), ADC, and Residual Digital Logic (RDL)
The cross BL processor (CBLP) (Fig. 2.12) samples the output voltage (VB(D,P )) of the
BLP on the BL-wise sampling capacitors CS at each column by pulsing the φ1 switches.
Next the φ2 switches are pulsed to generate the CBLP output VC in one step. In this way,
CBLP implements dimensionality reduction, which is a widely used function in inference
algorithms. Finally, the CBLP output VC is converted to the digital domain by the ADC
to be stored or further processed by the RDL. The RDL implements slicing/thresholding
functions such as min, max, sign, sigmoid, and majority vote. Note that the ADC and
RDL need to process one scalar value (VC) generated from a massively parallel (> 128) SD
processing step in the BLP. Thus, the energy overhead of ADC and RDL is negligible.
2.4.3 Design Guidelines
The BLP can be conﬁgured to compute the absolute diﬀerence |D − P | (MD mode) or the
scalar product DP (DP mode).
The dominant source of non-ideality in computing |D−P | is due to the comparator oﬀset
in the compare-select block (see Fig. 2.11(a)). However, this input oﬀset aﬀects the BLP
output VB minimally. This is because the input oﬀset aﬀects the output only when VBL
and VBLB are close to each other. Thus, the BLP output VB = max(VBL, VBLB) in (2.12) is expected to have an error of only small magnitude |VBL − VBLB|. Additionally, the error in VB, being uncorrelated across the columns, gets averaged out further by the CBLP. The column pitch-matched comparator layout shown in Fig. 2.10(a) is constrained to be symmetric to minimize the input offset. Monte Carlo post-layout simulations in the 65 nm CMOS process indicate that the input offset follows the distribution $\mathcal{N}(0, (10\,\mathrm{mV})^2)$.
Computation of the product DP in the BLP (see Fig. 2.11(b)) and summation in the
CBLP (see Fig. 2.12) is done via charge redistribution circuits. These circuits suﬀer from
multiple noise sources: (1) charge-injection noise, (2) coupling noise, and (3) thermal noise.
Assuming a junction capacitance of 0.05 fF [46] using minimum-sized switches, we find that the storage capacitance C needs to be larger than 13 fF in order to ensure 8-b output precision. The 8-b precision in a swing of 300 mV results in a resolution of Vres ≈ 1 mV. Hence, thermal noise considerations ($\sqrt{kT/C} < 0.5 V_{res}$) lead to the requirement of C > 17 fF at T = 300 K. Hence, we chose C = 25 fF to provide sufficient design margin.
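The thermal-noise bound can be checked with a one-line calculation. The sketch below solves √(kT/C) < 0.5·Vres for C, using the rounded Vres = 1 mV from the text; the function name is illustrative.

```python
BOLTZMANN = 1.380649e-23   # Boltzmann constant in J/K


def min_capacitance_for_thermal_noise(v_res: float, temperature: float = 300.0,
                                      margin: float = 0.5) -> float:
    """Smallest C (in farads) for which sqrt(kT/C) stays below margin * v_res."""
    return BOLTZMANN * temperature / (margin * v_res) ** 2


c_min = min_capacitance_for_thermal_noise(v_res=1e-3)   # about 16.6 fF
```

The bound evaluates to roughly 16.6 fF, in line with the C > 17 fF requirement quoted above and comfortably below the chosen 25 fF.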
Due to the tight pitch-matching constraints, digital signals need to be routed over analog
nodes in the BLP and CBLP generating signiﬁcant coupling noise. In order to alleviate
coupling noise, low-swing analog nodes were shielded from the digital full-swing lines as
shown in Fig. 2.10(b).
2.5 Models for Circuit Non-Ideal Behavior, Energy, and Delay¹
The analog-intensive DIMA operation is subject to a number of circuit-level non-idealities
[35]. This section presents comprehensive behavioral models of dominant non-idealities in
each analog signal processing step for the prediction of application accuracy. Energy and
delay models are also provided as a function of major design parameters.
The major non-idealities are as follows:
(a) non-linearity of the FR process due to the voltage-dependent resistance of BL discharge
path
(b) local transistor threshold voltage Vth-mismatch across bitcells caused by random
dopant ﬂuctuations
(c) non-ideal sub-ranged read due to the inaccuracy of capacitance ratio between the MSB
and LSB columns
(d) input oﬀset of the analog comparator
This section focuses on Manhattan distance (MD) to analyze a variety of noise sources.
A similar analysis can be applied to the other VD kernels.
2.5.1 Circuit-Aware Behavioral Model
2.5.1.1 FR with Subtraction (summation)
The FR for MD computation is simulated with HSPICE and the non-linearity is modeled by a polynomial equation given by:

$$\Delta V'_{BL}(D, P) = c_2(\bar{D} + P)^2 + c_1(\bar{D} + P) + c_0 \tag{2.17}$$

where ∆V′BL(D, P) is a distorted version of ∆VBL(D, P), and the fitting parameters c0, c1, and c2 depend upon RBL, CBL, T0, and the process parameters including Vth, carrier mobility, saturation carrier velocity, and the channel length modulation parameter. The expression for ∆V′BLB(D, P) can be obtained by replacing $\bar{D}$ and P in (2.17) with D and $\bar{P}$, respectively.

¹ This section is adapted from M. Kang, M.-S. Keel, N. R. Shanbhag, S. Eilert, and K. Curewitz, "An energy-efficient VLSI architecture for pattern recognition via deep embedding of computation in SRAM," in 39th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). © 2014 IEEE
2.5.1.2 Variation of FR
The impact of Vth-mismatch is modeled as Gaussian distributed random variables as shown below:

$$\Delta\hat{V}_{BL}(D, P) \sim \mathcal{N}\left(\Delta V'_{BL}(D, P),\; \sigma^2_{\Delta V_{BL}(D,P)}\right) \tag{2.18}$$

where $\sigma^2_{\Delta V_{BL}(D,P)}$ is the variance of $\Delta\hat{V}_{BL}(D, P)$ due to Vth-mismatch across bitcells. The variance $\sigma^2_{\Delta V_{BL}(D,P)}$ is expressed as follows by assuming that the BL voltage drops from reading D and P are independent:

$$\sigma^2_{\Delta V_{BL}(D,P)} = \sigma^2_{\Delta V_{BL}(D)} + \sigma^2_{\Delta V_{BL}(P)} \tag{2.19}$$

Furthermore, the voltage drop by FR of D can be modeled as a sum of B binary-scaled independent Gaussian random variables when sub-ranged read is employed for every B bits. Thus, the variance $\sigma^2_{\Delta V_{BL}(D)}$ can be expressed by:

$$\sigma^2_{\Delta V_{BL}(D)} = \left\{(2^{B-1} d_{B-1})^2 + \dots + (2^2 d_2)^2 + (2 d_1)^2 + (d_0)^2\right\} \times \sigma^2_{\Delta V_{BLB}(D=1)} \tag{2.20}$$

The variance $\sigma^2_{\Delta V_{BL}(P)}$ can be obtained by replacing di in (2.20) with pi.
2.5.1.3 Sub-Ranged Read
The parasitic capacitance ratio between BLM and BLL needs to be $M (= 2^{B/2}) : 1$ for the sub-ranged read as shown in (2.10). Inaccuracy of the ratio (M : 1) causes a non-ideality, which is modeled by:

$$\hat{V}_{BL}(D, P) = \frac{M}{M + 1}\left[\{V_{PRE} - \Delta\hat{V}_{BLM}(D_M, P_M)\} + \{V_{PRE} - \Delta\hat{V}_{BLL}(D_L, P_L)\}/M\right] \tag{2.21}$$
2.5.1.4 Absolute Operation
The non-ideal BLP output $\hat{V}_{B,j}$ from the j-th column in DIMA is modeled as follows:

$$\hat{V}_{B,j} = g_1\{Q\,\hat{V}_{BL}(D_j, P_j) + \bar{Q}\,\hat{V}_{BLB}(D_j, P_j)\} + g_0 \tag{2.22}$$

$$Q = \begin{cases} 1 & \text{if } \hat{V}_{BL}(D_j, P_j) > \hat{V}_{BLB}(D_j, P_j) + V_{OS} \\ 0 & \text{otherwise} \end{cases}$$

where $\bar{Q} = 1 - Q$, and $V_{OS} \sim \mathcal{N}(0, \sigma^2_{comp})$ is the input offset voltage of the comparator, which is modeled as a zero-mean Gaussian random variable with variance $\sigma^2_{comp}$. Coefficients g0 and g1 model the voltage change due to the charge sharing between the BL and the sampling capacitor, and due to leakage current.
2.5.1.5 Sampling and Capacitive Addition
The thermal noise in the sampling capacitor is negligibly small compared to other noise sources when capacitors larger than 10 fF are employed. The non-ideal CBLP output ($\hat{V}_C$) from charge sharing N sampling capacitors can be modeled as follows:

$$\hat{V}_C = \frac{1}{N}\sum_{j=1}^{N} \hat{V}_{B,j} \tag{2.23}$$
2.5.2 Throughput Model
The throughput of conventional systems is limited by the memory access delay in memory-intensive algorithms. In such applications, the throughput improvement factor S of the DIMA can be expressed as follows:

$$S = \frac{1}{2}(LB)\left(\frac{T_{conv}}{T_{DIMA}}\right) \tag{2.24}$$

where Tconv is the cycle time for a single read access of the conventional system, and TDIMA is the sum of the delays of FR, BLP, and CBLP. TDIMA is generally larger than Tconv as DIMA accesses multiple rows to fetch a vertically stored word and completes the subsequent analog operations. The scaling factor B is due to the FR process reading B rows per access. In addition, the DIMA bypasses the L : 1 column mux, resulting in the scaling factor L. The scaling factor 1/2 occurs because the sub-ranged read reduces the effective throughput. Hence, higher throughputs (S = 3.2 ∼ 12.8) can be easily achieved, e.g., by setting B = 8, L = 4, 8, 16, and TDIMA = 5Tconv.
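The quoted range of S follows directly from (2.24). A minimal sketch, with an illustrative function name and delays expressed in units of Tconv:

```python
def throughput_gain(col_mux_ratio: int, bits_per_word: int,
                    t_conv: float, t_dima: float) -> float:
    """Throughput improvement factor S of (2.24)."""
    return 0.5 * col_mux_ratio * bits_per_word * (t_conv / t_dima)


# B = 8 and T_DIMA = 5 T_conv, for the three column mux ratios considered in the text.
gains = {mux: throughput_gain(mux, 8, t_conv=1.0, t_dima=5.0) for mux in (4, 8, 16)}
```

The three entries evaluate to 3.2, 6.4, and 12.8, matching the range stated above.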
2.5.3 Energy Model
The energy consumption of the conventional architecture and of DIMA per read operation and subsequent processing of a B-bit word for MD computation are modeled, respectively, as follows:

$$E_{conv} = B L C_{BL} \Delta V_{BL} V_{PRE} + B E_{SA} + E_{leak} + E_{logic} \tag{2.25}$$

$$E_{DIMA} = 4 C_{BL} (\Delta V_{BL}) V_{PRE} + E_{comp} + E_{leak}/S + E_{BLP} \tag{2.26}$$

The energy efficiency of DIMA can be evaluated by comparing (2.25) and (2.26). The energy component Elogic in (2.25) corresponds to EBLP in (2.26). The Ecomp in (2.26) is the energy for an analog comparator, and this is almost the same as the sense amplifier energy ESA in (2.25). The first term in (2.26) does not have the scaling factor of L as DIMA bypasses the L : 1 column muxing. The scaling factor B is missing in the first and second terms of (2.26) because the FR processes a B-bit word per single BL discharge. However, a scaling factor of four appears in the first term of (2.26) in order to read not only D but also P and to employ the sub-ranged read. Energy savings can be observed by comparing the first and second terms of (2.25) and (2.26) because LB ranges from 32 to 128 when B = 8. It is assumed that a deep-sleep mode is enabled during standby using techniques such as power gating or lowering the VDD for the BCA [47–49]. Therefore, Eleak = PleakTconv, where Pleak is the leakage power consumption. The DIMA's intrinsically parallel operation makes its effective read cycle time smaller by a speed-up factor of S (S > 3), and therefore the leakage energy is also reduced. The EBLP is smaller than the Elogic due to low-swing signal processing.
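The two energy models can be compared numerically. The sketch below implements (2.25) and (2.26) as written; the function names are illustrative, and the capacitance and swing values are hypothetical placeholders rather than measured data. Setting the non-bitline terms to zero isolates the bitline discharge energy.

```python
def energy_conventional(bits, col_mux, c_bl, dv_bl, v_pre, e_sa, e_leak, e_logic):
    """Energy per B-bit word for the conventional architecture, following (2.25)."""
    return bits * col_mux * c_bl * dv_bl * v_pre + bits * e_sa + e_leak + e_logic


def energy_dima(c_bl, dv_bl, v_pre, e_comp, e_leak, speedup, e_blp):
    """Energy per B-bit word for DIMA, following (2.26)."""
    return 4 * c_bl * dv_bl * v_pre + e_comp + e_leak / speedup + e_blp


# Hypothetical values: C_BL = 200 fF, swing = 250 mV, V_PRE = 1 V, B = 8, L = 4.
e_conv_bl = energy_conventional(8, 4, 200e-15, 0.25, 1.0, e_sa=0.0, e_leak=0.0, e_logic=0.0)
e_dima_bl = energy_dima(200e-15, 0.25, 1.0, e_comp=0.0, e_leak=0.0, speedup=3.2, e_blp=0.0)
bitline_ratio = e_conv_bl / e_dima_bl
```

With equal swing in both systems, the bitline terms differ by LB/4, i.e., a factor of 8 for L = 4 and B = 8. Any further reduction of the DIMA swing, as argued in Section 2.2, would add to this factor.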
2.5.4 Prediction of Application Accuracies
In this section, the application accuracy is predicted in terms of the probability of detection (PDET) as a function of major design parameters based on the behavioral models described in the previous sections.
The decision of pattern matching with the DIMA based on distance metrics can be represented as follows:

$$u_{opt\_DIMA} = \arg\min_{u}\{f(u) + \eta_u,\; u = 1, \dots, U\} \tag{2.27}$$

where uopt_DIMA is the decision from DIMA to find the index of the candidate with minimum distance. Here, f(·) is an error-free distance metric, U is the number of candidates, and the ηu are independent random variables representing noise due to the non-idealities of FR, BLP, and CBLP for the u-th candidate. The PDET of DIMA can be defined as follows:

$$\begin{aligned} P_{DET} &= \Pr\{u_{opt} = u_{opt\_DIMA}\} \\ &= \Pr\{f(u_{opt\_DIMA}) + \eta_{u_{opt\_DIMA}} < f(u) + \eta_u,\; u = 1, \dots, U\} \end{aligned} \tag{2.28}$$

where uopt is the correct index of the candidate with minimum distance. The circuit-aware behavioral models from (2.17) to (2.23) will be applied to (2.28) to predict the application-level accuracy. In this section, the local transistor Vth-mismatch across bitcells is considered as the only source of non-ideality in DIMA as it is the dominant source of error.
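Equation (2.21) can be expressed with error-free ideal BL voltages as follows:

$$\begin{aligned} \hat{V}_{BL}(D, P) &= V_{PRE} - \left(\frac{M}{M + 1}\right)\Delta V_{BLM}(D_M, P_M) - \left(\frac{1}{M + 1}\right)\Delta V_{BLL}(D_L, P_L) + \eta_{\Delta V_{BL}} \\ &= V_{BL}(D, P) + \eta_{\Delta V_{BL}} \end{aligned} \tag{2.29}$$

where $\eta_{\Delta V_{BL}}$ is additive Gaussian noise with mean zero and variance described by:

$$\sigma^2_{\Delta V_{BL}(D,P)} = \frac{M^2}{(M + 1)^2}\sigma^2_{\Delta V_{BL}(D_M,P_M)} + \frac{1}{(M + 1)^2}\sigma^2_{\Delta V_{BL}(D_L,P_L)} \tag{2.30}$$

Similarly, $\hat{V}_{BLB}(D, P)$ can be denoted as follows:

$$\hat{V}_{BLB}(D, P) = V_{BLB}(D, P) + \eta_{\Delta V_{BLB}} \tag{2.31}$$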
The non-ideal BLP output $\hat{V}_{B,j}(D, P)$ in (2.22) can be described with equivalent additive noise $\eta_{B,j}$ as follows, neglecting the comparator offset for simplicity:

$$\begin{aligned} \hat{V}_{B,j} &= g_1 \max\{V_{BL}(D_j, P_j) + \eta_{\Delta V_{BL}(D_j,P_j)},\; V_{BLB}(D_j, P_j) + \eta_{\Delta V_{BLB}(D_j,P_j)}\} + g_0 \\ &= V_{B,j} + \eta_{B,j} \end{aligned} \tag{2.32}$$

where $V_{B,j} = g_1 \max\{V_{BL}(D_j, P_j), V_{BLB}(D_j, P_j)\} + g_0$, which is an error-free BLP output. The $\eta_{B,j}$ is not a Gaussian random variable due to the maximum operation although $\eta_{\Delta V_{BL}}$ and $\eta_{\Delta V_{BLB}}$ are independent Gaussian random variables [50]. However, the mean and variance of $\hat{V}_{B,j}$ can be expressed as follows [51]:

$$E[\hat{V}_{B,j}] = g_1[\mu_1\Phi(\alpha) + \mu_2\Phi(-\alpha) + a\phi(\alpha)] \tag{2.33}$$

$$Var[\hat{V}_{B,j}] = g_1^2(\mu_1^2 + \sigma_1^2)\Phi(\alpha) + g_1^2(\mu_2^2 + \sigma_2^2)\Phi(-\alpha) + g_1^2 a(\mu_1 + \mu_2)\phi(\alpha) - (E[\hat{V}_{B,j}])^2 \tag{2.34}$$

where

$$\mu_1 = V_{BL}(D_j, P_j);\quad \mu_2 = V_{BLB}(D_j, P_j);\quad \sigma_1^2 = \sigma^2_{\Delta V_{BL}(D,P)};\quad \sigma_2^2 = \sigma^2_{\Delta V_{BLB}(D,P)};$$

$$a^2 = \sigma^2_{\Delta V_{BL}(D,P)} + \sigma^2_{\Delta V_{BLB}(D,P)},\quad \alpha = \frac{V_{BL}(D_j, P_j) - V_{BLB}(D_j, P_j)}{a}$$

and φ(x) and Φ(x) are the Gaussian probability density function and cumulative distribution function, respectively.
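The moment expressions in (2.33) and (2.34) can be evaluated directly. The sketch below assumes two independent Gaussian inputs and, like (2.33), leaves out the constant offset g0, which does not change the variance; the function names are illustrative.

```python
import math


def normal_pdf(x: float) -> float:
    """Standard Gaussian probability density function."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_cdf(x: float) -> float:
    """Standard Gaussian cumulative distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def max_of_gaussians_moments(mu1, sigma1, mu2, sigma2, g1=1.0):
    """Mean and variance of g1 * max(X1, X2) for independent Gaussians, per (2.33)-(2.34)."""
    a = math.sqrt(sigma1 ** 2 + sigma2 ** 2)
    alpha = (mu1 - mu2) / a
    mean = g1 * (mu1 * normal_cdf(alpha) + mu2 * normal_cdf(-alpha) + a * normal_pdf(alpha))
    second_moment = g1 ** 2 * ((mu1 ** 2 + sigma1 ** 2) * normal_cdf(alpha)
                               + (mu2 ** 2 + sigma2 ** 2) * normal_cdf(-alpha)
                               + a * (mu1 + mu2) * normal_pdf(alpha))
    return mean, second_moment - mean ** 2


# Sanity check: the maximum of two independent standard normals.
mean_sym, var_sym = max_of_gaussians_moments(0.0, 1.0, 0.0, 1.0)
```

For two independent standard normal inputs the formulas reduce to a mean of 1/√π and a variance of 1 − 1/π, and when one mean is far above the other the result approaches the moments of the larger input alone.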
The $\hat{V}_{B,j}$ in (2.23) can be substituted by (2.32) as follows:

$$\hat{V}_C = V_C + \eta_C \tag{2.35}$$

where $V_C = \frac{1}{N}\sum_{j=1}^{N} V_{B,j}$ is the error-free CBLP output and $\eta_C = \frac{1}{N}\sum_{j=1}^{N} \eta_{B,j}$ is the equivalent additive noise present in $\hat{V}_C$. Although $\eta_{B,j}$ is not a Gaussian random variable, the distribution of $\eta_C$ approaches a Gaussian distribution $\mathcal{N}(E[\eta_C], Var[\eta_C])$ as N increases by the central limit theorem [50], with the following mean and variance:

$$E[\eta_C] = \frac{1}{N}\sum_{j=1}^{N} E[\hat{V}_{B,j}] - V_C$$

$$Var[\eta_C] = \frac{1}{N^2}\sum_{j=1}^{N} Var[\hat{V}_{B,j}] \tag{2.36}$$
Now (2.28) can be expressed with $\eta_C$, and the PDET is given by:

$$P_{DET} = \Pr\{f(u_{opt}) + \eta_{C,u_{opt}} < f(u) + \eta_{C,u},\; u = 1, \dots, U\} \tag{2.37}$$

where the error-free MD computation kernel f(·) can be expressed as follows:

$$f(u) = \frac{1}{N}\sum_{j=1}^{N}\left[g_1 \max\{V_{BLB}(D_{u,j}, P_j), V_{BL}(D_{u,j}, P_j)\} + g_0\right] \tag{2.38}$$
For simplicity, the probability that every f(u) + ηC,u with u ≠ uopt in (2.37) is larger than a
constant value x is defined as:
$$h(x) = \prod_{u \neq u_{opt}} \Pr\{x < f(u) + \eta_{C,u}\} = \prod_{u \neq u_{opt}} Q\!\left(\frac{x - (f(u) + E[\eta_{C,u}])}{\sqrt{Var[\eta_{C,u}]}}\right) \tag{2.39}$$
where x ranges from 0 to 1, as the voltage level of the CBLP output cannot be lower than 0 or
higher than VDD = 1 V.
The f(uopt) + ηC,uopt in (2.37) is a random variable whereas x in (2.39) is a constant value.
Thus, h(x) is integrated against the probability density function of f(uopt) + ηC,uopt over the
entire dynamic voltage range as follows:
$$P_{DET} = \int_{x=0}^{1} h(x)\,\phi\!\left(\frac{x - (f(u_{opt}) + E[\eta_{C,u_{opt}}])}{\sqrt{Var[\eta_{C,u_{opt}}]}}\right) dx \tag{2.40}$$
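The prediction in (2.39) and (2.40) can be evaluated with a simple numerical integration. The sketch below is a hedged illustration rather than the code used for the results in this chapter: the function names are hypothetical, trapezoidal integration is one arbitrary choice, and the Gaussian density is written with its 1/σ normalization made explicit, which (2.40) leaves implicit.

```python
import math


def q_func(z):
    """Gaussian tail probability Q(z) = Pr{Z > z} for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) at x (includes the 1/std normalization)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def predict_pdet(opt, others, steps=20000):
    """Approximate P_DET per (2.39)-(2.40) on the voltage range [0, 1].

    opt:    (mean, variance) of f(u_opt) + eta_C,u_opt
    others: list of (mean, variance) of f(u) + eta_C,u for every u != u_opt
    """
    m_opt, v_opt = opt
    s_opt = math.sqrt(v_opt)
    dx = 1.0 / steps
    total = 0.0
    for i in range(steps + 1):
        x = i * dx
        h = 1.0  # h(x): probability that every competitor exceeds x, per (2.39)
        for mean_u, var_u in others:
            h *= q_func((x - mean_u) / math.sqrt(var_u))
        weight = 0.5 if i in (0, steps) else 1.0  # trapezoidal end-point weights
        total += weight * h * gaussian_pdf(x, m_opt, s_opt)
    return total * dx
```

With a single competitor the result reduces to the probability that one Gaussian is smaller than another, which gives a convenient sanity check; adding competitors can only lower the predicted PDET.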
2.5.5 Simulation Results
This section validates the behavioral models (2.17) to (2.23) against HSPICE
Monte-Carlo simulations in a 65 nm CMOS process technology. Then, the PDET of a practical
application is obtained via Monte-Carlo simulations using the validated behavioral models.
Finally, the PDET values from the Monte-Carlo system simulations are compared with the PDET
predicted by (2.40).
Table 2.2: Design and model parameters in (2.17)-(2.23).
Parameter     Value          Parameter     Value
VDD           1 V            VWL           0.4 - 0.9 V
VPRE          1 V            L             4, 8, 16
array size    256 × 512      N             2 - 256
T0            300 ps         M             16
B             8              CLK freq.     1 GHz
σcomp         10 mV          g0, g1        0.890, 0.086
σ∆VBL/µ∆VBL (D = 1)          6.4%
c0 = −3.04 × 10^−3, c1 = 0.037, c2 = −2.43 × 10^−4
2.5.5.1 System Conﬁgurations
Two banks with a 512×256 BCA per bank are assumed to store a dataset of Ds. A bit precision of
B = 8 is chosen for D and P, and sub-ranged read is applied for every 4 bits as shown in
Fig. 2.8(a). The L is chosen to be 4, 8, and 16, which are the most widely used values. It
is assumed that ∆VBL(B) = 250 mV and VWL = 1 V in a conventional SRAM to achieve a sufficient
sensing margin. A metal-oxide-metal (MOM) capacitor is used to implement the sampling
capacitor in order to balance leakage reduction and area efficiency. The analog comparator
is sized to fit in the horizontal dimension of the SRAM bitcell, and 10 mV of input offset
is assumed for system simulations [52]. Numerical values of the design and model parameters
validated in the following paragraphs are summarized in Table 2.2.
2.5.5.2 Model Validations
Figure 2.13 shows HSPICE simulation results for FR when the addition (or subtraction)
of 4-bit data D and P is computed by FR. The non-linearity is measured by the integral non-
linearity (INL), and the dynamic range of ∆VBLB is limited to 0.9 V in order to keep the INL
within 2.5 LSB. The fitting parameters for the model of (2.17), c0 = −0.003041, c1 = 0.037,
and c2 = −0.000243, result in a modeling error of less than 0.3 LSB. Figure 2.14 shows the
accuracy of sub-ranged read for D with B = 8 from HSPICE simulation. The INL is less
than 6 LSBs and the modeling error of (2.21) with M = 16 is smaller than 1.6 LSBs.
The complete behavioral models covering the entire analog signal processing chain from
the bitcell (D) to the final output (VˆC) with N = 8 are validated in Fig. 2.15. The VC
obtained from HSPICE Monte-Carlo simulations is compared with the output obtained
from the behavioral models described by (2.17)-(2.23). Seven different
combinations of D and P are chosen to measure the accuracy of the behavioral models. The
maximum error of the models is 4.4% of the dynamic range of VC, which is sufficiently
accurate to compare the relative magnitudes of different VC values.
Figure 2.13: ∆VBLB from simulations and model (2.17) during FR with B = 4, VPRE = 1
V, and T0 = 300 ps. (Plot axes: ∆VBLB [V] and INL [LSB] versus D + P [LSB].)
Figure 2.14: Sub-ranged read with HSPICE simulations and behavioral models with
M = 16 (2.21). (Plot axes: VBL [V] versus 8-b D.)
Figure 2.15: VC from circuit simulations and system simulations with behavioral models
(2.17) to (2.23) when N = 8 and VWL = 800 mV. (Plot axes: VC [mV] for Case1 to Case7.)
2.5.5.3 Error Behavior
The ∆VBL(D) is measured from HSPICE simulations with different values of VWL as shown
in Fig. 2.16. D = 14 = 1110(2) is chosen to maximize σ∆VBL(D)/µ∆VBL(D), as the BL
is discharged by a single transistor without an averaging-out effect. The σ∆VBL(D) is also
measured from HSPICE Monte-Carlo simulations to estimate σ²∆VBL(D,P) via (2.19) and
(2.20), which are used in the system simulations to estimate the PDET of DIMA. Reducing
VWL improves the energy efficiency of DIMA through a lower BL swing, but increases the
variation of FR, as shown in Fig. 2.16, because the access transistor operates in the
near-threshold voltage regime.
Figure 2.16: ∆VBL(D) and σ∆VBL(D)/µ∆VBL(D) for D = 14 with N = 4. (Plot axes: ∆VBL [mV]
and σ∆VBL/µ∆VBL versus VWL [mV].)
Figure 2.17: Expected hardware error in (2.35) for D = 128 and P = 240. (Plot axes: E[η²C]
[mV²] versus length of vector N, for VWL = 400 mV to 900 mV; the power of the deterministic
error is shown scaled by 10².)
Figure 2.18: Face recognition application.
Figure 2.17 shows the expected power of error E[η²C] at the CBLP output of the DIMA for
different values of N and VWL. The ηB of (2.32) is obtained from HSPICE Monte-Carlo
simulations with D = 128 and P = 240, which is one of the input combinations for which the
random component of the hardware error at BLB is maximized. Then, E[η²C] is obtained
through system simulations with (2.23) to observe the trend of the error. The power of the
random error decreases as N and VWL increase, as indicated in (2.36) and Fig. 2.16. This
trend shows that the DIMA favors large templates, where the error from process
variation during the FR is averaged out more effectively per (2.36), and thus VWL can be
scaled down further to maximize energy efficiency.
2.5.5.4 Application
The DIMA is employed for face recognition using the MD distance metric in (2.41) to evaluate
the impact of the DIMA's non-idealities on the application accuracy.
$$MD(\mathbf{D}_u, \mathbf{P}) = \sum_{j=1}^{N} |D_{u,j} - P_j| \tag{2.41}$$
As face recognition is a multi-class classification task (128 classes in this application), it
requires much higher accuracy than simple applications such as binary classification.
Furthermore, face recognition requires higher resolution than applications that need to
distinguish entirely different types of objects, e.g., cars vs. humans. Therefore, face
recognition is chosen to evaluate the accuracy of DIMA conservatively.
One hundred twenty-eight face images (D) from the MIT CBCL data set [53], each of which
is a gray-scale image with √N × √N pixels and B = 8, are pre-stored in the BCA, one per
candidate (see Fig. 2.18). If N pixels cannot be accessed within one FR, all stages
including FR, BLP, CBLP, and the subsequent ADC operation are iterated to process all N
pixels. Here, it is assumed that N is a multiple of Ncol, the number of columns in the
BCA. An 8-bit single-ramp ADC is employed to convert and store the value of VC
with sufficient resolution to recognize the templates. The ADC conversion can proceed
in parallel with other DIMA stages. The resulting digital MD values are compared with a
temporary minimum in digital logic to update the temporary minimum and its pointer
address (u) in (2.41). The entire process terminates when 128 such updates are completed.
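The running-minimum search described above can be sketched in a few lines of Python. This is an idealized, error-free digital model for illustration only: the function names `manhattan_distance` and `match_template` are hypothetical, and no analog non-ideality or ADC quantization is modeled.

```python
def manhattan_distance(d, p):
    """MD(D_u, P) per (2.41): sum of absolute element-wise differences."""
    return sum(abs(a - b) for a, b in zip(d, p))


def match_template(candidates, template):
    """Return (u, MD) of the candidate closest to the template.

    Mirrors the digital logic: each new MD value is compared with a temporary
    minimum, and the minimum and its pointer address u are updated on a hit.
    """
    best_u, best_md = None, None
    for u, candidate in enumerate(candidates):
        md = manhattan_distance(candidate, template)
        if best_md is None or md < best_md:
            best_u, best_md = u, md
    return best_u, best_md
```

For instance, with candidates `[[0, 0, 0, 0], [10, 20, 30, 40], [12, 18, 33, 41]]` and template `[11, 19, 31, 40]`, the distances are 101, 3, and 5, so the search returns index 1.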
2.5.5.5 System Performance (PDET )
System simulations are performed with the Monte-Carlo method to measure the detection
probability PDET of a template using the behavioral models in (2.17)-(2.23) with parameters
from Table 2.2. Eight face image samples from the set of 128 D images are randomly chosen
as the template P, and the PDET per template P is obtained by counting the number of
correct detections out of multiple trials. The overall PDET is obtained by averaging the 8
P-specific PDET values. The PDET is also measured with N = 64 and 16, which are obtained
by sub-sampling the images D and P with N = 256.
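The counting procedure can be illustrated with a simplified Monte-Carlo sketch. It is not the system simulator used here: instead of the behavioral models (2.17)-(2.23), it assumes independent zero-mean Gaussian noise of a single standard deviation added to each candidate's distance, and the name `estimate_pdet` is hypothetical.

```python
import random


def estimate_pdet(distances, opt_index, noise_std, trials=2000, seed=0):
    """Fraction of trials in which the noisy minimum is still the true best match.

    distances: error-free distance of each candidate to the template
    opt_index: index of the candidate with the smallest error-free distance
    noise_std: assumed standard deviation of the additive output noise
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        noisy = [d + rng.gauss(0.0, noise_std) for d in distances]
        detected = min(range(len(noisy)), key=noisy.__getitem__)
        if detected == opt_index:
            hits += 1
    return hits / trials
```

Averaging the returned value over several templates gives an overall PDET in the same spirit as the 8-template average described above.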
The conventional system with ∆VBL(B) = 250 mV and VWL = 1 V achieves PDET = 1 with
N = 16, 64, and 256. The Monte-Carlo system simulations show that the
DIMA achieves PDET = 1 with VWL ≥ 600 mV and N = 256, and achieves higher PDET with
higher N and VWL as shown in Fig. 2.19. It can be concluded that the non-idealities of
the DIMA are effectively compensated by the inherent error resiliency of the MD algorithm
under these conditions.
Figure 2.19: Probability of detection (PDET) of DIMA from system simulations and
prediction model (2.40). (Plot axes: PDET versus VWL [mV]; Monte-Carlo and predicted curves
for N = 256, 64, and 16.)
The PDET predicted by (2.40) is slightly higher than that from Monte-Carlo simulations,
as shown in Fig. 2.19, because the effects of the non-linearity of FR and the comparator offset
are excluded from this prediction. The predicted PDET is most accurate with the
largest N = 256, where the maximum error magnitude of the predicted PDET is 0.011.
This is because ηC approaches a Gaussian distribution by the central limit theorem as N
increases [50]. The accuracy of the prediction decreases with smaller values of N, and the
maximum error magnitude of the predicted PDET is 0.066 and 0.067 with N = 16 and 64,
respectively.
2.6 Conclusion
This chapter described the DIMA's four processing stages: (1) FR, (2) BLP, (3) CBLP, and
(4) thresholding, based on the common functional flow of ML algorithms. DIMA's robustness
in the low-SNR regime was demonstrated with design principles and measurement results
of a prototype IC. The design guidelines and techniques provided in this chapter are applied
to two prototype ICs, which will be introduced in the following chapters. This chapter also
proposed energy, delay, and behavioral models of the non-ideal behavior of the analog circuitry.
These models will be employed to estimate the application-level accuracy and the energy and
delay benefits for various algorithms and applications in the rest of this dissertation.
Chapter 3
DIMA PROTOTYPE INTEGRATED CIRCUITS
Chapter 2 demonstrated DIMA's versatility by enabling various VD computations and
demonstrating signiﬁcant energy and delay beneﬁts in simulations. This chapter realizes
the DIMA concept as two prototype ICs: (1) multi-functional DIMA [39], and (2) random
forest (RF) DIMA [54] in a 65 nm process. This chapter begins with the description of multi-
functional DIMA, which supports four diﬀerent algorithms: support vector machine (SVM),
template matching (TM), k-nearest neighbor (k-NN), and matched ﬁlter (MF). Design de-
tails including chip architecture, circuit techniques, and measured results are provided in the
following sections. Then, those design principles are extended to enable the RF algorithm,
which is an ensemble of many decision trees. The DIMA prototype IC for RF is described in
the last section of this chapter. To the best of our knowledge, this is the ﬁrst IC realization
of the RF algorithm.
3.1 Multi-Functional DIMA Architecture
The following three sections present a multi-functional DIMA in a 65 nm CMOS process. The
chip architecture (Fig. 3.1) comprises a DIMA core (CORE), a digital controller (CTRL),
and an input register to stream in the operand P . The CORE includes a 512×256 BCA,
the conventional SRAM read/write circuitry, the BLP and CBLP, and four 8-bit single-slope
ADCs. The RDL is embedded in the digital CTRL.
The SRAM bitcell was custom-designed following standard design rules as the memory
compiler did not allow modifications to be made to the peripheral circuitry. As a result,
the horizontal and vertical dimensions of the bitcell were approximately 1.7× larger than
those of typical foundry-provided bitcells [55]. The column muxing ratio was chosen to be
L = 4 to maximize the throughput for the standard SRAM read. Thus, the SA (and write driver)
is shared by four columns.
Figure 3.1: The multi-functional DIMA architecture.
An 8-bit precision is chosen for D and P in order to maintain almost the same accuracy
as floating point [5,56,57]. Based on the design principles in Section 2.3.1, the parameter
values VWL = 0.65 V and T0 ≈ 250 ps were chosen so that the longest PWM-WL pulse width
satisfies T3 < 0.4 RBL CBL, thereby ensuring sufficient linearity and avoiding destructive
reads. Sub-ranged FR was employed to access 4 MSBs and 4 LSBs simultaneously from
adjacent column pairs. As mentioned earlier, tuning capacitors are attached to the BLs to
realize the 16:1 capacitance ratio for the column pair. The area overhead due to DIMA
circuitry in the CORE was found to be 25%.
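The role of the 16:1 capacitance ratio can be illustrated with an idealized charge-sharing model. The sketch below assumes a perfectly linear BL voltage drop per 4-b nibble and ideal charge sharing between the two BLs; the function names and the per-LSB swing `dv_lsb` are hypothetical and do not model the measured non-linearity.

```python
def split_nibbles(d):
    """Split an 8-b word into its 4-b MSB and LSB nibbles (D_M, D_L)."""
    return d >> 4, d & 0xF


def merge_subranged(v_msb, v_lsb, ratio=16):
    """Charge-share two nodes whose capacitances are in a ratio:1 proportion."""
    return (ratio * v_msb + v_lsb) / (ratio + 1)


def ideal_subranged_read(d, dv_lsb):
    """Merged BL voltage drop for word d, assuming dv_lsb volts of drop per nibble LSB."""
    d_m, d_l = split_nibbles(d)
    return merge_subranged(d_m * dv_lsb, d_l * dv_lsb)
```

Under these assumptions the merged drop equals `dv_lsb * d / 17`, i.e., it is proportional to the full 8-b value `16 * D_M + D_L`, which is why the 16:1 weighting recovers the binary significance of the two nibbles.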
The serially provided reconﬁguration word RCFG initializes the local controllers in the
CTRL. Four slow but energy-eﬃcient 8-bit single-slope ADCs [58] are employed to convert
the analog CBLP outputs in parallel.
3.2 Multi-Functional DIMA Operations
3.2.1 Timing
The chip operations are sequenced via the CTRL which operates with a master 1 GHz CLK
thereby providing a 1 ns time resolution for generating various control signals synchronized
to CLK. Self-timed control [59, 60] can improve the throughput of both normal read and
DIMA operations, but we chose a synchronous design for simplicity.
The timing diagram in Fig. 3.2(a) describes the series of ten events that occur during a
word-row period, i.e., when processing a single word-row of B = 8 bits through the FR, BLP,
and CBLP stages. The first event in both MD and DP modes is the BL precharge. Next, the
FR, BLP, and CBLP stages are sequentially executed to generate the corresponding outputs
VBL, VB, and VC, respectively. The two modes differ in when P is transferred: the MD mode
requires transferring P from the input buffer into the replica BCA before initiating FR,
whereas the DP mode transfers P to the mixed-signal multiplier
before initiating BLP and requires additional delay in the CBLP stage to support sub-ranged
processing. The last event samples the CBLP output to generate the input voltage for the
ADC. Each event requires an integer number of CLK cycles, which is estimated via post-
layout simulations. The prototype IC provides tunability to allow additional CLK cycles
to be introduced for each stage in order to accommodate deviations from the nominal process
corner.
Figure 3.2(b) shows the timing diagram for processing 256-dimensional D and P vectors. Each
word-row consists of 128 8-b words (though Ncol = 256, the use of sub-ranged read results
in a 128-dimensional vector), which together generate the CBLP output VC. Two word-rows are
processed consecutively, and their CBLP outputs are sampled and charge-shared (Merge_SP step)
to aggregate the 256 scalar elements of the 256-dimensional vectors D and P. An available
one of the four ADCs then digitizes the analog CBLP output into an 8-b word.
The single-slope ADC conversion takes 140 CLK cycles in both MD and DP modes, which
is approximately 5.6 MD word-row periods. However, this slow conversion rate is not an
issue as the ADCs operate in parallel. The ADC output is further processed in the RDL
block to realize the thresholding operation (Threshold_EN step).
Figure 3.2: DIMA timing diagrams for processing: (a) a single word-row, and (b) multiple
word-rows (dotted red line: single thread to process 256 words). (Events in (a): BL precharge,
input buffer fetch, replica BCA write, replica BCA read, BCA read (VBLM, VBLL), VBLM and VBLL
merge, BLP (VB), dump VB on MSB and LSB_Rails, MSB and LSB_Rail merge, and sample.)
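A back-of-the-envelope check shows why four parallel ADCs are enough. The sketch below uses only the figures quoted above (1 ns CLK period, 140-cycle conversion, roughly 5.6 MD word-row periods per conversion, two word-rows per 256-dimensional vector); the derived 25 ns word-row period is therefore approximate, and sampling and Merge_SP overheads are ignored. The function names are hypothetical.

```python
CLK_PERIOD_NS = 1.0          # 1 GHz master clock
ADC_CYCLES = 140             # single-slope conversion time in CLK cycles
ADC_IN_MD_PERIODS = 5.6      # conversion time expressed in MD word-row periods (approximate)
NUM_ADCS = 4
WORD_ROWS_PER_VECTOR = 2     # two word-rows are merged per 256-dimensional vector


def md_word_row_period_ns():
    """MD word-row period implied by the quoted ADC conversion time."""
    return ADC_CYCLES * CLK_PERIOD_NS / ADC_IN_MD_PERIODS


def adcs_keep_up():
    """True if round-robin ADCs finish conversions at least as fast as vectors arrive."""
    vector_interval_ns = WORD_ROWS_PER_VECTOR * md_word_row_period_ns()
    conversion_interval_ns = ADC_CYCLES * CLK_PERIOD_NS / NUM_ADCS
    return conversion_interval_ns <= vector_interval_ns
```

Under these assumptions a new CBLP output arrives about every 50 ns in the MD mode, while four interleaved ADCs complete one conversion every 35 ns on average, so the conversion does not limit throughput.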
3.2.2 Algorithm and Application Mapping
Four tasks (face detection using SVM, gun shot detection using MF, face recognition
using TM, and handwritten digit recognition using k-NN) (see Fig. 3.3) were mapped onto
the prototype IC. These tasks cover both binary and multi-class (4-class and 64-class)
scenarios, require both the MD and DP modes of operation, and process both image
and sound data sets [53, 61, 62], as summarized in Table 3.1. Table 3.2 defines the set of
operations per stage of the CORE that can be chosen using the RCFG word during one word-
row period. In this way, the prototype IC is able to realize the four different algorithms
as shown in Table 3.3.
Figure 3.3: Four inference tasks mapped on to the prototype IC: (a) SVM for face
detection, (b) MF for event detection, (c) TM for face recognition, and (d) k-NN for
handwritten number recognition.
The MF (Fig. 3.3(b)) makes its decision right after a single DP computation and thresholding.
On the other hand, SVM requires signed coefficients. Thus, the absolute values of the positive
and negative coefficients are stored in separate rows as shown in Fig. 3.3(a). The positive
and negative products are computed in consecutive cycles, and then compared in the RDL
stage to obtain the sign of the DP. TM (Fig. 3.3(c)) and k-NN (Fig. 3.3(d)) make decisions
after comparison (to find the minimum) or majority voting across multiple candidates. All the
data sets are processed fully on-chip except for k-NN, where the last step of majority voting
was done off-chip.
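The signed-coefficient mapping for SVM can be illustrated with an idealized digital model. The sketch below is only a functional analogue of the two-cycle scheme: the function names are hypothetical, the bias term and the convention that a non-negative result maps to the positive class are assumptions, and no analog error is modeled.

```python
def split_signed(weights):
    """Split signed weights into magnitudes of the positive and negative parts."""
    positive = [max(w, 0) for w in weights]
    negative = [max(-w, 0) for w in weights]
    return positive, negative


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def svm_decide(weights, bias, p):
    """Binary decision from two unsigned dot products computed in consecutive cycles."""
    positive, negative = split_signed(weights)
    in1 = dot(positive, p)   # first cycle: row holding |positive coefficients|
    in2 = dot(negative, p)   # second cycle: row holding |negative coefficients|
    return 1 if in1 - in2 + bias >= 0 else -1
```

Because `in1 - in2` equals the signed dot product, only unsigned values ever need to be stored in the bitcell array, at the cost of a second DP cycle per decision.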
3.2.3 Design Techniques for Re-conﬁgurability
Enabling the computation of multiple functions via mixed-signal circuitry under stringent
column pitch-matching constraints is very challenging, as shown in Fig. 2.10. Two design
techniques are presented in order to support the MD and DP modes in the BLP.
Table 3.1: Measured data set for four applications [53,61,62].
Table 3.2: Multi-functions in each processing stage.
Stage   Configurations
FR      ① Normal read; ② Digital to analog conversion; ③ Scalar ADD or SUBT
BLP     ① Scalar MULT; ② BL-wise sampling; ③ Absolute value
CBLP    ① Aggregation; ② Weighted aggregation
RDL     ① MIN or MAX; ② Linear combination; ③ Send outside chip
Table 3.3: Configurations of each stage to enable four algorithms (the operations
corresponding to numbers are described in Table 3.2).
Mode   Algorithm   FR      BLP     CBLP   RDL
DP     SVM         ②       ①, ②    ②      ②
DP     MF          ②       ①, ②    ②      ②
MD     k-NN        ②, ③    ②, ③    ①      ③
MD     TM          ②, ③    ②, ③    ①      ①
• Circuitry sharing with reconfiguration: Figure 3.4(a) shows the BLP (and CBLP) circuit
implementation that supports the DP and MD modes. In the MD mode, an analog comparator
and a mux implement the max operation in (2.14) to compute the absolute value |D − P|.
In the DP mode, the comparator is bypassed and the mux always chooses BLB. In both modes,
the capacitor C in the red box of the BLP multiplier (see Fig. 3.4(b)) is shared with the CBLP
to realize the BL-wise sampling capacitor CS in Fig. 2.12.
• Sub-ranged processing: The charge-based multiplier in Fig. 2.11(b) employs unit capaci-
tors to meet the column pitch constraints, which necessitates sequential processing of the
multiplicand bits (pi) and thereby limits the throughput. Sub-ranged multiplication alleviates
this problem by employing two 4-b multipliers, one for the MSBs and one for the LSBs, which
operate in parallel (Fig. 3.4(a)) and dump their VB outputs on the MSB_Rail and LSB_Rail,
respectively, while φ1 = 1. Then, the switch φ1 opens to make the capacitance of the MSB_Rail
16× larger than that of the LSB_Rail. Subsequent charge sharing of the MSB_Rail and LSB_Rail
is done by setting φ1 = 0 and φ2 = 1 to generate the final output VC. In this manner,
sub-ranged processing improves the throughput of the DP mode by a factor of two.
Figure 3.4: BLP and CBLP implementations for reconfiguration: (a) overall structure, and
(b) re-configurable charge-based multiplier for 4-b MSB (or LSB) and its enabling signals
(only red marked area is used in MD mode).
Figure 3.5: Multi-functional DIMA die micrograph.
3.3 Measured Results of Multi-Functional DIMA IC
This section shows measured results of the prototype IC, packaged in an 88-pin QFN as
shown in Fig. 3.5 and summarized in Table 3.4.
3.3.1 Accuracy of FR
The measured results of sub-ranged FR of the 8-b word D are shown in Fig. 3.6(a). The BL
voltage drop ∆VBL generated by FR for all 256 values of D was measured at the output of
the column mux along the normal SRAM read path. The integral non-linearity (INL) was
found to be less than 0.87 LSB. The sudden jump when D transitions from 7 to 8 is due to
the large change in the average transition time of the WL pulses. The value of the worst-case
γi (γ0) was estimated by comparing the slope of the curve between D = 0 and D = 1
with that between D = 0 and D = 8. Similarly, the value of the worst-case ρi(VBL) (ρ3(0.5 V)) was
estimated by comparing the slope of the curve between D = 6 and D = 7 with that between
D = 14 and D = 15. These worst-case values of γi and ρi in (2.9) were found to be less than
41% and 37%, respectively.
Table 3.4: Multi-functional DIMA prototype IC summary.
Technology             65 nm CMOS
Die size               1.2 mm × 1.2 mm
CTRL operating freq.   1 GHz
SRAM capacity          16 KB (512 × 256-b)
Bitcell dimension      2.11 × 0.92 µm²
Supply voltage         CORE: 1.0 V, CTRL: 0.85 V
The variation in ∆VBL due to δi was measured via the following 4-step process: (1) store the
same data across the entire BCA, (2) access the first word-row via FR, (3) aggregate the
∆VBLs via the CBLP and sample them to generate the output voltage ∆VC (= VPRE − VC),
and (4) repeat this process for all 128 word-rows. The data DM = DL = 7 is chosen to
generate the worst-case (maximum) variation in ∆VBL, as in this case the BL is discharged
by a single SRAM bitcell. Fig. 3.6(b) shows that the variation in ∆VBL has a standard
deviation σ = 2.5 mV, which is only 0.6% of the dynamic range (410 mV) of the CBLP output
VC in this test mode. This small deviation arises because of the aggregation effect of the CBLP
as described in Section 2.2.3. In Section 3.3.3, we show that these errors have a negligible
impact on the accuracy of the inference tasks considered in this chapter.
3.3.2 Accuracy of CORE Output
We characterize the accuracy of the CORE output as a whole, covering the FR, BLP, and CBLP
stages, because it is difficult to isolate the BLP and CBLP outputs from each other. The
CORE output shown in Fig. 3.7 was measured with the same data (D and P) stored in all
the columns, for both the DP and MD mode computations. The measured error magnitudes
at VC (from the ideal linear trend) in the DP and MD modes are < 18 mV and < 28 mV with
means of 4 mV and 8 mV, respectively, over all the combinations of (D, P). Though
these errors are significantly larger than the chosen target resolution Vres = 1 mV, they
are easily masked by choosing the hyper-parameters of the inference task to provide
a sufficiently large decision margin, as will be discussed next. In addition, it is possible to
train the engine in the presence of these errors to obtain circuit-optimized hyper-parameters.
Figure 3.6: Measured FR accuracy of 8-b D: (a) sub-ranged read, and (b) impact of
spatial variation on ∆VC. (Plot axes: (a) ∆VBL [V] from FR and INL [LSB] versus 4-b DM, with
an inset versus 4-b DL; (b) ∆VC [mV] versus row address.)
Figure 3.7: Measured CORE analog output with 8-b operands D and P in the: (a) MD
mode (Σ|Di − Pi| ∝ VC), and (b) the DP mode (ΣDiPi ∝ ∆VC).
3.3.3 Energy, Delay, and Accuracy
We consider the energy consumption of the CORE block only because its energy scales up
with the number of banks and the BCA size. In contrast, the energy of the CTRL block is
amortized over the number of banks and the BCA size. We measured the CORE decision
energy and decision accuracy for the SVM (face detection, binary class) and TM (face recogni-
tion, 64-class) tasks. The CORE decision energy was normalized by the number of 8-b data
words processed per decision to obtain the energy-per-word as a function of the BL swing per
LSB, ∆Vlsb = ∆VBL(DM = 15)/15 in Fig. 3.6(a).
Figure 3.8: Measured BL swing (∆Vlsb) vs. energy vs. probability of correct detection
(Pdet) trends: (a) ∆Vlsb vs. CORE energy, (b) ∆Vlsb vs. Pdet, and (c) CORE energy vs.
Pdet.
Figure 3.8(a) indicates that the CORE energy reduces at a rate of 0.2 pJ (0.4 pJ) per 20 mV
of ∆Vlsb reduction for the binary, DP-mode (64-class, MD-mode) task. The greater slope of the
energy vs. ∆Vlsb plot for the MD mode, and its higher energy consumption for ∆Vlsb > 15
mV, arise because the MD mode uses the replica BCA, which causes an additional voltage drop on
the BL during FR.
The accuracy of the inference task is measured by the probability of detection, obtained
by dividing the number of correctly classified queries by the total number of queries.
Figure 3.8(b) shows that the binary task is more robust than the 64-class task at the same
∆Vlsb. Furthermore, the binary and the 64-class tasks achieve > 90% detection accuracy
for ∆Vlsb > 15 mV and ∆Vlsb > 25 mV, respectively. Figure 3.8(c) plots the probability
of detection against the CORE energy per 8-b pixel and shows that accuracy and energy
trade off against each other.
Next, we compare the DIMA prototype with a conventional 8-b digital reference architec-
ture (REF). REF is a 2-stage pipelined architecture comprising an SRAM of the same size as
the one in the DIMA prototype and a digital block synthesized separately for realizing an SVM
(DP mode) and a TM (MD mode). The energy and delay of the digital block in REF were
estimated from post-layout simulations. The energy and delay of the SRAM in REF were
measured from the DIMA prototype in the normal read mode. Figure 3.9 shows the energy
breakdown for REF, DIMA* (post-layout simulations of the DIMA prototype IC), and the
DIMA prototype IC. The measured energy savings in the DP and MD modes are 10× and
5.5×, respectively, due to small-swing FR, BLP, and CBLP. Furthermore, we find that the
DIMA energy estimates obtained from post-layout simulations are close to those obtained
from measurements.
Figure 3.9: Energy comparison of DIMA with the reference architecture (REF). DIMA*
represents the energy obtained from post-layout simulations while DIMA shows the
measured values. (Plot: energy per 8-b pixel [pJ], broken down into computation and memory
access, for REF, DIMA*, and DIMA in the MD mode (5.5× reduction) and the DP mode (10.0×
reduction).)
Table 3.5 shows that the DIMA prototype IC achieves negligible (≤1%) accuracy degra-
dation for all four tasks as compared to REF. DIMA requires 16× fewer read accesses than
REF for a fixed data volume, resulting in up to 5.8× throughput enhancement.
This is because FR and BLP process data in a massively parallel manner (128 8-b words per
access), whereas the normal SRAM mode fetches only eight 8-b words per access through 4:1
column muxing. The smaller effective ∆VBL and fewer read accesses reduce the data access
energy. The low-swing computations in the BLP and CBLP stages add to the energy savings.
The DIMA prototype IC implements four different algorithms while achieving better decision
accuracy than, and an energy-delay product (scaled for 65 nm) comparable to, the single-
function ICs [5, 41] listed in Table 3.5.
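The 16× reduction in read accesses follows directly from the words delivered per access. The short sketch below reproduces that arithmetic for a fixed data volume; the function name `read_accesses` is hypothetical, and the calculation ignores any per-access timing differences between the two modes.

```python
WORDS_PER_DIMA_ACCESS = 128    # FR and BLP process 128 8-b words per access
WORDS_PER_NORMAL_ACCESS = 8    # 256 columns / 4:1 column mux = 64 bits = eight 8-b words


def read_accesses(num_words, words_per_access):
    """Number of read accesses needed to cover num_words (ceiling division)."""
    return -(-num_words // words_per_access)


def access_reduction(num_words):
    """Ratio of normal-read accesses to DIMA accesses for the same data volume."""
    return (read_accesses(num_words, WORDS_PER_NORMAL_ACCESS)
            / read_accesses(num_words, WORDS_PER_DIMA_ACCESS))
```

For a 256-word vector this gives 32 normal reads versus 2 DIMA accesses, i.e., the 16× ratio quoted above.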
Table 3.5: Application-level gains of multi-functional DIMA in energy efficiency, delay,
and accuracy, and comparison with prior art.
* memory (digital) energy and delay measured from prototype IC (post-layout simulations);
assumes a 32-bank configuration;
single function with SRAM memory access cost not included;
** single function with 1-b weight vector
3.4 Random Forest (RF) DIMA IC¹
This section presents an IC realization of a random forest (RF) ML classifier based on the DIMA
platform [54]. The RF classifier [63] is attractive due to its high accuracy, simple operations
(comparisons), applicability to multi-class problems, and robustness to non-ideal compu-
tations owing to its majority-voting-based decision. However, realizing an energy-efficient
implementation of the RF algorithm is challenging due to its high data access rate combined
with its highly irregular data access pattern. This section presents an energy-efficient and
high-throughput RF classifier IC that employs: (1) deterministic subsampling (DSS) to
reduce interconnect complexity, (2) a balanced decision tree to regularize the memory access
pattern, and (3) deeply embedded analog computations [39] in the periphery of an SRAM bitcell
array (BCA) to exploit the inherent algorithmic error tolerance. To the best of our knowl-
edge, this is the first IC implementation of the RF algorithm, as only FPGA,
GPU, and multi-core processor implementations of the RF algorithm have been reported [63].
These fail to take advantage of the opportunities afforded by analog computations.
¹ This section is adapted from M. Kang, S. Gonugondla, and N. R. Shanbhag, "A 19.4 nJ/decision 364
K decisions/s in-memory random forest classifier in 6T SRAM array," in 47th IEEE European Solid-State
Circuits Conference (ESSCIRC). © 2017 IEEE
Table 3.6: Number of required operations in proposed/conventional RF.
- proposed / conventional
8 bytes per SRAM access assumed
τ(m,n): threshold level of the n-th node in the m-th tree
p(m,n): pixel index of the n-th node in the m-th tree
Pm: [pm,1, pm,2, ..., pm,N]
RSS: random subsampling by sample pattern Pm
x(p(m,n)): p(m,n)-th pixel of input image X
c(m,l): label corresponding to the l-th leaf node in the m-th tree (m: 1 ∼ M, n: 1 ∼ N, l:
1 ∼ N+1)
3.4.1 Background
This section explains the RF algorithm and its implementation challenges.
3.4.1.1 RF Algorithm
The RF algorithm (Fig. 3.10(a)) consists of M decision trees. The m-th tree processes data
obtained by random subsampling (RSS) of the input image (X) using a pseudo-random pattern
vector Pm. The n-th node in the m-th tree compares x(p(m,n)), which is the pixel (or feature)
indexed by p(m,n), with a threshold τ(m,n) to obtain a node-level binary decision q(m,n). Either
the left or right branch is taken based on q(m,n). This process is repeated until a leaf node is
reached. The label c(m,l) corresponding to the l-th leaf node is the tree-level decision. The
final decision is obtained by majority voting over the M tree-level decisions.
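The functional flow described above can be illustrated with a minimal Python sketch. It is not the implementation used in the prototype IC: the data layout (lists `P`, `tau`, `children`, `labels`), the function names, and the convention that q = 1 selects the right branch are assumptions made only for illustration.

```python
from collections import Counter

def tree_decision(X, P, tau, children, labels):
    """Walk one decision tree and return its tree-level label.

    X        : input image as a flat list of pixels
    P        : P[n] is the pixel index p(m,n) used by node n
    tau      : tau[n] is the threshold of node n
    children : children[n] = (left, right); each entry is ("node", k) or ("leaf", l)
    labels   : labels[l] is the class label c(m,l) of leaf l
    """
    n = 0  # start at the root node
    while True:
        q = 1 if X[P[n]] > tau[n] else 0   # node-level binary decision q(m,n)
        kind, idx = children[n][q]         # assumed: q = 1 takes the right branch
        if kind == "leaf":
            return labels[idx]
        n = idx

def rf_classify(X, forest):
    """Majority vote over the tree-level decisions (ties go to the first label seen)."""
    votes = [tree_decision(X, *tree) for tree in forest]
    return Counter(votes).most_common(1)[0][0]
```

Each tree visits only the nodes on its chosen path, which is the source of the irregular memory access pattern discussed in Section 3.4.1.2.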
3.4.1.2 Implementation Challenges
Two architectures can be considered for implementing the RF algorithm: serial and
parallel. A serial architecture processes nodes sequentially, resulting in a
large delay, and requires reading two 11-b (for a 16 KB array) child node addresses per
node, which take up roughly half of the storage space. On the other hand, a fully parallel
architecture computes all q(m,n) in parallel and uses these to address a look-up table (LUT)
to obtain c(m,l). Doing so requires a large number of memory accesses, e.g., 78 bytes
per tree (Table 3.6), which in turn limits the achievable throughput and energy
efficiency. Additionally, a complex crossbar (i.e., 256:1 for a 16×16 image X) is needed to
route the pixel indexed by p(m,n) from X for comparison.
3.4.2 The Proposed RF Algorithm and Architecture
This section co-optimizes the algorithm and architecture to achieve energy and throughput
beneﬁts.
3.4.2.1 The Proposed RF Algorithm
The modified RF algorithm (Fig. 3.10(b)) employs a fixed-pattern deterministic subsampling
(DSS) step prior to RSS to solve the crossbar problem mentioned above. A 4:1 DSS factor
is chosen to balance the loss in classification accuracy against the reduction in crossbar complexity.
The complexity of the RSS crossbar is reduced from 256:1 to 64:1 when the input X is a
16×16 image. Thus, the precision of p(m,n) is also reduced from 8-b to 6-b. Additionally, the
decision trees are balanced (Fig. 3.10(b)) by filling some empty nodes in order to regularize
the memory access pattern. The memory access problem is addressed by reducing the
number of memory accesses via in-memory comparison (Fig. 3.11), which eliminates the need to
fetch τ(m,n). The class address (ADD) generator (CAG) generates the address of the chosen c(m,l) from
the q(m,n)s, eliminating the need to fetch all the c(m,l)s. Only 24.5 bytes of data need to be
fetched per tree, compared to 78 bytes/tree in the parallel architecture.
Figure 3.10: Random forest (RF) algorithm: (a) conventional, and (b) proposed with
deterministic subsampling (DSS).
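The two ideas in this subsection, the narrower pixel index after DSS and the generation of a leaf address from node decisions computed in parallel, can be sketched as follows. The every-fourth-pixel DSS pattern, the heap-style node ordering, and the tree depth of 5 (31 nodes and 32 leaves, consistent with the index ranges appearing in Fig. 3.12) are assumptions for illustration; the actual CAG logic is not specified here.

```python
import math

def index_bits(num_pixels, dss_factor=1):
    """Bits needed to index one pixel after subsampling by dss_factor."""
    return math.ceil(math.log2(num_pixels // dss_factor))

def dss(X, factor=4, offset=0):
    """Hypothetical fixed DSS pattern: keep every `factor`-th pixel in raster order."""
    return X[offset::factor]

def leaf_address(q, depth):
    """Leaf index selected by the node decisions q of a balanced tree.

    q is assumed to be stored in heap order: root at index 0, and the children
    of node n at 2n + 1 (q = 0) and 2n + 2 (q = 1). All entries of q are
    available at once, so only `depth` of them are actually consulted.
    """
    n = 0
    for _ in range(depth):
        n = 2 * n + 1 + q[n]
    return n - (2 ** depth - 1)   # leaves follow the 2**depth - 1 internal nodes
```

For a 16×16 image, `index_bits(256)` gives 8 and `index_bits(256, 4)` gives 6, matching the 8-b to 6-b reduction stated above.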
Figure 3.11: In-memory comparison: bitcell column for in-memory comparison of T and X,
and measured accuracy of comparison. [Plot axes: comparison error rate (%) vs. ∆VBL per LSB (mV);
markers indicate the minimum ∆VBL that achieves classification accuracy ≥ 93% with 64 trees and
with 4 trees. Comparator output: q = 1 if X > T (VBL < VBLB), 0 otherwise.]
3.4.2.2 Proposed Architecture and Operations
The proposed RF architecture (Fig. 3.12(a)) includes a 512×256 SRAM BCA, a multi-row
WL driver, a 64-b I/O with a 4:1 column mux, a DSS input buffer to store the streamed X, RSS
crossbars, a CAG, a label finder, a majority voter, and the peripherals for standard read/write
operations. A group of four trees is processed in parallel, and 16 such groups are processed
sequentially for a total of M = 64 trees. The classifier: (1) writes the pixel index
register, (2) enables the crossbar, (3) performs in-memory comparison enabled by the multi-row
WL driver and analog comparators, (4) sequentially fetches four tree-level labels using
the addresses generated by the CAG, and (5) takes the majority vote once the final tree has been processed.
Figure 3.12: Proposed RF: (a) architecture, and (b) timing diagram. [Legend for (a): IREG: pixel
index register; CB: crossbar; RSREG: RSS register; COMP: analog comparators. Steps shown in (b)
for each group of four trees: pixel index READ (12 reads, 3 per tree), crossbar enable, replica
cell write, in-memory comparison (1 MR-read), and label READ (2 reads); the majority vote follows
the last group.]
3.4.2.3 In-Memory Comparison
In-memory comparison requires the 8-b thresholds τ(m,n) (T in Fig. 3.11) and the indexed
pixels x(p(m,n)) (X in Fig. 3.11) to be stored in a column-major pattern, i.e., the bits of a word
are stored in a column. The comparison begins with the simultaneous application of WL
access pulses with binary-weighted pulse widths to all the rows storing T and X. Here, the
pulse width is proportional to the binary weight of the bit position. Doing so creates a bitline (BL) voltage swing
∆VBLB (∆VBL) proportional to T − X (X − T) [35, 39]. The linearity of this multi-row read
is improved by reading the 4-b MSBs and LSBs separately from adjacent columns, followed by
a capacitively weighted charge sharing that assigns 16× greater weight to the MSBs. The
WL voltage is reduced (e.g., to 0.65 V) to prevent a destructive read and to improve the linearity
further. Storing X in the replica bitcell array allows fast writing through a separate write
BL (WBL) and write wordline (WWL), eliminating the overhead of the slow write operation into
the normal BCA. The BLs feed into analog comparators to generate the node-level decisions (q).
In-memory comparison is an intrinsically and massively parallel operation, as it processes all
128 8-b words in parallel from 256 columns, whereas a conventional memory fetches only 64
bits (= 8 words) per read access when the sense amplifier is shared across four columns. In
addition, multi-row read saves energy by accessing 4 bits per precharge.
Figure 3.13: RF DIMA die micrograph. [Die dimensions shown: 1.2 mm × 1.2 mm.]
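An idealized behavioral sketch of the in-memory comparison is given below. It assumes perfectly linear binary-weighted discharge, an ideal 16:1 capacitive weighting of the MSB and LSB nibbles, and a single lumped Gaussian noise term; the names `nibble_swing`, `in_memory_compare`, `dv_lsb`, and `sigma` are hypothetical and do not come from the prototype.

```python
import random

def nibble_swing(value, dv_lsb):
    """Ideal swing of a 4-b multi-row read: bit k contributes 2**k units of dv_lsb."""
    return dv_lsb * sum(((value >> k) & 1) * (1 << k) for k in range(4))

def in_memory_compare(x, t, dv_lsb=5e-3, sigma=0.0, rng=random):
    """Return q = 1 if the 8-b pixel x exceeds the 8-b threshold t, else 0.

    The MSB and LSB nibbles are read separately and merged with 16:1 weights,
    so the merged differential signal is proportional to x - t.
    """
    msb = nibble_swing(x >> 4, dv_lsb) - nibble_swing(t >> 4, dv_lsb)
    lsb = nibble_swing(x & 0xF, dv_lsb) - nibble_swing(t & 0xF, dv_lsb)
    diff = (16 * msb + lsb) / 17       # charge sharing normalizes by the total weight
    diff += rng.gauss(0.0, sigma)      # assumed lumped noise/mismatch at the comparator
    return 1 if diff > 0 else 0
```

With `sigma = 0` the sketch reproduces an exact digital comparison; a larger `sigma` relative to `dv_lsb` produces comparison errors that grow as the swing per LSB shrinks, which is the trade-off characterized in Section 3.4.3.1.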
3.4.3 Chip Measured Results
The in-memory RF classifier is implemented in a 65 nm CMOS process (chip micrograph in
Fig. 3.13, summary in Table 3.7) to demonstrate the application-level benefits.
Table 3.7: RF prototype IC summary.
Figure 3.14: Energy vs. error rate w.r.t. ∆VBL with 64 trees (eight-class traffic sign
recognition). [Axes: classification error rate (1 − PDET) (%) and core energy per decision (nJ)
vs. ∆VBL per LSB (mV); curves: proposed and conventional energy and accuracy.]
3.4.3.1 Component-Level Accuracy Characterization
Measured in-memory comparison results (Fig. 3.11(b)) show the comparator error rate increasing
from 1.6% to 14.5% as ∆VBL is reduced from 25 mV to 5 mV. The RF algorithm with
64 trees needs an error rate of less than 9.5% at the comparator output q to avoid a discernible
loss in eight-class classification accuracy. Four trees tolerate only a 4% error rate, thereby restricting
further reduction in ∆VBL.
3.4.3.2 Application-Level Accuracy, Energy, and Throughput
Measured results (Fig. 3.14) of the energy vs. accuracy trade-off for eight-class traffic sign
recognition with 64 trees show that the proposed IC achieves 3.1× energy savings over the
conventional architecture (SRAM + digital processor). The energy of the conventional
architecture is obtained via post-layout simulations of the digital blocks and the read access energy
measured from the prototype IC. These energy savings come from the multi-row read, in-memory
comparison, and low-complexity crossbar. Fewer memory accesses also reduce the decision
delay by 2.2× over a conventional architecture, thereby providing a 6.8× lower energy-delay
product (EDP) at the same accuracy of > 93% as the conventional architecture. The prototype
IC achieves a throughput of 364 K decisions/s and an energy efficiency of 19.4 nJ/decision,
achieving at least 5.6× smaller EDP compared to prior multi-class classifier ICs [5, 64], as
listed in Table 3.8.
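The EDP figures quoted above follow from simple arithmetic, reproduced in the sketch below. Taking the per-decision delay as the reciprocal of the throughput is an assumption made here for illustration (it ignores any pipelining across decisions), and the function name `edp_gain` is hypothetical.

```python
def edp_gain(energy_gain, delay_gain):
    """EDP improvement is the product of the energy and delay improvements."""
    return energy_gain * delay_gain

throughput = 364e3               # decisions per second (reported)
energy_per_decision = 19.4e-9    # joules per decision (reported)

# Assumed: one decision completes every 1/throughput seconds.
delay_per_decision = 1.0 / throughput
edp = energy_per_decision * delay_per_decision   # in joule-seconds
```

The product of the 3.1× energy and 2.2× delay gains is 6.82, which is consistent with the reported 6.8× EDP reduction.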
Table 3.8: Application level gains of RF DIMA in energy efficiency, delay, accuracy, and
comparison with prior arts.
* memory (digital) energy and delay measured from prototype IC (post-layout simulations)
 assumes a 32 bank configuration
 single function with SRAM memory access cost not included
** single function with 1b weight vector
3.5 Conclusion
This chapter describes two DIMA prototype ICs: (1) multi-functional DIMA for SVM, TM,
k-NN, and MF, and (2) single-function DIMA for RF. Potentially, more algorithms can be
covered by simply modifying or adding functionality in each processing stage. Measurement
results of the multi-functional DIMA IC demonstrate up to 31× EDP reduction as compared to
the conventional digital architecture optimally designed for each algorithm. Furthermore, the
EDP benefit is expected to be even higher (up to 56×) in multi-bank scenarios by sharing
the controller overhead over many banks. The RF prototype IC also achieves 3.1×
energy savings and a 2.2× speed-up, thereby providing a 6.8× lower EDP at the same
accuracy of > 93% compared to the conventional digital architecture, with a throughput
of 364 K decisions/s and an energy efficiency of 19.4 nJ/decision for the eight-class traffic sign
recognition problem. A trade-off between energy and accuracy was also observed in
both ICs by controlling ∆VBL. This indicates that there is potential to push DIMA's
energy savings further through statistical error compensation techniques. Specifically, the ML
coefficients can be re-trained to compensate for the deterministic error patterns of non-ideal
analog circuitry. On the other hand, a DIMA with an on-chip trainer will be able to
compensate for time-dependent noise sources. In particular, the feasibility of the multi-functional
DIMA indicates the potential to realize a programmable DIMA instruction set architecture
(ISA), which will be explored in Chapter 5.
Chapter 4
MAPPING INFERENCE ALGORITHMS TO DIMA
In Chapter 3, it was shown that the multi-functional DIMA enables four algorithms with
similar functional flow. In this chapter, the DIMA is applied to two algorithms with more
complex functional flow: (1) the convolutional neural network (CNN) [36], a deep neural
network based on convolutional operations, and (2) the sparse distributed memory (SDM) [37, 38],
a computational model inspired by the human brain. The DIMA-based CNN demonstrates
that error-aware training can effectively compensate for the non-ideal behavior of DIMA. On
the other hand, the SDM generates the final decision via majority voting across decisions
from many weak classifiers. This ensemble nature makes the system more robust to
hardware noise, allowing further low-SNR processing to achieve aggressive energy savings.
In addition, the SDM algorithm is modified to maximize the benefit from the DIMA-based
architecture.
4.1 Convolutional Neural Network (CNN)1
The convolutional neural network (CNN) is one of the most widely used pattern recognition
algorithms due to its state-of-the-art performance in computer vision applications such as
handwriting recognition and face detection [65, 66]. However, the CNN requires complex
interconnect, massive inner product computations, and access to a large data volume. GPU-based
[65] and FPGA-based [66] implementations were proposed recently in order to speed up CNN
computation over a purely software implementation. It is well known that the energy and
throughput of general-purpose computing platforms such as GPUs and FPGAs are at least
one to two orders-of-magnitude worse than dedicated VLSI implementations [35].
1This section is adopted from M. Kang, S. Gonugondla, M.-S. Keel, and N. R. Shanbhag, "An energy-efficient
memory-based high-throughput VLSI architecture for Convolutional Networks," in 40th IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP). © 2015 IEEE
Figure 4.1: Convolutional network (CNN): (a) data flow, and (b) architecture.
This section presents a dedicated VLSI architecture based on DIMA, where computation
is embedded inside the memory array. Our DIMA-based CNN implementation [36] is shown
to provide a 24.5× reduced EDP as compared to the conventional system.
4.1.1 Background: CNN
A CNN is a multi-layer network (see Fig. 4.1(a)) consisting of interleaved convolutional layers
(C-layers) and sub-sampling layers (S-layers). The C-layer is computationally intensive and
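is described as follows:
\[
\begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix}
= \phi\left(
\begin{bmatrix} w_{11} & \cdots & w_{1M} \\ \vdots & \cdots & \vdots \\ w_{N1} & \cdots & w_{NM} \end{bmatrix}
*
\begin{bmatrix} x_1 \\ \vdots \\ x_M \end{bmatrix}
+
\begin{bmatrix} b_1 \\ \vdots \\ b_N \end{bmatrix}
\right)
\tag{4.1}
\]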
where xm (m = 1, ..., M) and yn (n = 1, ..., N) are the L × L input and (L − K + 1) ×
(L − K + 1) output feature maps, respectively, wmn is a K × K kernel, ∗ is the
convolution operator, and bn is a bias term. Here, φ is a non-linear, typically sigmoid,
activation function. The sub-sampling layer (S-layer) simply reduces the dimensions of the
input feature map x′n. As indicated in (4.1), large data volumes need to be processed by
the CNN. Hence, a memory-based architecture, as proposed in this section, can be highly
effective in implementing CNNs.
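A plain-Python sketch of the C-layer in (4.1) may help fix the data dimensions. It assumes the sliding-window (cross-correlation) form of the convolution without kernel flipping and a sigmoid for φ; the function names and the nested-list data layout are choices made here for illustration only.

```python
import math

def conv2d_valid(x, w):
    """Sliding-window sum of an L x L map x with a K x K kernel w (no padding).

    The output is (L - K + 1) x (L - K + 1). The kernel is not flipped,
    which is an assumption about the convention behind the * operator.
    """
    L, K = len(x), len(w)
    out = L - K + 1
    return [[sum(w[a][b] * x[i + a][j + b] for a in range(K) for b in range(K))
             for j in range(out)] for i in range(out)]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def c_layer(xs, ws, bs):
    """xs: M input maps; ws[n][m]: kernel from input m to output n; bs: N biases."""
    ys = []
    for w_row, b in zip(ws, bs):
        maps = [conv2d_valid(x, w) for x, w in zip(xs, w_row)]
        size = len(maps[0])
        # Sum the M partial maps, add the bias, and apply the activation.
        ys.append([[sigmoid(b + sum(mp[i][j] for mp in maps))
                    for j in range(size)] for i in range(size)])
    return ys
```

Every output pixel consumes K² weights per input map, which is why the weight traffic, rather than the arithmetic, dominates in a conventional memory-plus-processor organization.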
Figure 4.1(b) shows the block diagram of a conventional CNN system [66], where a conven-
tional SRAM stores the weights wmn, and the input feature map xm is stored in a register
bank. The register contents are updated with the output feature map yn at the completion
of one layer.
The energy consumption to process a single feature map in a conventional system can be
expressed as
\[ E_{conv} = K^2 E_{read} + E_{leak} + (L-K+1)^2 K^2 E_{MAC} + E_{reg} \tag{4.2} \]
where Eread and Eleak = PleakTconv represent the single word SRAM read energy and the
SRAM leakage energy per feature map computation, respectively. Here, Pleak is the leakage
power consumption and Tconv is the time needed to generate a feature map. It is assumed
that a deep-sleep mode is enabled during standby using techniques such as power gating
or lowering the supply voltage for the BCA [49]. EMAC and Ereg are the multiply-and-accumulate
(MAC) and register bank energies, respectively.
4.1.2 Proposed DIMA-Based CNN System
4.1.2.1 The DIMA-Based CNN Architecture
Figure 4.2 shows the DIMA-based CNN architecture, where Xc is the number of columns in the
SRAM array. The K² coefficients of wmn are stored in a block of Bw × K² bitcells, where Bw
is the bit precision of wmn, and each Bw-bit word is stored in one column. In addition, the wmn
with the same value of n are horizontally aligned in a single row occupying NK² columns.
If NK² > Xc, the wmn are stored in ⌈NK²/Xc⌉ rows. The wmns required to compute a
single pixel of yn(x, y) are read via FR and multiplied with xm provided in the digital domain from
the feature map registers.
In the following, we employ D and P to represent specific values of wmn and xm,
respectively, in order to simplify the exposition. In the DIMA, negative values are difficult to
represent in the analog domain. A 1's complement representation is employed for D and an
unsigned representation for P. Thus, |D| × P and SD = sign(D) are computed separately.
Here, |D| is computed as follows:
\[
|D| =
\begin{cases}
\sum_{k=0}^{B_D-1} 2^k d_k \propto \Delta V_{BLB}(D), & \text{if } D \ge 0 \\
\sum_{k=0}^{B_D-1} 2^k d_k \propto \Delta V_{BL}(D), & \text{if } D < 0
\end{cases}
\tag{4.3}
\]
The SD is obtained by using a differential amplifier with ∆VBL(D) and ∆VBLB(D) as its
inputs; it is then used as the select signal of the multiplexer to select the greater of VBL(D)
and VBLB(D), thereby generating |D| as the output Vmux, as shown in Fig. 4.2.
Figure 4.2: DIMA-based architecture for CNN.
Next, the outputs of the multipliers are transferred to a positive or negative rail via
demultiplexers based on SD. The rails are shared by multiple columns so that the absolute
values of the positive and negative products in (4.3) are added separately via charge sharing
on each rail. Finally, the value on each rail is converted into a digital number by one of
two analog-to-digital converters (ADCs), whose outputs are subtracted to generate the
convolution sum. These steps are repeated ⌈NK²/Xc⌉ times if NK² > Xc to fetch all the
required wmns. Then, the sequentially generated outputs of the subtractor and the bn are
accumulated. The activation function is implemented with three additions and two shifts in
the digital domain as introduced in [66].
Figure 4.3: Capacitive multiplier behavioral model (4.4) validation with circuit simulation
in 45 nm (C = 10 fF). [Axes: VPRE − Vmult (V) vs. ∆VBLB(D) (V); ideal and simulated values
for P from 0 to 250.]
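The sign-magnitude processing described above can be summarized by an idealized functional sketch: the magnitude products are steered to one of two accumulators according to the sign of the weight, and the two totals are subtracted at the end. The sketch ignores ADC quantization and the scaling introduced by charge sharing, and the function name is hypothetical.

```python
def dual_rail_inner_product(D, P):
    """Ideal signed inner product computed with separate positive and negative rails.

    D : signed weights (sign handled by S_D, magnitude by |D|)
    P : unsigned inputs
    """
    pos_rail = 0
    neg_rail = 0
    for d, p in zip(D, P):
        magnitude = abs(d) * p      # |D| x P from the analog multiplier
        if d >= 0:                  # S_D selects the rail via the demultiplexer
            pos_rail += magnitude
        else:
            neg_rail += magnitude
    return pos_rail - neg_rail      # digital subtraction after the two ADCs
```

Keeping both rails non-negative is what allows the accumulation to be done by charge sharing, which cannot represent a negative quantity directly.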
4.1.2.2 Inner Products via BLP
The product of ∆VBLB(D) (or ∆VBL(D)) from the multiplexer and a BP-bit digital value P
is obtained via the capacitive multiplier shown in Fig. 2.11(a). The multiplier output ∆Vm
from VPRE is
\[ \Delta V_m = (0.5)^{B_P} P \, \Delta V_{BLB}(D) = \alpha P D \tag{4.4} \]
where α is a constant depending on TLSB, CBL, and RBL.
The voltage level \( V_{PRE} - \alpha \sum_{j=0}^{K^2-1} D_j P_j \), corresponding to the inner product between
\( \vec{D} = [D_0, D_1, \ldots, D_{K^2-1}] \) and \( \vec{P} = [P_0, P_1, \ldots, P_{K^2-1}] \), can be achieved by
charge-sharing the multipliers' outputs in K² columns.
Table 4.1: Design and model parameters of DIMA.
Parameter      Values     Parameter   Values
VDD            1.1 V      f           1 GHz
input L        32         K           5
Bw             8          Bx          6
N              C1: 6, C3: 16, F5: 120, F6: 10
f0, ..., f3    1, 1.11 × 10^-2, −5.4684 × 10^-4, 4.0506 × 10^-6
c0, ..., c4    −9.5 × 10^-3, 3.2 × 10^-2, 3.5 × 10^-4, −1.7 × 10^-5, 1.3 × 10^-7
4.1.3 Energy and Behavioral Models with Circuit Non-Idealities
The analog-intensive DIMA operation is subject to a number of circuit-level non-idealities.
Dominant among these are: (a) the non-linearity of the multi-row READ process, which is caused
by the voltage-dependent discharge path resistance R, (b) local transistor threshold voltage (Vt)
mismatch across bitcells caused by random dopant fluctuations, and (c) the non-ideality of analog
multiplication.
The non-linearity of FR was previously modeled in [35] by a polynomial fit as
\[ \Delta V'_{BLB}(D) = \sum_{k=0}^{4} c_k D^k, \]
where ∆V′BLB(D) is a distorted version of ∆VBLB(D), and the ck are the fitting
parameters. In this section, we model the Vt-mismatch and the non-ideality of the analog
multiplier.
The impact of Vt-mismatch is modeled as a Gaussian distributed random variable as shown
below:
\[ \Delta \hat{V}_{BLB}(D) \sim \mathcal{N}\left(\Delta V'_{BLB}(D), \sigma_D^2\right) \tag{4.5} \]
where σ_D² is the variance of ∆VBLB due to Vt-mismatch across bitcells corresponding to the
stored value D.
The behavior of the multiplier can be captured by a polynomial model with fitting parameters
f0, f1, f2, f3 as follows:
\[ \Delta V_m = f_0 \Delta V_{BLB}(D) P + f_1 \Delta V_{BLB}(D) + f_2 P + f_3 \tag{4.6} \]
These models are employed in Section 4.1.4 to study the impact of circuit non-idealities
on application-level behavior.
Figure 4.4: Estimated relative delays for CNN. [Axes: relative delay per layer (C1, C3, F5, F6,
total) for the conventional and DIMA systems, with cumulative delays; annotated total
reduction: ×4.9.]
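The three behavioral models can be chained in a few lines of Python, as sketched below. The coefficient lists are copied from Table 4.1, but the units and normalization of D and P in those fits are not stated in this section, so the numerical outputs should be read as illustrative only; the function names and the `rel_sigma` parameter (the normalized standard deviation σD/µD) are assumptions.

```python
import random

C_FIT = [-9.5e-3, 3.2e-2, 3.5e-4, -1.7e-5, 1.3e-7]   # c0..c4 from Table 4.1
F_FIT = [1, 1.11e-2, -5.4684e-4, 4.0506e-6]          # multiplier fit from Table 4.1

def distorted_swing(D, c=C_FIT):
    """Polynomial model of the non-linear functional read: sum_k c_k * D**k."""
    return sum(ck * D ** k for k, ck in enumerate(c))

def noisy_swing(D, rel_sigma, rng):
    """Model (4.5): Gaussian spread around the distorted swing due to Vt mismatch."""
    mean = distorted_swing(D)
    return rng.gauss(mean, rel_sigma * abs(mean))

def multiplier_out(dv, P, f=F_FIT):
    """Model (4.6): bilinear fit of the capacitive multiplier output."""
    return f[0] * dv * P + f[1] * dv + f[2] * P + f[3]
```

A system-level simulation would draw `noisy_swing` once per stored word and pass the result through `multiplier_out` before accumulating, which is one way to inject these non-idealities into the recognition-accuracy experiments.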
The energy consumption of DIMA to process a feature map is given by
\[ E_{DIMA} = (L-K+1)^2 K^2 E_{FR} + E_{leak\_DIMA} + (L-K+1)^2 K^2 E_{MAC\_add} + E_{reg} \tag{4.7} \]
where EFR is the energy consumed to read a single word by the FR. The scaling factor of
the first term in (4.7) is larger than that of (4.2). This is because the DIMA reads the
wmns from the SRAM again whenever the processing window slides, as the analog level from
FR cannot be sustained. The leakage energy is Eleak_DIMA = Pleak TDIMA, where TDIMA is the
processing time of a feature map by the DIMA, which is smaller than Tconv. Thus, Eleak_DIMA
is also smaller than Eleak. EMAC_add is the energy consumed by the analog multiplier
and the charge-sharing-based adder. It is also smaller than EMAC due to operation
with a small voltage swing.
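The two energy models (4.2) and (4.7) can be written as small functions to make the difference in read counts explicit. The per-operation energies are left as arguments because their values are not given in this section; the function names are hypothetical.

```python
def e_conv(L, K, e_read, e_leak, e_mac, e_reg):
    """Conventional energy per feature map, following (4.2)."""
    return K**2 * e_read + e_leak + (L - K + 1)**2 * K**2 * e_mac + e_reg

def e_dima(L, K, e_fr, e_leak_dima, e_mac_add, e_reg):
    """DIMA energy per feature map, following (4.7)."""
    n_ops = (L - K + 1)**2 * K**2   # one FR and one analog MAC per weight per output pixel
    return n_ops * e_fr + e_leak_dima + n_ops * e_mac_add + e_reg
```

With L = 32 and K = 5 from Table 4.1, the conventional system performs 25 word reads per feature map while the DIMA performs 19,600 functional reads, so the DIMA's advantage has to come from a much lower energy per read and per MAC and from reduced leakage.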
4.1.4 Simulation Results
In this section, handwritten digit recognition with the MNIST database [61] is chosen as
the application to measure the performance of the DIMA system. All the design and model
parameters are summarized in Table 4.1. A variant of LeNet5 [65] with
a total of six layers is employed.
Figure 4.5: Estimated relative energy consumptions for CNN. [Axes: relative energy of the
conventional and DIMA-based CNN, broken down into computation, register, SRAM leakage, and
SRAM READ; annotated reduction: ×5.0.]
Four horizontally aligned banks of the SRAM array, each with 512×256 bitcells, are
employed for the DIMA to store the trained kernels wmn. Thus, roughly 40 wmns can be aligned
in one row and processed at a time (Xc/K² = (256 × 4)/25 ≈ 40). The embedded SRAM's I/O is
32 bits in the conventional system. The number of multipliers is K² = 25 in the conventional
system to achieve an area comparable to the DIMA.
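The mapping arithmetic used here and in Section 4.1.2.1 can be checked with the short sketch below. The function names are hypothetical, and the row count simply evaluates the ⌈NK²/Xc⌉ expression from the text for the layer sizes N listed in Table 4.1.

```python
def kernels_per_row(Xc, K):
    """Number of complete K x K kernels that fit across Xc columns."""
    return Xc // (K * K)

def rows_needed(N, K, Xc):
    """ceil(N * K^2 / Xc), computed with integer arithmetic."""
    return -(-(N * K * K) // Xc)
```

For Xc = 1024 and K = 5, `kernels_per_row` returns 40, matching the estimate above, and `rows_needed` returns 1 for N = 6 and N = 16 but 3 for N = 120.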
4.1.4.1 Model Validation
HSPICE simulations are performed in a 45 nm SOI process technology to obtain the behavioral
models. The normalized standard deviation (σD/µD) of ∆VBLB in the model (4.5) is
measured by Monte Carlo HSPICE simulations. The minimum value of 7% is obtained with
D = 15, and the maximum value of 12.5% is obtained with D = 1. The simulated and modeled
behavior of the capacitive multiplier with C = 10 fF is shown in Fig. 4.3, with the fitting
parameters f0, f1, f2, f3 given in Table 4.1.
4.1.4.2 Recognition Accuracy
The wmn are obtained from 60,000 training images of the MNIST dataset [61] with the back-propagation
algorithm through roughly 80 iterations. The error rates are measured on the
MNIST test data set by system simulations with the behavioral models in the following
configurations: (1) conventional system with floating-point numbers, (2) conventional system with
fixed-point numbers (Bw and Bx), (3) DIMA system with fixed-point numbers, and (4) DIMA system with
fixed-point numbers and wmn trained to reflect the non-linearity of FR. Error rates of 0.8% and 0.85% are achieved
in the first and second configurations, respectively. In the third configuration, the error
rate degrades to 1.36% due to circuit non-idealities. However, the error rate is 0.87%
in the fourth configuration. This indicates that the non-idealities of DIMA are effectively
compensated by the inherent error resiliency of the CNN.
4.1.4.3 Energy and Delay Savings
The SRAM access and multiplication of the conventional system require two clock cycles and one
clock cycle, respectively, and these can be pipelined. On the other hand, the FR and BLP of
DIMA require a total of 20 cycles, during which the DIMA reads and multiplies Xc = 1024 words
in parallel. Based on these specifications, the delay of each
convolutional and fully connected layer and the cumulative delays are estimated in Fig. 4.4.
The DIMA achieves roughly 4.9× lower total delay, achieving higher throughput in the
memory-intensive F5 layer due to the parallel read and computations.
Based on the energy consumption to process a feature map modeled in (4.2) and (4.7),
the energy consumption to process all the feature maps of the layers is estimated in Fig. 4.5.
About 5.0× energy savings in the overall system are achieved, mostly due to the low-power inner
product computation and the reduced leakage energy resulting from the high throughput.
In conclusion, a 24.5× smaller EDP is achieved by the DIMA as compared to the conventional
system, with a 0.02% larger error rate.
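The headline numbers follow directly from the delay and energy gains reported above. In the worked equation below, the subscripted symbols are introduced only for illustration, and the 0.02% figure is read here as the difference between the fourth and second configurations of Section 4.1.4.2, which is an interpretation rather than a statement from the text.

```latex
\[
\frac{\mathrm{EDP}_{\mathrm{conv}}}{\mathrm{EDP}_{\mathrm{DIMA}}}
= \frac{E_{\mathrm{conv}}}{E_{\mathrm{DIMA}}} \cdot \frac{T_{\mathrm{conv}}}{T_{\mathrm{DIMA}}}
\approx 5.0 \times 4.9 = 24.5,
\qquad
0.87\% - 0.85\% = 0.02\%
\]
```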
4.2 Sparse Distributed Memory (SDM)2
There is much interest in exploring brain-inspired models of computation that can provide
robust system behavior for inference applications while achieving high energy efficiency
[67-70]. The Sparse Distributed Memory (SDM) [71] (see Fig. 4.6) is one such computational
model of the human brain. An SDM can be trained to remember sparse data vectors and
retrieve these when presented with noisy or incomplete versions of the stored vectors. This
is similar to the human brain's ability to associate related memories given noisy sensory input by
conceptualizing/categorizing incomplete information [72].
2This section is adopted from M. Kang and N. R. Shanbhag, "In-memory computing architectures for
sparse distributed memory," IEEE Transactions on Biomedical Circuits and Systems. © 2016 IEEE
Being a memory array, the SDM takes as input a 2-tuple (p, d), where p and d are the J-bit
address and K-bit data, respectively (we assume J = K in the rest of this section). In an
SDM, data vectors d are first stored (WRITE operation). The address decoder (AD) projects
the J-bit address vector p onto a higher I-dimensional (I ≫ J) space, and then uses this
high-dimensional representation s of p as the decoded address into the counter array (CA),
where d is stored in a distributed fashion. In the READ mode, the address p is first decoded
by the AD, and the decoded address s is used to retrieve the stored data from the CA. The
sparse and distributed nature of data processed and stored in an SDM provides inherent
robustness to noise or imprecision in the input data. The SDM can also be employed in an
auto- or hetero-associative mode to achieve even greater robustness to data errors.
However, a straightforward SDM implementation will consume much energy and will be
slow because the SDM operates in a high (hyper)-dimensional space [71], e.g., typical SDM
parameters are: I = 2 × 10³ to 10⁶, J ≥ 256, Bc ≥ 5, where Bc is the bit precision of each
counter in the CA [72, 73]. Such an implementation in a 65 nm CMOS process would consume
77 µJ and have a delay of 2 ms per READ. In fact, the dominant (about 80%, as shown in
Section 4.2.4) source of energy consumption and delay in the SDM can be attributed to the
AD. Hence, several high-throughput architectures for the AD based on SRAM and DRAM
have been proposed. These achieve speed-up by parallelizing the AD using multiple memory
blocks [74]. However, these architectures suffer from an inter-block throughput bottleneck.
To remove the memory read operation, a shift register-based AD architecture [75] has also been
proposed. However, this architecture suffers from large dynamic energy consumption and
occupies a large area compared to memory-based architectures. Mixed-signal AD
implementations [75, 76] employ a current mirror to evaluate the Hamming distances in parallel,
thereby achieving high throughput. However, the large content addressable memory bitcell
dimension (i.e., 11 transistors including the current mirror) results in a loss of storage
density, and the bias currents result in high DC power consumption. The SDM has
also been implemented using resistive memory devices [73].
Figure 4.6: Sparse distributed memory (SDM): (a) architecture (SH: number of selected
rows), and (b) example of SDM operation with address decoder (AD) and counter array
(CA) (I = 8, J = K = 7, and SH = 3).
Implementing the SDM model requires large storage capacity closely integrated with com-
putation. Traditional processor-memory architectures separate low-swing memory storage
functionality from high-swing logic. This separation exists even in the so-called processor-in-
memory architecture [23,25], and is the source of both a throughput bottleneck and energy
consumption. In fact, conventional architectures fail to exploit an important feature of the
SDM [72] - the ability to compensate for hardware noise/errors in addition to noise/er-
rors in the input data. The DIMA preserves the storage density, the conventional SRAM's
read/write functionality, and is well-suited for inference kernels such as SDM which can
compensate for non-deterministic hardware operations.
In this section, we describe an architecture and circuit implementation of a DIMA-based
SDM (DIMA-SDM) [37, 38], which incorporates two proposed techniques: (1) a DIMA-based
AD (DIMA-AD), and (2) a CA with a hierarchical binary decision (CA-HBD). Circuit and
system simulations in a 65 nm CMOS process show that the DIMA-SDM reduces energy and
delay simultaneously by factors of up to 25× and 12×, respectively, over the conventional
SDM architecture in the auto- and hetero-associative modes with negligible loss in accuracy.
The rest of the section is organized as follows. Section 4.2.1 provides the necessary
background on SDM and associative memory. Section 4.2.2 describes the proposed DIMA-AD
and CA-HBD architectures. Section 4.2.3 develops circuit-aware energy, delay, and behavioral
models for the entire signal processing chain. Section 4.2.4 presents circuit and system
simulation results demonstrating the performance, delay reduction, and energy savings of
the architecture over the conventional SDM.
4.2.1 Background
This section introduces the necessary background on the topics of SDM [71].
Figure 4.7: The conventional SDM: (a) an M-parallel block architecture, and (b)
architecture of a single block.
4.2.1.1 SDM Operations
Figure 4.6 shows that the SDM accepts as input a 2-tuple (p, d) with a J-bit address p and
J-bit data d. The SDM architecture (see Fig. 4.6(a)) includes: (1) a J-bit address decoder
AD to evaluate the Hamming distance between p and the I J-bit addresses stored in an I × J
memory array A in the AD, and (2) a counter array CA with a counter and a memory array
C to store the I × J Bc-bit counts.
WRITE Operation: During the WRITE operation, the AD generates an I-bit decoded
row address s = [s1,s2, . . . , sI ] (see Fig. 4.6), as follows:
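$$ s_i = \mathrm{sgn}\Big\{ R - \sum_{j=1}^{J} (a_{ij} \oplus p_j) \Big\}, \quad (i = 1, 2, \ldots, I) \tag{4.8} $$
$$ \mathrm{sgn}(x) = \begin{cases} 1, & \text{if } x \ge 0 \\ 0, & \text{otherwise} \end{cases} $$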
where ⊕ is the binary EXOR operator, ai = [ai1, ai2, . . . , aiJ ] is the i-th address stored in
A, p = [p1, p2, . . . , pJ ] is the input address, and R is a user-deﬁned radius/threshold.
The decoded row address s and the data d are employed in the CA to update the count,
as follows:
$$ c_{ij} \leftarrow \begin{cases} c_{ij} + s_i, & \text{if } d_j = 1 \\ c_{ij} - s_i, & \text{otherwise} \end{cases} \tag{4.9} $$
where d = [d1, d2, . . . , dJ ]. Note that the contents of C are updated only when si = 1, i.e.,
only the selected rows are updated.
READ Operation: During the READ operation, AD generates the row address s in the
same manner as in the WRITE operation described by (4.8). Then, the SDM output is read
out as:
yj = sgn(s · cj), (j = 1, 2, . . . , J) (4.10)
where cj is the j-th column vector of C, and y = [y1, y2, . . . , yJ ] is the output word.
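The WRITE and READ operations in (4.8)-(4.10) can be illustrated with a minimal Python sketch. The class name `SDM`, the random initialization of the address array A, and the use of unbounded integer counters (rather than saturating Bc-bit counters) are assumptions made for illustration and are not taken from the dissertation.

```python
import random


class SDM:
    """Toy sparse distributed memory following (4.8)-(4.10) (illustrative only)."""

    def __init__(self, num_rows, word_bits, radius, seed=0):
        rng = random.Random(seed)
        self.radius = radius
        # A: I x J array of stored addresses (assumption: drawn at random).
        self.A = [[rng.randint(0, 1) for _ in range(word_bits)]
                  for _ in range(num_rows)]
        # C: I x J array of counters (assumption: no Bc-bit saturation).
        self.C = [[0] * word_bits for _ in range(num_rows)]

    def decode(self, p):
        """(4.8): s_i = 1 when the Hamming distance between a_i and p is <= R."""
        return [1 if sum(a ^ b for a, b in zip(row, p)) <= self.radius else 0
                for row in self.A]

    def write(self, p, d):
        """(4.9): increment or decrement the counters of the selected rows."""
        for s_i, counters in zip(self.decode(p), self.C):
            if s_i:
                for j, d_j in enumerate(d):
                    counters[j] += 1 if d_j else -1

    def read(self, p):
        """(4.10): threshold the column sums taken over the selected rows."""
        s = self.decode(p)
        num_cols = len(self.C[0])
        sums = [sum(s_i * row[j] for s_i, row in zip(s, self.C))
                for j in range(num_cols)]
        return [1 if v >= 0 else 0 for v in sums]
```

Because only rows with si = 1 are touched, a word written at address p is spread over every stored address within radius R of p, which is the source of the robustness discussed in this section.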
4.2.1.2 Associative Memory
In associative memories, the data read is the stored data that is most strongly associated with
the contents of the input rather than a speciﬁc address. There are two types of associative
memories: (1) auto-associative memory, and (2) hetero-associative memory. The SDM can
operate in both modes of associative recall and when it does, the SDM exhibits even stronger
robustness to noise/errors in data.
In the auto-associative mode, the SDM is trained by selecting its input 2-tuple (p,d)
from the training set St = {(t1, t1), (t2, t2), . . .}, i.e., both the address p and the data d are
assigned the same value [77]. On the other hand, in the hetero-associative mode, the SDM is
trained by selecting its input 2-tuple (p,d) from the training set St = {(t11, t12), (t21, t22), . . .},
i.e., p and d are assigned different values.
During the classiﬁcation/decision-making phase, in both associative modes, the SDM is
operated in an iterative manner where initially p = l, where l is a noisy/incomplete version
of the stored data. Then, in subsequent iterations, p is set to the current output of the
SDM. Thus, the classification phase of the SDM is described as follows:
$$ \mathbf{p}[n] = \begin{cases} \mathbf{l}, & \text{if } n = 1 \\ \mathbf{y}[n-1], & \text{if } n > 1 \end{cases} \tag{4.11} $$
where n is the time index, and l is an initial input. The output y[n] converges to an
error-free/closest version of l that was stored during the training phase. The SDM's auto- and
hetero-associative modes can be interpreted, respectively, as the human brain's ability to
extract a pattern from noise and to locate the next pattern in a certain sequence given the
current pattern [72].
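The feedback loop in (4.11) can be sketched as follows. The function `iterative_recall` and the stand-in memory `toy_read` are hypothetical names introduced for illustration; `toy_read` is not an SDM, it merely corrects one wrong bit per call so that the convergence behavior of the loop is easy to follow.

```python
STORED = [1, 0, 1, 1]  # hypothetical stored pattern used by the toy memory


def toy_read(p):
    """Stand-in for an SDM READ: fixes the first bit that differs from STORED."""
    y = list(p)
    for j, (bit, target) in enumerate(zip(p, STORED)):
        if bit != target:
            y[j] = target
            break
    return y


def iterative_recall(read, noisy_input, num_iterations):
    """(4.11): p[1] = l, and p[n] = y[n-1] for n > 1. Returns [y[1], ..., y[n]]."""
    p = list(noisy_input)
    outputs = []
    for _ in range(num_iterations):
        y = read(p)
        outputs.append(y)
        p = y  # feed the output back as the next address
    return outputs


history = iterative_recall(toy_read, [0, 1, 1, 1], num_iterations=3)
```

In the auto-associative mode, any callable that maps an address to a read-out word (for example, the READ operation of an SDM model) could be passed in place of `toy_read`.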
4.2.1.3 Conventional SDM Architecture
The conventional SDM architecture is multi-block [74] (see Fig. 4.7(a)) in order to enhance
throughput. The multi-block SDM architecture comprises M blocks (B1, B2,. . .,BM) that
operate in parallel, where each block has its own address decoder ADm with memory array
Am of size (I/M)×J bits, a counter array CAm with memory array Cm of size (I/M)×(JBc)
bits, and decoded row address sm (m = 1, . . . ,M). The memory array A in AD is
implemented via SRAMs for high throughput, while the memory array C in CA is implemented
using DRAM, Flash, or PRAM in order to achieve high storage densities.
The architecture of each block (see Fig. 4.7(b)) indicates that ADm computes the
Hamming distance between the input address p and the (I/M) stored addresses in Am, while CAm
generates a partial sum, which is then accumulated and thresholded during the READ
operation. Each partial sum requires Bx bits in addition to the Bc bits of a single
counter in order to prevent overflow, where Bx depends upon the sparsity of the stored data and
R. Thus, the multi-block SDM architecture requires J(Bc + Bx) global bitlines (GBLs) per
block to transfer partial sums to the decision block.
The conventional architecture in Fig. 4.7(a) has a number of drawbacks. Key among these
are the following:
1. The Hamming distance computation in the ADm requires access to all the memory
locations. The throughput of ADm is limited by the SRAM read out bandwidth. In
particular, multiple read-out cycles are required to read a single ai in conventional
memory and processor architectures.
This is because conventional SRAMs need to employ column multiplexing, whereby
multiple bitlines (BLs) share a single sense ampliﬁer (SA). Reliability constraints force
the SA and other peripheral circuits to be designed with area that is 4×-to-8× of that
of a bitcell, thereby necessitating column multiplexing. Additionally, there is another
throughput bottleneck due to limited memory I/O port or bus width in von Neumann
architectures [21]. Typically, J/BIO ≥ 4 read-outs are required to read the entire data
in a single row, even in application processors with custom-designed on-chip SRAM [49],
where BIO is the bit width of the SRAM I/O port or the bus width.
2. The additional digital blocks in the AD, such as the adder tree and EXOR gates, lead
to energy consumption and area overhead.
3. Routing the GBLs in the CAm is made difficult because of their large number (J(Bc +
Bx) per block), and because of the small bitcell area (4F^2, where F is half of the BL pitch)
due to the use of high-density memories [78,79].
Section 4.2.2 describes how these drawbacks of the conventional architecture can be over-
come.
4.2.2 Proposed Architecture
In this section, a DIMA-based SDM (DIMA-SDM) is proposed (see Fig. 4.8(a)) to address
the drawbacks of the conventional architecture listed in Section 4.2.1.3. In particular,
DIMA-SDM employs the following key techniques:
• The AD is designed using DIMA (DIMA-AD) (see Fig. 4.8(b)) in order to overcome
its bandwidth limitation and eliminate the use of digital logic.
• The CA is implemented using a hierarchical binary decision (HBD) technique
(CA-HBD), as shown in Fig. 4.8(c), in order to minimize the routing overhead of GBLs.
Figure 4.8: Proposed SDM architecture (DIMA-SDM): (a) architecture of single block, (b)
AD with DIMA (DIMA-AD) including deeply embedded mixed signal processing units,
and (c) CA with hierarchical binary decision (CA-HBD) (NGBL: number of GBLs).
4.2.2.1 DIMA-Based Address Decoder (DIMA-AD)
The proposed DIMA-AD generates the Hamming distance per (4.8) via a three-step process:
(1) the FR process generates BL voltages VBL and VBLB whose drops from VPRE are
proportional to the sums $\bar{a}_{ij} + \bar{p}_j$ and $a_{ij} + p_j$, respectively, over the
field of real numbers, followed by (2) the use of BLP to compute aij ⊕ pj, and (3)
finally the computation of the Hamming distance (via a capacitive adder). These steps are
described next.
The FR step begins with the application of access pulses simultaneously to the rows storing
aij and pj such that the pulse width T ≪ RBLCBL, where RBLCBL is the RC time constant
of BL/BLB [35]. This results in BL/BLB voltages (see Fig. 4.9) given by:
$$ V_{BL} = V_{PRE} - (\bar{a}_{ij} + \bar{p}_j)\,\Delta V_{BL} \tag{4.12} $$
$$ V_{BLB} = V_{PRE} - (a_{ij} + p_j)\,\Delta V_{BL} \tag{4.13} $$
where ∆VBL = VPRE(T/RBLCBL) and the overbar denotes the bit complement. A replica bitcell is
employed to avoid writing p into the main array A (see Fig. 4.8(b)).
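The second step (BLP) begins with the BL/BLB voltages provided as inputs to differential
comparators [80], sized to fit within a single bitcell pitch, with an appropriately selected
reference voltage Vref = VPRE − ∆VBL/2. Doing so results in binary-valued comparator outputs:
$$ X_{BL} = a_{ij}\,p_j = \mathrm{sgn}(V_{diff,BL}) = \mathrm{sgn}\{0.5 - (\bar{a}_{ij} + \bar{p}_j)\} \tag{4.14} $$
$$ X_{BLB} = \overline{a_{ij} + p_j} = \mathrm{sgn}(V_{diff,BLB}) = \mathrm{sgn}\{0.5 - (a_{ij} + p_j)\} $$
A NOR2 gate that combines the comparator outputs generates aij ⊕ pj as follows:
$$ a_{ij} \oplus p_j = \overline{\mathrm{sgn}\{0.5 - (\bar{a}_{ij} + \bar{p}_j)\} + \mathrm{sgn}\{0.5 - (a_{ij} + p_j)\}} \tag{4.15} $$
where the sums inside sgn{·} are over the real numbers, and the outer overbar and + denote
the logical NOR of the two comparator outputs.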
Next, a J-bit capacitive adder (see Fig. 4.8(a)) accepts aij⊕pj from the NOR2 gate output
and employs charge redistribution to compute the summation in (4.8) as follows:
$$ V_{sum,i} = \frac{1}{J} \sum_{j=1}^{J} \big(1 - a_{ij} \oplus p_j\big)\, V_{PRE} \tag{4.16} $$
Figure 4.9: Multi-row functional read (FR) for EXOR operation in DIMA-AD when
aij = 0.
The last step involves an analog comparator that generates the decoded address bit si as
shown below:
si = sgn(Vsum,i − VR) (4.17)
This sequence of operations is repeated I/M times.
Thus, DIMA-AD reads ai in a single read cycle (single precharge) and has ≈ J/BIO times
higher throughput as compared to the conventional AD. Additionally, DIMA-AD is more
energy-eﬃcient than a digital implementation because the capacitive adder employs small
capacitances (i.e., C = 10 fF) and requires a simple switching operation.
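The three-step DIMA-AD flow in (4.12)-(4.17) can be traced with an idealized, noise-free Python sketch. The function name `dima_ad_row` and the default voltages are illustrative. The reference VR = VPRE(1 − (R + 0.5)/J) is an assumption chosen here so that (4.17) reproduces (4.8) with some margin; the text does not specify how VR is set.

```python
def dima_ad_row(a_row, p, radius, v_pre=1.0, dv_bl=0.125):
    """Ideal DIMA-AD for one stored address a_i: returns (xor_bits, v_sum, s_i)."""
    v_ref = v_pre - dv_bl / 2
    xor_bits = []
    for a, b in zip(a_row, p):
        # Step 1 (FR): BL/BLB drops are proportional to the real-valued sums.
        v_bl = v_pre - ((1 - a) + (1 - b)) * dv_bl   # (4.12)
        v_blb = v_pre - (a + b) * dv_bl              # (4.13)
        # Step 2 (BLP): two comparators followed by a NOR2 gate.
        x_bl = 1 if v_bl > v_ref else 0              # equals a AND b   (4.14)
        x_blb = 1 if v_blb > v_ref else 0            # equals NOR(a, b) (4.14)
        xor_bits.append(1 - (x_bl | x_blb))          # (4.15)
    # Step 3: capacitive adder via charge redistribution.
    num_bits = len(p)
    v_sum = v_pre * sum(1 - x for x in xor_bits) / num_bits   # (4.16)
    # Assumed mapping from the radius R to the analog threshold VR.
    v_r = v_pre * (1 - (radius + 0.5) / num_bits)
    s_i = 1 if v_sum - v_r >= 0 else 0                        # (4.17)
    return xor_bits, v_sum, s_i


bits, v_sum, s_i = dima_ad_row([1, 0, 1, 1, 0, 0, 1, 0],
                               [1, 1, 1, 0, 0, 0, 1, 0], radius=2)
```

In this example the two words differ in two positions, so Vsum settles at 6/8 of VPRE and the row is selected for R = 2 but not for R = 1.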
4.2.2.2 Counter Array Using Hierarchical Binary Decision (CA-HBD)
The proposed CA-HBD architecture minimizes the inter-block data transfer as shown in
Fig. 4.8(c), where GDB and LDB are global and local decision blocks, respectively. The CA-
HBD architecture requires recording the row access count Ni, i.e., the number of accesses
to each physical address ai in A during the WRITE operation. The row access count Ni is
recorded in an additional column in the CA.
Figure 4.10: Global decision block (GDB) to incorporate local decisions (ym,j) with impact
factor Nm.
During the READ operation, the LDB of the m-th block generates a local binary decision
ym,j and Nm as follows:
$$ N_m = \sum_{i \in H_m} N_i \tag{4.18} $$
$$ y_{m,j} = \mathrm{sgn}\Big( \sum_{i \in H_m} c_{ij} \Big) \tag{4.19} $$
where Hm is the set of row indices in the m-th block CAm that were selected during READ,
and Nm represents the sum of the row access counts for these rows.
Finally, the GDB (see Fig. 4.10) generates the ﬁnal SDM output bit yj as follows:
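$$ y_j = \mathrm{sgn}\Big\{ \sum_{m=1}^{M} \mathrm{sign}(y_{m,j})\, N_m \Big\}, \quad \text{where} \tag{4.20} $$
$$ \mathrm{sign}(x) = \begin{cases} 1, & \text{if } x > 0 \\ -1, & \text{otherwise} \end{cases} $$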
Thus, the GDB weights CAm's contribution ym,j by Nm in order to assign more weight to
those blocks which were accessed more frequently during the WRITE phase. In this manner,
the LDB transmits compressed information to the GDB, as each ym,j is a binary number. Thus,
J bits are required per block instead of J(Bc + Bx) bits, as shown in Fig. 4.8(c), thereby
minimizing the delay and the energy penalty for the data transfer.
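The two-level decision in (4.18)-(4.20) can be sketched in Python as below. The function names `local_decision` and `global_decision` and the list-based data layout are hypothetical and chosen only for illustration.

```python
def local_decision(counters, row_access_counts, selected_rows):
    """LDB of one block: returns (y_m, N_m) per (4.18)-(4.19).

    counters          : rows of the block's counter array C_m
    row_access_counts : N_i for each row of the block
    selected_rows     : the index set H_m selected during READ
    """
    num_cols = len(counters[0])
    n_m = sum(row_access_counts[i] for i in selected_rows)
    y_m = [1 if sum(counters[i][j] for i in selected_rows) >= 0 else 0
           for j in range(num_cols)]
    return y_m, n_m


def global_decision(local_results):
    """GDB: weights each block's binary vote by N_m per (4.20)."""
    num_cols = len(local_results[0][0])
    y = []
    for j in range(num_cols):
        vote = sum((1 if y_m[j] > 0 else -1) * n_m
                   for y_m, n_m in local_results)
        y.append(1 if vote >= 0 else 0)
    return y


block1 = local_decision([[3, -2], [1, -1]], [4, 2], [0, 1])
block2 = local_decision([[-1, 5]], [3], [0])
y = global_decision([block1, block2])
```

Each block sends only its J-bit vector and the scalar Nm to the GDB, which is the compression that CA-HBD relies on. In the example, the first block has the larger access count, so its local decision prevails in both columns.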
4.2.3 Circuit-Aware Behavioral, Energy, and Delay Models
The analog-intensive FR and BLP operations of the DIMA-AD are intrinsically vulnerable
to various sources of noise due to their low-SNR operation. The dominant sources of noise
in the DIMA-AD are: (1) local transistor threshold voltage Vth-variation across bitcells
caused by random dopant ﬂuctuations, and (2) input oﬀset of the analog comparator. This
section derives behavioral models of the non-ideal behavior of DIMA-AD to predict system
performance. Energy and delay models are also provided.
4.2.3.1 Behavioral Model
In the FR operation, the Vth variations were modeled as a Gaussian distributed random
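variable in [35]. In this section, two binary numbers a and p (we omit indices i and j for
simplicity) are read via FR. The impact of Vth mismatch on the BL/BLB voltages is modeled as
follows:
$$ f_{V_{BL}}(V_{BL}; a, p) = \mathcal{N}\big(V_{PRE} - (\bar{a} + \bar{p})\,\Delta V_{BL},\ (\bar{a} + \bar{p})\,\sigma^2_{cell}\big) $$
$$ f_{V_{BLB}}(V_{BLB}; a, p) = \mathcal{N}\big(V_{PRE} - (a + p)\,\Delta V_{BL},\ (a + p)\,\sigma^2_{cell}\big) \tag{4.21} $$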
where fVBL(VBL; a, p) and fVBLB(VBLB; a, p) are the probability density functions of VBL and
VBLB, respectively, parametrized by a and p. N (µ, σ2) is the normal distribution with mean
µ, and variance σ2, and σ2cell is the variance of ∆VBL due to Vth variation across the storage
array A. It is assumed that Vth variations for the bit- and replica cells are identical.
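The comparator outputs XBL and XBLB are obtained as:
$$ X_{BL} = \begin{cases} 0, & \text{if } V_{BL} < V_{ref} + V_{offset} \\ 1, & \text{otherwise} \end{cases} $$
$$ X_{BLB} = \begin{cases} 0, & \text{if } V_{BLB} < V_{ref} + V_{offset} \\ 1, & \text{otherwise} \end{cases} $$
$$ f(V_{offset}) = \mathcal{N}(0, \sigma^2_{comp}) \tag{4.22} $$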
where an input oﬀset voltage (Voffset) of the comparator is modeled as a zero mean Gaussian
random variable with variance σ2comp.
The charge injection noise in the switches and thermal noise/mismatch of capacitors in
the capacitive adder are made negligible by ensuring C > 10 fF [81]. The single comparator
at the output of capacitive adder can be designed to have a small input oﬀset by using large
transistor sizes and calibration techniques.
The behavioral models in this section are validated in Section 4.2.4.
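A Monte Carlo sketch of the behavioral models (4.21)-(4.22) is given below, using only the Python standard library. The function names are hypothetical. The default values ∆VBL = 125 mV, σcell = 6.5% of ∆VBL, and σcomp = 18 mV are the figures reported in Section 4.2.4; drawing an independent comparator offset for every evaluation is a simplifying assumption, since in hardware the offset is fixed per comparator instance.

```python
import random


def noisy_xor(a, p, rng, v_pre=1.0, dv_bl=0.125,
              sigma_cell=0.065 * 0.125, sigma_comp=0.018):
    """One sample of a XOR p computed through the models (4.21)-(4.22)."""
    v_ref = v_pre - dv_bl / 2
    n_bl = (1 - a) + (1 - p)   # number of cells discharging BL
    n_blb = a + p              # number of cells discharging BLB
    # (4.21): Gaussian BL/BLB voltages, variance scaled by the cell count.
    v_bl = rng.gauss(v_pre - n_bl * dv_bl, (n_bl ** 0.5) * sigma_cell)
    v_blb = rng.gauss(v_pre - n_blb * dv_bl, (n_blb ** 0.5) * sigma_cell)
    # (4.22): comparators with zero-mean Gaussian input offset.
    x_bl = 0 if v_bl < v_ref + rng.gauss(0.0, sigma_comp) else 1
    x_blb = 0 if v_blb < v_ref + rng.gauss(0.0, sigma_comp) else 1
    return 1 - (x_bl | x_blb)  # NOR2 gate


def xor_error_rate(trials, seed=0, **model_params):
    """Fraction of random (a, p) pairs whose noisy XOR differs from a ^ p."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        a, p = rng.randint(0, 1), rng.randint(0, 1)
        errors += noisy_xor(a, p, rng, **model_params) != (a ^ p)
    return errors / trials
```

With both standard deviations set to zero the sketch reduces to the ideal EXOR, which gives a quick sanity check before sweeping ∆VBL or the noise parameters.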
4.2.3.2 Delay and Energy Models
The delay per READ of the conventional SDM and the DIMA-SDM are described as follows:
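$$ T_{SDM} = T_{AD} + T_{CA} \tag{4.23} $$
$$ T_{AD} = (I/M)(J/B_{IO})\,T_{read} $$
$$ T_{CA} = S_{H,max}(J/B_{IO})\,T_{read} + M \lceil J(B_c + B_x)/N_{GBL} \rceil\, T_{GBL} $$
$$ T_{DIMA-SDM} = T_{DIMA-AD} + T_{CA-HBD} \tag{4.24} $$
$$ T_{DIMA-AD} = (I/M)\,T_{read} $$
$$ T_{CA-HBD} = S_{H,max}(J/B_{IO})\,T_{read} + M \lceil J/N_{GBL} \rceil\, T_{GBL} $$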
Figure 4.11: Normalized delay of single READ operation with BIO = 8− 64
(BIO : J = 1 : 4− 32).
where TAD (TDIMA−AD) and TCA (TCA−HBD) are the delay for AD (DIMA-AD) and CA
(CA-HBD). In addition, Tread is the delay for a single read access for memory arrays A and
C, TGBL is the delay in transferring a single bit via the GBLs, NGBL is the number of GBLs,
and SH,max is the maximum number of selected addresses per block. It is assumed that all
other blocks are operating in parallel while the memories are being accessed. Hence, the
delay of blocks such as the logic blocks in the conventional AD and the capacitive adder in
DIMA-AD are not included in (4.23)-(4.24). The factors (I/M) and (J/BIO) in TAD are
equal to the number of rows and the number of read outs per row, respectively, in ADm.
The partial sums per block are transferred serially through NGBL GBLs, thus requiring
⌈J(Bc + Bx)/NGBL⌉ cycles.
The throughput enhancement of DIMA-SDM over SDM derives from: (1) J/BIO ≥ 4 in
TAD, and (2) Bc+Bx ≥ 8 in TCA. The delay models in (4.23)-(4.24) are plotted in Fig. 4.11,
where it is assumed that Tread and TGBL take two clock cycles. DIMA-SDM demonstrates a
25× smaller delay compared to SDM with BIO = 8 due to the high bandwidth of DIMA-AD
when M = 4. The beneﬁt of HBD at M = 2048 is evident as there is a 3.2× additional
delay reduction as compared to DIMA-SDM without HBD.
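The delay models (4.23)-(4.24) are simple enough to evaluate directly. The sketch below uses the parameters of Table 4.2 and two clock cycles for Tread and TGBL. Two values are assumptions that the text does not state explicitly: SH,max = 51 (taken from SH ≈ 0.1I applied to one block of 512 rows) and BIO = 64 for the counter-array read in DIMA-SDM.

```python
from math import ceil


def sdm_delay(b_io, m, rows_per_block=512, j=256, s_h_max=51,
              b_c=4, b_x=4, n_gbl=256, t_read=2, t_gbl=2):
    """Delay per READ of the conventional SDM, in clock cycles (4.23)."""
    t_ad = rows_per_block * (j / b_io) * t_read
    t_ca = (s_h_max * (j / b_io) * t_read
            + m * ceil(j * (b_c + b_x) / n_gbl) * t_gbl)
    return t_ad + t_ca


def dima_sdm_delay(b_io, m, hbd=True, rows_per_block=512, j=256, s_h_max=51,
                   b_c=4, b_x=4, n_gbl=256, t_read=2, t_gbl=2):
    """Delay per READ of DIMA-SDM (4.24); hbd=False keeps full partial sums."""
    t_ad = rows_per_block * t_read               # one read cycle per row
    bits_per_block = j if hbd else j * (b_c + b_x)
    t_ca = (s_h_max * (j / b_io) * t_read
            + m * ceil(bits_per_block / n_gbl) * t_gbl)
    return t_ad + t_ca


speedup_vs_bio8 = sdm_delay(b_io=8, m=4) / dima_sdm_delay(b_io=64, m=4)
speedup_vs_bio64 = sdm_delay(b_io=64, m=4) / dima_sdm_delay(b_io=64, m=4)
```

Under these assumed values, the ratio against the conventional SDM at M = 4 comes out near 25 for BIO = 8 and slightly above 3 for BIO = 64. The absolute numbers depend on the assumed SH,max, so the sketch should be read as a way to explore the model rather than as a reproduction of Fig. 4.11.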
The energy consumption per READ of the conventional SDM and the DIMA-SDM are
modeled as follows:
$$ E_{SDM} = E_{AD} + E_{CA} \tag{4.25} $$
$$ E_{AD} = I\,[(J/B_{IO})(E_{PRE} + E_{leak}) + J E_{SA} + E_{logic}] $$
$$ E_{CA} = S_H B_c\,[(J/B_{IO})(E_{PRE} + E_{leak}) + J E_{SA}] $$
$$ E_{PRE} = J\,C_{BL}\,\Delta V_{BL}\,V_{PRE} $$
$$ E_{leak} = I J\,P_{leak\_cell}\,T_{read} $$
$$ E_{DIMA-SDM} = E_{DIMA-AD} + E_{CA-HBD} \tag{4.26} $$
$$ E_{DIMA-AD} = I\,(2E_{PRE} + E_{leak} + 2J E_{comp} + E_{a\_add}) $$
$$ E_{CA-HBD} < E_{CA} $$
where EAD (EDIMA−AD) and ECA (ECA−HBD) are the energy consumptions of the AD
(DIMA-AD) and CA (CA-HBD), respectively. EPRE and Eleak are the energy consump-
tions of the precharge and bitcell leakage for the entire memory array A, respectively, and
ESA (Ecomp) is for a single unit of sense ampliﬁer (analog comparator). Pleak_cell is the leak-
age power of each bitcell. The energy consumptions of the analog capacitive adder in the
DIMA-AD and logic blocks in the conventional AD per single Hamming distance computation
are denoted by Ea_add and Elogic, respectively. Energy consumptions from other blocks,
such as the WL drivers [82] and the CA's decision blocks, are assumed to be negligible. It is
assumed that the AD and CA can be placed into a deep sleep mode independently [49]. Note that
EAD ≫ ECA because I ≫ SHBc. EDIMA−AD has a scaling factor of two for the first and
third terms because DIMA-AD reads both aij and pj, and employs two comparators per bitcell
column.
The energy efficiency of DIMA-AD derives from the facts that: (1) the first term, which is
the largest in both EAD and EDIMA−AD, is scaled by J/BIO ≥ 4 in EAD but only by two in
EDIMA−AD, (2) Ea_add ≪ Elogic as the capacitances in the capacitive adder are very small,
e.g., 10 fF, and the capacitive adder requires only simple switching operations, and (3) the
leakage energy in EDIMA−AD is smaller than that in EAD because the high throughput (see
delay models (4.23)-(4.24)) of DIMA-SDM permits it to be placed into a deep sleep mode much
sooner than SDM [49].
Figure 4.12: Normalized energy based on models (4.25), (4.26) with ∆VBL = 75 mV and
125 mV (obtained from Section 4.2.4.4) for SDM and DIMA-SDM, respectively.
The energy models (4.25) and (4.26) using typical design parameters from Table 4.2 are
plotted in Fig. 4.12. The component values of (4.25) and (4.26) are obtained from
Section 4.2.4. Figure 4.12 indicates that DIMA-SDM achieves energy reductions of 2.1× to
12.4× over SDM.
4.2.4 Simulation Results
In this section, we apply SDM for handwritten digit recognition. Monte Carlo circuit
(HSPICE) simulations in 65 nm CMOS process technology are employed to validate the
models in (4.21) and (4.22). These models are employed in system simulations to estimate
the output bad pixel ratio (Bo). Energy and throughput beneﬁts are demonstrated via
circuit simulations and the energy/throughput models (4.23), (4.24), (4.25), and (4.26).
4.2.4.1 System Conﬁguration
Nine 16 × 16 binary images of the digits 1 to 9 are employed to generate p and d, as
shown in Fig. 4.13(a) and (b). For each of the nine patterns, 255 noisy copies with input
bad pixel ratio Bi = 0.25 are generated by randomly flipping 25% of the bits. These images
form the training data set St of size 255 × 9 = 2295 and are written into the SDM in the
auto-associative mode (p = d).
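The generation of noisy copies can be sketched as follows. The helper name `noisy_copy` is hypothetical, and flipping exactly round(Bi × J) distinct bit positions is an assumption; the text only states that 25% of the bits are flipped at random.

```python
import random


def noisy_copy(pattern, bad_pixel_ratio, rng):
    """Return a copy of `pattern` with round(ratio * len(pattern)) bits flipped."""
    num_flips = round(bad_pixel_ratio * len(pattern))
    flipped = set(rng.sample(range(len(pattern)), num_flips))
    return [bit ^ 1 if j in flipped else bit for j, bit in enumerate(pattern)]


rng = random.Random(1)
clean = [0] * 256                      # placeholder for one 16 x 16 binary image
training_copies = [noisy_copy(clean, 0.25, rng) for _ in range(3)]
```

Each copy then differs from the clean image in exactly 64 of the J = 256 positions, which corresponds to Bi = 0.25.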
In the hetero-associative mode, during the training phase, p and d are assigned images
corresponding to consecutive numbers, e.g., if p is assigned the image corresponding to 4,
then d is assigned the image corresponding to 5. Thus, during the READ operation, the SDM
retrieves the image corresponding to the number that is one greater than the input number,
as shown in Fig. 4.13(b).
After the training, 100 contaminated copies of each pattern (a total of 900 inputs) with
Bi = 0.15, 0.25, and 0.3 are generated and provided as the address p for classification. Four
READ iterations of the auto- and hetero-associative memory are performed. The error
immunity against faulty hardware increases with larger R in (4.8), as each data vector d is
distributed across a greater number of physical addresses. On the other hand, R needs to be
small enough not to create excessive intersection between the physical addresses selected
for images of different digits. To balance this trade-off, R for the WRITE and READ
operations is set to 79 and 82, respectively.
The block size I/M = 512 is chosen to balance the read-out delay and area efficiency of
the memory. BIO = 64 is chosen to permit the maximum bandwidth, as J/BIO typically ranges
from 4 to 32 in conventional SRAM architectures [49]. In this specific application, M = 4
and SH ≈ 0.1I, which will be used in the rest of this section. The design parameters
(other than BIO and M) used in the simulations are summarized in Table 4.2.
4.2.4.2 Model Validation
Monte-Carlo circuit simulations show that σcell/∆VBL = 6.5% and the analog comparator
has an input oﬀset σcomp = 18 mV.
Figure 4.13: Behavior of SDM, DIMA-SDM (without HBD) and DIMA-SDM with M = 4
in: (a) auto-associative mode, (b) hetero-associative mode, and (c) the output bad pixel
ratio Bo[n] in the auto-associative mode.
Table 4.2: Design parameters for SDM.
Parameter          Value     Parameter    Value
VDD (= VPRE)       1 V       M            4-2048
I/M                512       J            256
Bc (= Bx)          4         BIO          8-64
Clock frequency    1 GHz     NGBL         256
C                  10 fF     CBL          230 fF
Figure 4.14: Vsum [mV] versus Hamming distance from circuit simulations and system
simulations with behavioral models (4.21) to (4.22) with J = 8 and ∆VBL = 50 mV.
The behavioral models of the entire analog signal processing chain from the bitcell to
the ﬁnal output Vsum are validated as shown in Fig. 4.14, where the results of Monte-
Carlo circuit simulations are compared with those from system simulations employing the
behavioral models (4.21)-(4.22). Nine diﬀerent combinations of p and ai (with J = 8) are
chosen as inputs. Figure 4.14 indicates that the maximum modeling error is 4.5% of the
dynamic range of Vsum. This level of accuracy is sufficient for system performance estimation,
as it is much less than the Vsum step corresponding to a Hamming distance of one.
Figure 4.15: Energy vs. Bo trade-oﬀ with n = 4. Here ∆VBL = 25 mV-125 mV for SDM
and 75 mV-175 mV for DIMA-SDM.
4.2.4.3 System Performance
The output bad pixel ratio Bo[n] in the n-th READ iteration is computed as:
$$ B_o[n] = \frac{1}{900J} \sum_{k=1}^{900} H_k[n] \tag{4.27} $$
where Hk[n] is the Hamming distance between the SDM output yk[n] at time index n and
the ideal output for the k-th input image.
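The metric in (4.27) is a bit error rate averaged over all test images. A minimal Python sketch follows; the function name `bad_pixel_ratio` is hypothetical, and the code divides by the total number of bits so that it also works for test sets other than the 900 images used here.

```python
def bad_pixel_ratio(outputs, ideals):
    """(4.27): total Hamming distance divided by the total number of bits."""
    total_bits = sum(len(y) for y in outputs)
    errors = sum(y_bit != t_bit
                 for y, t in zip(outputs, ideals)
                 for y_bit, t_bit in zip(y, t))
    return errors / total_bits


b_o = bad_pixel_ratio([[1, 0, 1, 1], [0, 0, 0, 0]],
                      [[1, 1, 1, 1], [0, 0, 0, 1]])
```

In this toy example two of the eight output bits are wrong, so the ratio is 0.25.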
Figure 4.13(c) shows that the conventional SDM, the DIMA-SDM (without HBD), and the
DIMA-SDM all converge to a Bo of less than 2% for n ≥ 3 when Bi ≤ 25%. Similar
results were observed for the hetero-associative mode as well. Furthermore, SDM and
DIMA-SDM were found to achieve Bo[n] values within 5% of each other for n ≤ 3 (the Bo of
DIMA-SDM is slightly worse) for all three values of Bi. The Bo of DIMA-SDM was higher
than that of SDM by only 0.4% for n = 4 and Bi = 25%, indicating that the non-ideal behavior
of DIMA-SDM is successfully compensated by the inherent noise immunity of SDM and the
associative mode of operation. The Bo degradation of the DIMA-SDM can be reduced by
increasing the number of blocks M with large I so that more averaging can occur.
Figure 4.16: Energy breakdown for a single READ operation with BIO = 64, and
∆VBL = 75 mV for SDM and 125 mV for DIMA-SDM.
4.2.4.4 Delay and Energy Savings
The conventional SRAM read access and FR each require two clock cycles, and the data
transfer from the LDB to the GDB also requires two cycles. The proposed DIMA-SDM achieves
a 3.1× smaller delay than SDM with M = 4, as shown in Fig. 4.11, due to the high bandwidth
of DIMA-SDM.
The various components of the energy models in (4.25) and (4.26) are measured via
HSPICE simulations. To do so, the parasitic capacitance of BL (CBL = 230 fF) is ex-
tracted from the layout of an SRAM bitcell. These energy components are a function of the
BL swing ∆VBL. The intrinsic robustness of SDM and the associative mode of operation
enable a lower value of ∆VBL to be employed as compared to a typical value in standard
SRAM, thereby resulting in even greater energy savings.
Figure 4.15 shows the trend of Bo with ∆VBL scaling: Bo exceeds 2% when ∆VBL falls below
75 mV and 125 mV in the conventional SDM and DIMA-SDM, respectively. Thus, these
energy-optimal values of ∆VBL are applied to the conventional SDM and DIMA-SDM in order to
obtain the energy breakdowns in Fig. 4.16. This figure shows that DIMA-SDM achieves
approximately 2.1× lower energy than SDM.
4.3 Conclusion
In this chapter, the versatility of DIMA has been proven with two applications: CNN and
SDM. More speciﬁcally, the SDM provides great potential to address stochastic and unre-
liable behavior of nanoscale fabrics due to its inherent robustness from ensemble decision-
making process. However, conventional digital architectures fail to exploit the error tolerance
for throughput and energy beneﬁts. This chapter proved that DIMA can be a possible solu-
tion to such memory-intensive inference algorithms/applications. It was also demonstrated
that the energy and throughput benefits can be further improved by co-optimizing the
algorithm, as shown by the HBD for SDM and error-aware re-training for CNN. The benefits
of the proposed architectures are expected to increase with data volume and input data size,
as more data movement cost is saved. Extensions to architectures based on emerging
memory topologies are potential future directions.
Chapter 5
MATI: DIMA INSTRUCTION SET ARCHITECTURE
The signiﬁcant energy and throughput beneﬁts achieved by multi-functional DIMA natu-
rally motivated us to consider a programmable deep in-memory instruction set architecture
(ISA). However, as DIMA relies on array pitch-matched analog computations, its benefits
have been demonstrated only for four fixed-function scenarios. This raises the question: Can
DIMA be made programmable without losing much of its energy and throughput benefits
over its digital counterparts? Answering this question is complicated because such
architectures rely heavily on highly area-constrained (array pitch-matched) low-swing analog
computations, and therefore introducing programmability into them is a major
challenge.
In this chapter, we propose MATI, a programmable deep in-memory ISA, that addresses
programming challenges in DIMA via a synergistic combination of instruction set, architec-
ture and circuit design. MATI builds upon the multi-functional (four-function) DIMA [39]
introduced in Chapter 3, which will be called Compute Memory (CM) in this chapter, in
order to enable a wide range of ML algorithms to be executed largely within the low-swing
mixed-signal domain. To achieve this goal, a number of challenges in the design of the
instruction set, architecture and circuit need to be addressed synergistically, including the
following:
1. analog ISA challenge: the instruction set needs to accommodate the intrinsically
sequential and analog nature of the four-stage architecture, while enabling a diversity
of tasks that may be encountered across benchmarks, and at the same time expose the
underlying CM mechanisms to be exploited by the compiler.
2. functional density challenge: each stage needs to implement many more functions
and options (CM has 8 while MATI realizes 32 functions spread across four stages)
while satisfying geometric pitch-matching constraints imposed by the BCA and the
inter-stage dynamic range matching issues.
3. throughput-accuracy loss challenge: the number of possible tasks increases dramat-
ically (CM has 4 while MATI has more than 120) which causes a loss in throughput
due to the increase in the worst-case delay. This loss in throughput leads to a loss in
the accuracy of analog computations due to increased leakage.
We show that MATI achieves a high level of programmability without losing the eﬃciency
beneﬁts of deep in-memory computing via a synergistic combination of instruction set, ar-
chitecture and circuit design. In particular, we propose the following techniques to address
the above mentioned issues:
1. We address the analog ISA challenge by the use of 51-bit macro (task-level)
instructions comprising four classes that are closely matched to MATI's four-stage
architecture and one additional class that allows independent control of each stage
function, as well as control of the data format, BL voltage swing, and ADC precision.
2. We address the functional density challenge by node sharing and reconfiguration to
minimize the number of circuit elements needed per stage.
3. We address the throughput-accuracy loss challenge by a combination of analog pipelin-
ing, as late as possible (ALAP) scheduling of operations within each stage, and dynamic
task period (DTP) assignment.
Employing silicon-validated energy, delay and behavioral models of deep in-memory com-
ponents, we demonstrate that MATI is able to realize nine (potentially many more) ML
benchmarks while incurring negligible overhead in energy (< 0.1%), and area (4.5%), and
no overhead in throughput, over a pipelined version of CM [39]. In this process, MATI is
able to simultaneously achieve enhancements in both energy (2.5× to 5.5×) and through-
put (1.4× to 3.4×) for an overall energy-delay product improvement of up to 12.6× over
ﬁxed-function digital architectures. This work indicates the potential beneﬁts of designing
a full-fledged compiler to map a wide variety of applications onto MATI, a key goal of our
current work.
Table 5.1: Machine learning (ML) algorithms.
Algorithm f(D(W,X))         Inner loop kernel D(W,X) = Σ_{i=1}^{N} d(w[i], x[i])    f( )
SVM                         Σ_{i=1}^{N} w[i]x[i]                                    sign
Temp. Match. (L1)           Σ_{i=1}^{N} |w[i] − x[i]|                               min
Temp. Match. (L2)           Σ_{i=1}^{N} (w[i] − x[i])^2                             min
DNN                         Σ_{i=1}^{N} w[i]x[i]                                    sigmoid
Feature extraction (PCA)    Σ_{i=1}^{N} w[i]x[i]                                    --
k-NN (L1)                   Σ_{i=1}^{N} |w[i] − x[i]|                               majority vote
k-NN (L2)                   Σ_{i=1}^{N} (w[i] − x[i])^2                             majority vote
Matched filter              Σ_{i=1}^{N} w[i]x[i]                                    min
Linear Regression           Σ_{i=1}^{N} w[i]                                        accumulate
                            Σ_{i=1}^{N} w[i]^2                                      accumulate
                            Σ_{i=1}^{N} w[i]x[i]                                    accumulate
In Section 5.1, we describe various ML algorithms and provide an introduction to DIMA.
MATI and its instruction set are described in Section 5.2. Section 5.3 describes the validation
methodology followed by experimental results in Section 5.4. Section 5.5 concludes this
chapter.
5.1 Background
In this section, we ﬁrst analyze a variety of ML algorithms in order to identify commonalities
in their data ﬂow and then review DIMA brieﬂy to show that these architectures are well-
suited for ML algorithms.
ML algorithms [83] can be described by a series of nested loops where the innermost
loop is the computation of the vector distance (VD) denoted as D(W,X), between N -
dimensional input vector X and weight vector W as shown in Table 5.1 and Listing 5.1.
Commonly used VDs include the dot product (DP), L1 distance (Manhattan distance), L2
distance (Euclidean distance), and Hamming distance (HD). In ML algorithms such as the
SVM, template matching, DNN, k-NN, and matched filter, VD computation dominates the
computational and memory access cost as N can be quite large (e.g., 128 to 1024) and, being
located in the innermost loop, this computation is invoked very frequently.
Listing 5.1: Pseudo code for core computations of inference / machine learning (ML)
algorithms.
    // x[i]    : i-th element of input vector X
    // y[j]    : j-th element of output vector Y
    // w[j][i] : i-th element of j-th weight vector W
    for (j = 0; j < No; j++) {
        y[j] = 0;
        for (i = 0; i < N; i++) {
            y[j] += d(w[j][i], x[i]);   // accumulate scalar distances into the VD
        }
        y[j] = f(y[j]);                 // apply the decision function
    }
Additionally, most ML algorithms have the following data-flow properties in common:
1. A single VD is obtained by first computing N element-wise scalar distances (SDs)
d(w[i], x[i]), followed by an aggregation step, such as a sum or average, that generates
the final scalar VD $D(W,X) = \sum_{i=1}^{N} d(w[i], x[i])$.
2. The VD between a single query vector X and multiple (say No) weight vectors Wj
(j = 1, 2, ...No) needs to be computed, e.g., template matching computes VDs between
one query vector X and a large number of Wjs, while a DNN needs to compute the
dot-product between an input feature map X and many weight kernels Wj to obtain
No output feature maps.
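The two data-flow properties above can be made concrete with a short Python sketch of the VD kernels in Table 5.1. The function names `vector_distance`, `all_distances`, and `template_match`, as well as the string labels used to select a kernel, are hypothetical and serve only as an illustration.

```python
def vector_distance(w, x, kind):
    """Aggregate N element-wise scalar distances d(w[i], x[i]) into one VD."""
    if kind == "dot":
        return sum(wi * xi for wi, xi in zip(w, x))
    if kind == "l1":
        return sum(abs(wi - xi) for wi, xi in zip(w, x))
    if kind == "l2":
        return sum((wi - xi) ** 2 for wi, xi in zip(w, x))
    if kind == "hamming":
        return sum(wi != xi for wi, xi in zip(w, x))
    raise ValueError(f"unknown distance kind: {kind}")


def all_distances(weights, x, kind):
    """Compute the VD between one query X and each of the No weight vectors Wj."""
    return [vector_distance(w, x, kind) for w in weights]


def template_match(templates, x, kind="l1"):
    """Template matching: index of the template with the minimum VD."""
    distances = all_distances(templates, x, kind)
    return distances.index(min(distances))


templates = [[1, 2, 3], [4, 0, 1]]
query = [1, 2, 2]
best = template_match(templates, query)
```

The inner sum over i is the part that dominates cost when N is large, and the outer loop over the weight vectors is what makes the same query X reusable across many memory rows.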
The conventional and DIMA architectures for inference algorithms are shown in Fig. 5.1(a)
and (b), respectively. As described in Chapter 2, DIMA drastically reduces both the memory
access cost associated with Wj, via FR, and the cost of VD computations, by realizing them
in the analog domain via BLPs in close proximity to the BCA.
[Figure 5.1 block labels. (a): WL drivers, precharge, Ncol × Nrow 6T SRAM BCA, L:1 column muxes, Ncol/L sense amplifiers, CTRLer, operand data buffer (ODB), digital processor, decision output. (b): PWM WL drivers, precharge, Ncol × Nrow 6T SRAM BCA, BLPs 1 to Ncol, cross BL processor (CBLP), ADC, RDL, CTRLer, operand data buffer (ODB), decision output.]
Figure 5.1: Block diagrams of: (a) a conventional digital (SRAM-based) architecture, and
(b) deep in-memory architecture (DIMA). Analog (shaded) and digital (unshaded) blocks
are indicated. The SRAM size is Nrow ×Ncol, and Bw and Bx are the bit precisions of
scalar weight w and scalar data x, respectively. Note that DIMA realizes the bulk of its
computations in analog.
[Figure 5.2 timing diagrams, panels (a)-(d). Stage legend: S1: MR-READ, S2: BLP+CBLP, S3: ADC, S4: RDL.]
Figure 5.2: Sequential four-stage processing in DIMA for various cases: (a) un-pipelined
ﬁxed-function, (b) with operational diversity per stage, (c) with pipelining, and (d) as late
as possible (ALAP) processing for S1, and dynamic task period (DTP) assignment.
5.2 MATI: A Programmable DIMA
We begin this section by describing in greater detail the various challenges that arise when
programmability needs to be introduced into DIMA [39] along with proposed solutions to
those challenges. These solutions are then leveraged to develop MATI, a programmable
DIMA and its ISA.
5.2.1 Programmability Challenges in DIMA
The combination of algorithmic diversity and analog-heavy operations in DIMA creates a
number of challenges that need to be addressed via a synergistic application of techniques
at the instruction set, architecture, and circuit levels.
5.2.1.1 Analog ISA
DIMA's analog processing chain imposes an intrinsic sequentiality on the order in which
each of its four stages (S1-S4) operates, i.e., S1:MR-READ → S2:BLP+CBLP →
S3:ADC → S4:residual digital logic (RDL). Additionally, in order to cover a broad range of
ML benchmarks, each stage needs to provide a sufficient number of options for the
operations it can execute (operational diversity). For example, the distribution of the number
of operations in CM [39] is (S1 = 3, S2 = 2, S3 = 1, S4 = 2), whereas the corresponding
distribution for MATI is (S1 = 6, S2 = 10, S3 = 9, S4 = 7). Therefore, MATI's instruction set
needs to accommodate this sequentiality along with analog circuit constraints for any arbitrary
combination of operations required to realize a specific ML algorithm. We meet this
sequentiality constraint by describing ML algorithms in terms of tasks, where each task is a
specific sequence of operations obtained by selecting one operation in each stage. This is why
MATI defines macro instructions composed of four Classes, closely matched to its four-stage
architecture. In addition, consecutive processing stages need to be physically co-located to
avoid substantial degradation in analog levels from one stage to the next. Furthermore, a
large capacitor ratio between consecutive stages (input-to-output ratio of 20:1) is required to
limit the voltage drop to less than 5%.
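The 20:1 ratio can be related to the 5% bound with a simple charge-sharing estimate. The model below is an assumption rather than something stated in the text: a value $V_{in}$ held on a capacitor $C_{in}$ is shared with an initially discharged capacitor $C_{out}$ of the next stage, and all other parasitics are ignored.

```latex
V_{out} = V_{in}\,\frac{C_{in}}{C_{in} + C_{out}}, \qquad
\frac{V_{in} - V_{out}}{V_{in}} = \frac{C_{out}}{C_{in} + C_{out}}
= \frac{1}{20 + 1} \approx 4.8\% < 5\%
\quad \text{for } C_{in} : C_{out} = 20 : 1 .
```

Under this idealized model, any ratio of at least 19:1 would keep the drop at or below 5%; the actual sizing would also depend on parasitic and leakage effects that the estimate omits.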
[Figure 5.3 block labels: functional WL drivers; 512 × 256 6T SRAM bitcell array; read/write path with write data buffer (WDB) and R/W CTRL; BL processors (BLP: mult/comp/abs/sq); cross BL processor (CBLP) with sampler bank and analog addition (charge sharing); sampler and ADC bank; operand data buffer (ODB) [7:0][255:0]; residual digital logic (RDL) bank (acc., mean, threshold, min, max, sigmoid, ReLu); instruction register holding task1 to task4; MAIN CTRLer and Class 1-4 CTRLers; 64-b IO. Stage-to-Class mapping: Class 1 (MR-READ), Class 2 (BLP, CBLP), Class 3 (ADC), Class 4 (RDL).]
Figure 5.3: The MATI architecture (analog blocks shaded) in a single bank conﬁguration
comprising a 16 KB SRAM BCA, standard read and write paths (bottom), in-memory
computation blocks (middle), and residual digital logic (RDL) and the controller (CTRLer)
on the right. The natural sequentiality imposed by MATI's analog processing is exploited
by defining the Classes shown next to each stage.
5.2.1.2 Functional Density
While each stage needs to provide a sufficient number of operations, unlike in conventional
architectures, this operational diversity needs to be provided while meeting the stringent
geometric pitch-matching constraints imposed by the BCA. The functional density challenge is
addressed by sharing circuit components within each stage. For example, a single comparator
in the BLP is used for three operations: comparison, absolute value, and signed multiplication.
The capacitive multiplier in the BLP is also reused as an analog latch to support pipelined
operations. Through such component sharing techniques, we were able to provide a
five-function BLP vs. the two-function BLP in CM [39] with only a 16% increase in the area
of the BLP. Note that the BLP itself consumes approximately 25% of the overall area, so
this area increase led to about a 4.5% increase in the total area.
5.2.1.3 Throughput-Accuracy Loss
The increased operation diversity of each stage leads to an increase in the task period TP
when accommodating the worst-case delay of each stage as shown in Figs. 5.2(a) and (b).
This loss in throughput can be severe, with up to a 2× degradation when designing for the 9 ML
benchmarks addressed in this chapter. Additionally, incorporating operational diversity also
leads to idle times in each stage. The presence of these idle times is problematic in analog
computation as analog values, which are typically stored on (area-constrained) capacitors,
will degrade due to various leakage mechanisms, eventually leading to a loss in accuracy at
the application level. Thus, there is a throughput and accuracy loss caused by introducing
programmability features in DIMA.
We propose analog pipelining to address the loss in throughput (Fig. 5.2(c)). To accomplish
such pipelining, we employ the pre-existing capacitor-based samplers, previously used for
isolation between CM stages [39], as analog latches between consecutive stages. As this
technique is straightforward to realize and could have been used in CM, we compare MATI with
a pipelined CM (pipe-CM) when evaluating its benefits in Section 5.4.
Though analog pipelining increases throughput, it does not solve the accuracy problem
caused by idle times. In fact, the accuracy problem may even be worsened as all stages need
to operate with a common period TP (Fig. 5.2(c)), which is equal to the worst-case delay
across all stages. The idle time (Fig. 5.2(c)) of a specific stage is equal to the difference
between TP and the delay of that stage for the task. This problem is worst for S1 (MR-READ)
as each BL is subject to the leakage contributions from all the bitcells in that column. To
address the accuracy problem, we employ an as-late-as-possible (ALAP) (Fig. 5.2(d)) schedule
for S1 so that its idle time is minimized. In order to obtain further throughput gains and
minimize the accuracy problem, we employ dynamic task period (DTP) assignment (Fig. 5.2(d)),
where the task period TP is adapted in a task-dependent manner as described in Section 5.2.3.3.
The rest of this chapter employs the solutions outlined in this section to develop the MATI
architecture and its instruction set.
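The relation between the common period and per-stage idle time can be sketched as follows. This is an illustrative model only: the function names are hypothetical, and the stage delays used in any example call are placeholders rather than measured values.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

// Delays of stages S1..S4 for one task, in clock cycles (illustrative values).
using StageDelays = std::array<int, 4>;

// With a pipelined common period, TP equals the slowest stage delay.
int commonTaskPeriod(const StageDelays& d) {
    return *std::max_element(d.begin(), d.end());
}

// Idle time of each stage: TP minus that stage's own delay.
std::array<int, 4> idleTimes(const StageDelays& d) {
    const int tp = commonTaskPeriod(d);
    std::array<int, 4> idle{};
    for (std::size_t s = 0; s < d.size(); ++s) {
        idle[s] = tp - d[s];
    }
    return idle;
}
```

In this model, an ALAP schedule for S1 would launch MR-READ idleTimes(d)[0] cycles into the period, so that the BL values are consumed by S2 as soon as they are ready instead of leaking while they wait.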
5.2.2 MATI Architecture
The MATI architecture shown in Fig. 5.3 employs a 1 GHz main clock frequency, the same
as [39], to be able to generate control signals with a ﬁne resolution. It preserves the stan-
dard SRAM read and write functionality for additional ﬂexibility. MATI's components are
described as follows:
1. Multi-row READ (MR-READ): This stage corresponds to fetching Wjs in Listing 5.1,
with the option of realizing element-wise addition and subtraction during the read process
itself. All 256 columns in a word row are accessed simultaneously. An 8-bit word is split
across two columns in a word row by storing the 4-bit MSB and 4-bit LSB in neighboring
columns, to support sub-range reads that enhance linearity. Thus, a 128-element vector of
8-bit elements is read to generate 128 analog BL values per access.
2. Operand data buffer (ODB): The ODB holds eight 128-element vectors representing the
second operand X in Listing 5.1, which can be reused as many times as necessary.
3. BL processor (BLP) and cross BL processor (CBLP): The BLP stage computes the 128
SDs d(w, x), such as absolute value, compare, square, and multiplication. The CBLP
stage aggregates the 128 vector elements by charge-sharing to compute
$D(W,X) = \sum_{i=1}^{N} d(w[i], x[i])$.
4. Sampler and ADC: MATI has four analog samplers and eight slow but area-efficient
single-slope ADCs to convert the CBLP outputs into a digital format for further processing.
5. Residual digital logic (RDL): The RDL computes the non-linear functions f() in
Listing 5.1 as well as the aggregation needed when the vector length N > 128. Because it
operates only on the scalar outputs of the analog chain, the RDL is activated infrequently
and consumes negligible energy. Non-linear operations such as sigmoid are implemented in
the RDL via piece-wise linear approximation [84].
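The MSB/LSB storage described for MR-READ can be illustrated in the digital domain. This is a sketch under assumptions: the function names are hypothetical, and the 16:1 weighting used for recombination is assumed here as the digital equivalent of merging the two sub-range reads; the analog merging circuit itself is not modeled.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Splits an 8-bit word into its 4-bit MSB and LSB halves, mirroring the
// two-neighboring-column storage described for MR-READ.
std::pair<std::uint8_t, std::uint8_t> splitSubRanges(std::uint8_t word) {
    const std::uint8_t msb = static_cast<std::uint8_t>(word >> 4);
    const std::uint8_t lsb = static_cast<std::uint8_t>(word & 0x0F);
    return {msb, lsb};
}

// Recombines the halves with a 16:1 weighting (assumed, see lead-in).
std::uint8_t mergeSubRanges(std::uint8_t msb, std::uint8_t lsb) {
    return static_cast<std::uint8_t>((msb << 4) | (lsb & 0x0F));
}

// Number of 8-bit elements held by one word row of `columns` columns,
// given that each element occupies two neighboring columns.
constexpr int elementsPerRow(int columns) { return columns / 2; }
```

With 256 columns, elementsPerRow gives the 128-element vector width quoted for a single access.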
We describe the MATI instruction set next.
(a) Instruction format (one instruction word per task, Task1~4)

Field      Width
Class 0    33 bits
RPT_NUM    7 bits
Class 1    3 bits (OPCODE)
Class 2    4 bits (OPCODE)
Class 3    1 bit (OPCODE)
Class 4    3 bits (OPCODE)

Class 0 contents   Bits      Description
SWING              [32:30]   ΔVBL swing control code; 000: min (0 mV), 111: max (500 mV)
ACC_NUM            [29:28]   Number of operands to be accumulated in the acc. mode of Class 4
A_ADD              [27:19]   Bitcell array (BCA) address for Class 1
B_ADD1             [18:16]   ODB address (0~7) of the operand for Class 1
B_ADD2             [15:13]   ODB address (0~7) of the operand for Class 2
B_PRD              [12:11]   B_ADD1 and B_ADD2 circulate from 0 to B_PRD-1
ADC_FREQ           [10:9]    ADC operates; 00: never, 01: every iteration,
                             10: every other iteration, 11: once every four iterations
ADC_PREC           [8:6]     ADC output bit precision (1~8 bits)
DES                [5:4]     Class 4 output destination; 00: ACC output buffer,
                             01: outside, 10: ODB, 11: WDB
THRES_VAL          [3:0]     Threshold value for the threshold operation in Class 4

(b) Operations in each class

Class 1, 3 bits (OPCODE). Option: B_PRD (B_ADD1 circulates from 0 to B_PRD-1).
Operation                OP code
none                     000
write(A_ADD)             001
read(A_ADD)              010
mr_read(A_ADD)           011
sr_read(A_ADD)           100
mr_subt(A_ADD, B_ID1)    101
mr_add(A_ADD, B_ID1)     110

Class 2, 4 bits (OPCODE + C.S). Options: charge sharing (C.S) bit, 0: no charge-share,
1: charge-share for reduction; B_PRD (B_ADD2 circulates from 0 to B_PRD-1).
Operation                OP code
none                     000
avg                      001
abs                      010
square                   011
sign_dot(B_ID2)          100
unsign_dot(B_ID2)        101

Class 3, 1 bit (OPCODE).
Operation                OP code   Option
none                     0
ADC                      1         ADC_FREQ, ADC_PREC

Class 4, 3 bits (OPCODE).
Operation                OP code   Option
accumulation             000       DES, ACC_NUM
mean                     001       DES
threshold                010       DES, THRES_VAL
max                      011       DES
min                      100       DES
sigmoid                  101       DES
ReLu                     111       DES
Figure 5.4: Instruction set of MATI: (a) instruction format, and (b) operations in each
class.
5.2.3 MATI Instruction Set
The MATI architecture is unconventional, and the ISA reﬂects the unique features of the
architecture. Below, we describe the ISA and justify our design choices.
5.2.3.1 Fields of Instruction Set
The key features of the MATI instruction set (Fig. 5.4) are as follows:
1. Tasks: A MATI program consists of one or more tasks, or macro-instructions. A single
task consists of sequential operations from five Classes 0-4, described below. Classes
1-3 perform the distance computation, $D(W,X) = \sum_{i=1}^{N} d(w[i], x[i])$, of Listing 5.1,
whereas Class 4 computes f(D(W,X)). A single algorithm may sometimes require
several tasks for different distance metrics, e.g., linear regression requires three tasks,
as explained in Section 5.3. MATI can process multiple different tasks sequentially.
2. Classes : Each MATI task is partitioned into six ﬁelds: Classes 0-4 and a repetition
count, RPT_NUM. Class 0 speciﬁes parameters to control the behavior of Classes 1-4,
including conﬁguration parameters (e.g., BL voltage swing ∆VBL, ADC control, Class
4 threshold), memory addresses, and other operand speciﬁers. The Class 0 parameters
are key to achieving ﬂexible programmability in the MATI architecture.
Classes 1-4 execute sequentially and specify the behavior of the four hardware mech-
anisms: BCA access and optional element-wise addition/subtraction (Class 1), BLP
and CBLP (Class 2), optional analog-to-digital conversion (Class 3), and a residual
digital operation (Class 4). An operation for each class can be chosen out of available
options listed in Fig. 5.4(b). MATI's hardware supports a nearly arbitrary combina-
tion of operations from each class. There are >120 possible combinations, but the
hardware imposes one restriction: the ODB can provide an operand to only one of
either Class 1 or Class 2 in each Task. This restriction is irrelevant in practice as the
distance computation, D(W,X) has just two operands: the ﬁrst operand comes from
Class 1, and the second operand X can come from either Class 1 or Class 2, but not
both.
3. RPT_NUM: This parameter specifies the iteration count for the outer FOR loop of
Listing 5.1, i.e., how many times the task should be executed. The addresses for
the bitcell array (A_ADD) and ODB (B_ADD1 and B_ADD2) are incremented sequentially
for each iteration. Although unconventional for modern RISC architectures,
this is a natural choice for the typical loop structure of MATI algorithms, which iterate
sequentially through data (w[i]) in memory to perform the distance computation,
$\sum_{i=1}^{N} d(w[i], x[i])$. This choice keeps the ISA simple by eliminating explicit
per-iteration operand specifiers, a generality that is unnecessary here.
A more conventional ISA might deﬁne the operations as separate instructions, and (usu-
ally) express the RPT_NUM iterations using explicit control ﬂow. We chose the combined
macro representation because all MATI computations are straight-line code (i.e., a non-
branching loop body) executing for a number of iterations that can be computed before
entering the loop. An entire such loop can be mapped to a single MATI task (one or more
tasks per kernel), with no loss of ﬂexibility and no extra complexity in a code generator.
This is unlike a conventional general-purpose architecture executing arbitrary combinations
of operations, often including control ﬂow, where mapping to a single complex instruction
would incur far more complexity during the instruction selection.
Two configuration parameters in Class 0 give software the flexibility to trade off application
accuracy for energy and latency. ADC_PREC specifies the bit precision of the ADC output,
which depends on the accuracy required by the algorithm. SWING controls the WL voltage
level (VWL), which sets the BL swing ∆VBL and directly impacts both the accuracy and
the energy. (Section 5.4 evaluates this accuracy-energy trade-off for several algorithms.) We
envisage that compiler or autotuner techniques can be used to map high-level application
metrics (e.g., accuracy of decisions) to these hardware-level parameters; this is a subject of
our ongoing work.
5.2.3.2 Application Example
We present an example of template matching with the L1 distance kernel that finds, out of
64 candidate images (Wjs), the 128-pixel image closest to the input query image (X). The
template matching is mathematically defined as
$$j_{opt} = \arg\min_{j} \sum_{i=1}^{128} |x[i] - w[j,i]| \qquad (5.1)$$
The task's instruction word comprises RPT_NUM, which specifies the number of candidate
images; the "MR-READ and subtract" (mr_subt) operation (Class 1); the "absolute value
computation followed by aggregation (charge sharing)" operation (Class 2); the ADC (Class 3);
a digital-domain min operation that computes the f(· · ·) operator, arg min_j (Class 4); and
the parameters for Classes 1-4 (Class 0). Thus, the binary instruction for this application can
be expressed as "RPT_NUM: 1000000, Classes 1-4: 101 0101 1 100". Akin to a VLIW
(very long instruction word) architecture, a single task (wide-word macro instruction)
specifies multiple classes of spatially separable operations to be performed; unlike a VLIW
architecture, however, the operations in an instruction are sequential rather than parallel.
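The binary string quoted above can be reproduced with a small encoder. This is a sketch, not the MATI assembler: encodeTask and the constant names are hypothetical, the opcodes are copied from Fig. 5.4(b), and the Class 2 field is assumed to be the 3-bit opcode followed by the charge-sharing bit, which is the ordering consistent with the "0101" in the example.

```cpp
#include <bitset>
#include <cassert>
#include <string>

// Opcodes copied from Fig. 5.4(b) for the template-matching (L1) task.
constexpr unsigned kMrSubt = 5;  // Class 1: mr_subt (101)
constexpr unsigned kAbs = 2;     // Class 2: abs (010)
constexpr unsigned kAdcOn = 1;   // Class 3: ADC (1)
constexpr unsigned kMin = 4;     // Class 4: min (100)

// Formats RPT_NUM (7 bits) and the Class 1-4 opcodes as a bit string.
// Class 2 is assumed to be the 3-bit opcode followed by the charge-share bit.
std::string encodeTask(unsigned rptNum, unsigned class1, unsigned class2Op,
                       bool chargeShare, unsigned class3, unsigned class4) {
    assert(rptNum < 128 && class1 < 8 && class2Op < 8 && class3 < 2 && class4 < 8);
    const unsigned class2 = (class2Op << 1) | (chargeShare ? 1u : 0u);
    return std::bitset<7>(rptNum).to_string() + " " +
           std::bitset<3>(class1).to_string() + " " +
           std::bitset<4>(class2).to_string() + " " +
           std::bitset<1>(class3).to_string() + " " +
           std::bitset<3>(class4).to_string();
}
```

Calling encodeTask(64, kMrSubt, kAbs, true, kAdcOn, kMin) yields the RPT_NUM and Class 1-4 fields of the example; the 33-bit Class 0 field is omitted here because its values depend on the data layout.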
5.2.3.3 Dynamic Task Period (DTP)
As explained in Section 5.2.1.3, if TP needs to accommodate the worst-case delay over all
possible combinations of operations, the idle time increases for tasks that require less time.
This leads to a loss in throughput and to an accuracy problem due to leakage of the
sampled analog value held at the analog latch. Thus, we employ a dynamic task period (DTP)
(Fig. 5.5) that applies the optimal period TPi to the i-th task as follows:
$$T_{P_i} = \max(T_{i,Class1}, T_{i,Class2}) \qquad (5.2)$$
where $T_{i,Class1}$ and $T_{i,Class2}$ are the delays of Class 1 and Class 2 in the i-th task,
respectively. Note that the multiple ADCs operate in parallel with the other blocks' operations
because of the ADC's high latency (unlike the conceptual diagram in Fig. 5.2). The delay of
Class 4 is negligible as it is a simple scalar computation. Thus, Classes 3 and 4 are excluded
from (5.2).
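Equation (5.2) can be evaluated directly from the per-operation delays of Table 5.2. The sketch below is illustrative: the function and table names are hypothetical, the "none" operations are left out, and the worst-case period is taken simply as the largest Class 1 or Class 2 delay in the table.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>

// Per-operation delays in cycles, copied from Table 5.2 (1 cycle = 1 ns).
const std::map<std::string, int> kClass1Delay = {
    {"write", 2}, {"read", 2}, {"mr_read", 7},
    {"sr_read", 5}, {"mr_subt", 7}, {"mr_add", 7}};
const std::map<std::string, int> kClass2Delay = {
    {"avg", 6}, {"abs", 6}, {"square", 8},
    {"sign_dot", 14}, {"unsign_dot", 14}};

// Dynamic task period, Eq. (5.2): the slower of the Class 1 and Class 2 operations.
int dynamicTaskPeriod(const std::string& class1Op, const std::string& class2Op) {
    return std::max(kClass1Delay.at(class1Op), kClass2Delay.at(class2Op));
}

// Fixed period that covers the worst-case Class 1 / Class 2 combination.
int worstCaseTaskPeriod() {
    int worst = 0;
    for (const auto& op : kClass1Delay) worst = std::max(worst, op.second);
    for (const auto& op : kClass2Delay) worst = std::max(worst, op.second);
    return worst;
}
```

For the L1 template-matching task (mr_subt followed by abs), this gives a period of 7 cycles against a worst-case period of 14 cycles, which illustrates how a fixed worst-case period can halve throughput for the faster tasks.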
As shown in Fig. 5.3, TPi is calculated in the instruction register, which stores a delay
table per operation. The MAIN CTRLer enables the Class CTRLers with a period of TPi.
The Class 4 CTRLer is not synchronized to TPi as its delay is negligibly small, and thus it
starts as soon as the ADC operation is completed (Fig. 5.5).
[Figure 5.5 timing labels: for Task 1 and Task 2, Repeat 1-3 each run Class 1 followed by Class 2, with Class 3 conversions on ADC1-ADC3 overlapped with later repeats and Class 4 following the ADC.]
Figure 5.5: Timing diagram for MATI with DTP.
Listing 5.2: C++ interface for template matching (L1 distance).
/* X: template query image
 * W[j]: j'th candidate image
 * W.size() = ROW_NUM(W)*COL_NUM(W)
 * minVal, minLoc: outputs of template matching.
 * Best match is W[minLoc].
 */
templateMatching_L1(W, X, &minVal, &minLoc);
5.2.3.4 ISA to Program Mapping
We have built a library to map programmer-written C++ code to MATI tasks for each of the
benchmarks in Section 5.3.4. More specifically, each kernel of interest is written in a MATI
assembly language we have defined, and is wrapped into a library function that can be called
from C or C++. This allows the application programmer to simply use the library call in C or
C++ code without being exposed to hardware details. For example, Listing 5.2 shows the C++
code for calling the template matching library operation (after setting up the appropriate
data structures).
The library performs the following steps:
1. Generate Class 0 instruction: Class 0 and RPT_NUM are dependent on the algorithm
and input data, and are thus generated at runtime. For example, RPT_NUM
= (W.size()/128), ACC_NUM = (W.size()/128)/4. Similarly, B_PRD = ADC_FREQ =
(W.size()/128).
2. Generate Class 1-4 opcodes : Classes 1-4 opcodes for each benchmark kernel are inde-
pendent of the input data: they essentially encode the algorithm. For now, these have
been hand-coded in the library.
3. Generate MATI kernel : After generating Class 0, RPT_NUM and Classes 1-4 instruc-
tions, the library concatenates them and generates the MATI ISA representation of
the application kernel. This is then shipped to the MATI simulator. This is equivalent
to the host processor constructing the MATI kernel code after computing the Class 0
ﬁelds and then transferring the kernel code to the MATI accelerator for execution.
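The runtime generation of the Class 0 word in the steps above can be sketched as a bit-packing routine. This is an assumption-laden illustration rather than the library's code: Class0Fields, placeField, packClass0, and repeatCount are hypothetical names, the bit ranges are taken from Fig. 5.4(a), and the fields hold raw codes (how quantities such as B_PRD are mapped to 2-bit codes is not modeled).

```cpp
#include <cassert>
#include <cstdint>

// Raw field codes of Class 0; widths follow the bit ranges in Fig. 5.4(a).
struct Class0Fields {
    unsigned swing;     // [32:30]
    unsigned accNum;    // [29:28]
    unsigned aAdd;      // [27:19]
    unsigned bAdd1;     // [18:16]
    unsigned bAdd2;     // [15:13]
    unsigned bPrd;      // [12:11]
    unsigned adcFreq;   // [10:9]
    unsigned adcPrec;   // [8:6]
    unsigned des;       // [5:4]
    unsigned thresVal;  // [3:0]
};

// Places `value` into bits [msb:lsb] after checking that it fits.
std::uint64_t placeField(unsigned value, int msb, int lsb) {
    const int width = msb - lsb + 1;
    assert(value < (1u << width));
    return static_cast<std::uint64_t>(value) << lsb;
}

// Packs the ten fields into the 33-bit Class 0 word.
std::uint64_t packClass0(const Class0Fields& f) {
    return placeField(f.swing, 32, 30) | placeField(f.accNum, 29, 28) |
           placeField(f.aAdd, 27, 19) | placeField(f.bAdd1, 18, 16) |
           placeField(f.bAdd2, 15, 13) | placeField(f.bPrd, 12, 11) |
           placeField(f.adcFreq, 10, 9) | placeField(f.adcPrec, 8, 6) |
           placeField(f.des, 5, 4) | placeField(f.thresVal, 3, 0);
}

// RPT_NUM as computed by the library: one iteration per 128-element chunk.
unsigned repeatCount(unsigned wSize) { return wSize / 128; }
```

Concatenating this word with RPT_NUM and the hand-coded Class 1-4 opcodes would give the full task word that step 3 ships to the simulator.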
For the present, MATI kernels must be written as assembly code by hand. A full-ﬂedged
compiler back-end for MATI is under development.
5.2.4 Extension to Large-Scale Applications
MATI is well suited to 128-dimensional vector processing. Longer vectors can
be processed by repeating the 128-dimensional vector processing sequentially, setting
RPT_NUM = (W.size()/128) and other parameters as shown in Listing 5.2.
The multi-bank configuration can be employed to process long vectors in parallel by
distributing the vector elements across many banks (e.g., four neighboring banks). The
multi-bank configuration also offers a potential energy benefit by sharing one CTRLer across
multiple banks, amortizing the CTRLer energy overhead. This mode is enabled by assigning five
more bits in the instruction set: one bit turns on the multi-bank mode; the other four bits
define which bank is turned on.
Even larger and more complex applications such as AlexNet require frequent DRAM accesses.
However, several prior works [26, 85] minimized off-chip DRAM access through extensive
data reuse and achieved parallelization using many processing elements with local memory.
More specifically, it was shown that local memory (SRAM) access and processing
energy remain the dominant portion after applying the data reuse techniques [26]. In this
case, MATI has strong potential to improve system energy efficiency and throughput by
replacing those processing elements and local memory.
[Figure 5.6 flow, by level. Component level: circuit schematic and layout, SPICE simulation, and silicon-validated data, yielding behavior, delay, and energy. Architecture level: MATI arch. sim combining the analog core (Verilog-A with LUTs) and the digital CTRLer and RDL (Verilog, synthesis and place and route, netlist, SDF file, VCD file), driven by the instruction word and a small dataset, yielding throughput estimation, energy estimation, and arch.-level functional validation. Application level: C++ model (with behavior LUT) run on large and small datasets, yielding inference outputs (decisions) and app.-level accuracy validation.]
Figure 5.6: MATI validation methodology.
5.3 Validation Methodology
This section describes our methodology for validating MATI's energy, delay, and accuracy
beneﬁts. Validating MATI's beneﬁts is made challenging by its intrinsic analog mixed-signal
nature and by the fact that ML algorithms need to process large data sets in order to obtain
application-level accuracy. The key challenge lies in estimating the application-level metrics,
speciﬁcally accuracy, from component-level metrics in an eﬃcient manner.
Our methodology, shown in Fig. 5.6, addresses these challenges by: (1) developing
silicon-validated energy, delay, and behavioral models of MATI components in a TSMC 65 nm GP
process including analog non-idealities, (2) incorporating the delay and behavioral models
to develop timing- and behavior-accurate component-level Verilog-A models, (3)
incorporating these component-level Verilog models into a timing- and behavior-accurate
architectural-level MATI Verilog model (this model is employed to ensure correct
functionality and estimate accuracy over small data sets), and (4) developing a MATI C++ model
incorporating component-level behavioral models for verifying accuracy over large data sets.
Table 5.2: Energy and delay per operation (1 cycle = 1 ns).
Class  Operation      Delay (# of cycles)  Energy/bank (pJ)
1      write          2                    72.6
1      read           2                    33.4
1      mr_read        7                    61.4
1      sr_read        5                    22.6
1      mr_subt        7                    103.3
1      mr_add         7                    103.3
2      avg            6                    5.2
2      abs            6                    12.1
2      square         8                    37.9
2      sign_dot       14                   16.4
2      unsign_dot     14                   16.4
3      adc            2N+10                5.7
4      accumulation   4                    ≈ 0
4      mean           3                    ≈ 0
4      threshold      2                    ≈ 0
4      max            4                    ≈ 0
4      min            4                    ≈ 0
4      sigmoid        3                    ≈ 0
4      ReLu           3                    ≈ 0
Leakage energy per cycle (1 ns)            0.6
CTRLer energy per cycle (1 ns)             5.4
5.3.1 Component-Level Models
MATI comprises both analog (BCA, BLP, CBLP, sampler, ADC) and digital components
(CTRLer, RDL). The entire analog chain was post-layout simulated in SPICE in TSMC
65 nm GP process to obtain the energy and delay values as shown in Table 5.2. The total
energy and delay for the analog blocks were compared with the measured results reported in
[39] and the diﬀerences were found to be within 10% and 9%, respectively, thereby validating
the energy and delay numbers shown in Table 5.2. In order to capture the behavior of
the analog components in the presence of analog non-idealities, we conducted Monte-Carlo
SPICE simulations of all the analog components. For example, the variation of the MR-
READ process due to transistor threshold voltage mismatch was obtained in this manner.
Table 5.3: Benchmarks for MATI simulations [53,61,62].

Matched filtering
  Application: event (gun-shot) detection
  Database: gun-shot mono sound
  Data size (pixels / vector length): 8-bit, 256
  Problem size: 100 test vectors
  Instructions: Class 1: mr_read; Class 2: sign_dot; Class 3: adc; Class 4: threshold
  W: filter weights; X: test samples
  Optimal bit precision: 5 bits

Template matching (w/ L1 & L2)
  Application: face recognition
  Database: MIT-CBCL
  Data size (pixels / vector length): 8-bit, 16×16
  Problem size: 256 candidates
  Instructions: Class 1: mr_subt; Class 2: L1: abs, L2: square; Class 3: adc; Class 4: min
  W: candidate faces; X: test samples
  Comments: nearest candidate based on either L1 or L2 distance
  Optimal bit precision: 6 bits

Linear SVM
  Application: face detection
  Database: MIT-CBCL
  Data size (pixels / vector length): 8-bit, 16×16
  Problem size: 2 classes, 2000 training samples, 858 test samples
  Instructions: Class 1: mr_read; Class 2: sign_dot; Class 3: adc
  W: weights; X: test samples
  Comments: face data converted into a vector, linear SVM applied on it
  Optimal bit precision: 6 bits

k-NN (w/ L1 & L2)
  Application: hand-written character recognition
  Database: MNIST
  Data size (pixels / vector length): 8-bit, 16×16
  Problem size: 10 classes, 54210 training samples, 200 test samples
  Instructions: Class 1: mr_subt; Class 2: L1: abs, L2: square; Class 3: adc
  W: training samples; X: test samples
  Comments: sorting is done in an external processor after initial processing on MATI
  Optimal bit precision: 6 bits

Feature extraction (PCA)
  Application: face detection
  Database: MIT-CBCL
  Data size (pixels / vector length): 8-bit, 16×16
  Problem size: 2000 samples
  Instructions: Class 1: mr_read; Class 2: sign_dot; Class 3: adc
  W: weights; X: samples
  Comments: 4 features used for face detection based on PCA
  Optimal bit precision: 6 bits

Linear regression
  Application: modeling a linear predictor
  Database: synthetic data
  Data size (pixels / vector length): 8-bit, 2 dim.
  Problem size: 8192 samples
  Instructions: Class 1: mr_read; Class 2: T1, T2: avg, T3: square, T4: sign_dot;
    Class 3: adc; Class 4: accumulation
  Comments: 2-D linear regression, reformulated as tasks T1 to T4
  Optimal bit precision: 6 bits

Deep neural network (DNN)
  Application: hand-written character recognition
  Database: MNIST
  Data size (pixels / vector length): 8-bit, 22×23
  Problem size: 10 classes, 60000 training samples, 10000 test samples
  Instructions: Class 1: mr_read; Class 2: sign_dot; Class 3: adc
  W: weights; X: test samples
  Comments: 4-layer DNN with nodes as follows: 506-512-256-10
  Optimal bit precision: 6 bits
The impact of charge injection and coupling noise in the MR-READ, BLP, and CBLP stages
was captured via post-layout simulations. Behavioral models incorporating these non-ideal
analog eﬀects were extracted from SPICE data in the form of look-up tables (LUTs). These
behavioral LUTs and delays from Table 5.2 were then incorporated into timing and behavior
accurate component-level Verilog-A models.
Verilog models of all the digital components including CTRLer and RDL bank were devel-
oped and synthesized with the TSMC 65 nm GP library via the Synopsys Design Compiler.
Cadence Encounter was used to place and route the synthesized netlist. Gate-level Verilog
simulations were performed using the Standard Delay Format (SDF) ﬁle incorporating post-
layout delay information. Finally, the Value Change Dump (VCD) ﬁle obtained from Verilog
simulations was employed to obtain energy estimates through the Cadence Encounter.
[Figure 5.7 block labels: SRAM with bitcell array, L:1 column muxes, and sense amplifiers (SA), followed by pipeline latches (D) that feed the digital processor.]
Figure 5.7: Reference architecture (L = 4, Nrow = 512, Ncol = 256) showing a self-timed
SRAM communicating with a single-function digital processor over a pipelined interface.
The digital processor is synthesized to have a critical path delay Tdp less than memory
access delay TSRAM so that the latter dominates the throughput.
5.3.2 Application-Level Validation
Verilog and Verilog-A models described in Section 5.3.1 were integrated to obtain a cycle
and functionally accurate MATI Verilog model. This model was executed on small data
sets, e.g., RPT_NUM <10, to ensure that: (1) operations in all the classes are completed
within the period TPi, (2) the digital blocks generate the correct CTRL signals at the right
time in the presence of post-layout parasitics when presented with the appropriate MATI
instruction word, and (3) inter-block interfaces including digital CTRLers-to-analog and
analog-to-analog are correctly synchronized.
In addition, a functional MATI C++ model incorporating the LUT-based analog behavioral
models described in Section 5.3.1 was also developed. This C++ model was run on
large data sets to obtain MATI's application-level accuracy, and its outputs on small data
sets were compared with those of the MATI Verilog model to validate the latter. Finally, a
C++ model of the reference digital architecture (see Section 5.3.3) was developed in order to
obtain its application-level accuracy. This model's output was compared with the MATI C++
model output in order to estimate the impact of MATI's analog non-idealities.
5.3.3 Reference Architecture
We evaluate the energy, speed-up, and accuracy of MATI in one and four bank scenarios
with respect to: (1) the reference digital architecture (Fig. 5.7) with minimum bit precision
required per algorithm (in the last column of Table 5.3) (CONV-OPT), (2) CONV with
8-bit precision (CONV-8b), (3) CM [39], and (4) pipe-CM (pipelined version of CM).
The digital architecture (Fig. 5.7) consists of a standard SRAM with the pipelined interface
to a ﬁxed-function digital processor. In this chapter, pipe-CM was emulated using the
MATI component models. These ML applications are memory-bound (i.e., there is little
data reuse of Wjs during inference, as in many cases each Wj stored in SRAM is used only
once per given input X). Thus, we use the same number of banks for both MATI and the
digital ASIC to provide the same external bandwidth via the conventional SRAM interface
(1-bank and 4-bank scenarios with a 512×256 array). Note that MATI can employ more
parallel computations, since its in-memory computation is not limited by the external SRAM
interface.
On the other hand, the operand X is stored in the ODB and reused for both architectures.
We assume the SRAM is self-timed as doing so minimizes the read access delay [59, 60] of
the reference architecture. This delay was found to be 2 ns/64-bit fetch via post-layout
simulations of the SRAM. The SRAM read access energy was obtained from post-layout
simulations and found to be consistent with that in [39].
5.3.4 Benchmarks
The commonly employed ML algorithms listed in Table 5.3 were mapped to MATI. As MATI
is programmable, it employs 8-bit input data to cover diverse applications and algorithms.
This value of precision has been employed in many other implementations [5,56,57,86]. For
simplicity, Bx = Bw is assumed in this chapter. Though MATI's accuracy improves when
processing high-dimensional vectors as the aggregation step in CBLP has a noise-averaging
eﬀect, we assume a conservatively small vector dimension N = 256.
5.4 Evaluation Results
MATI executes a 128-dimensional vector operation within $T_P$, i.e., its throughput is
$f_{\mathrm{MATI}} = 128/T_P$. On the other hand, the energy consumption is estimated as follows:
$$E_{\mathrm{MATI}} = \sum_{i=1}^{4} E_{\mathrm{Class},i} + E_{\mathrm{LEAK}} + E_{\mathrm{CTRL}} \tag{5.3}$$
where $E_{\mathrm{Class},i}$ is the energy consumed by Class-$i$ instructions, and $E_{\mathrm{CTRL}}$ and
$E_{\mathrm{LEAK}}$ are the controller and leakage energies, respectively. From Table 5.2, the energies of
Classes 1 and 2 dominate. We ignore the controller (CTRLer) energy in the reference architecture but
include it in MATI's architecture.
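As a rough illustration of Eq. (5.3) and the throughput expression above, the following Python sketch sums the four instruction-class energies with the leakage and controller terms. The names `MatiEnergyModel` and `mati_throughput` are hypothetical, and the numbers are placeholders chosen for illustration, not entries from Table 5.2.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class MatiEnergyModel:
    """Energy terms of Eq. (5.3); all values share one unit (e.g., pJ)."""
    class_energies: Sequence[float]  # E_Class,1 ... E_Class,4
    leak: float                      # E_LEAK
    ctrl: float                      # E_CTRL

    def total(self) -> float:
        if len(self.class_energies) != 4:
            raise ValueError("Eq. (5.3) sums over exactly four instruction classes")
        return sum(self.class_energies) + self.leak + self.ctrl


def mati_throughput(t_p: float, n_dim: int = 128) -> float:
    """Vector elements processed per second: f_MATI = 128 / T_P."""
    if t_p <= 0:
        raise ValueError("T_P must be positive")
    return n_dim / t_p


# Placeholder numbers (assumptions), not measured values.
model = MatiEnergyModel(class_energies=[2.0, 1.5, 0.25, 0.25], leak=0.5, ctrl=0.5)
print(model.total())          # total energy in the chosen unit
print(mati_throughput(1e-7))  # elements per second for an assumed T_P of 100 ns
```

Keeping the class energies in a sequence mirrors the summation index in Eq. (5.3), so a per-class breakdown can be inspected before the terms are added.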
5.4.1 Comparison with CONV
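The reference architecture fetches $N_{\mathrm{COL}}/L$ bits per access (access time: $T_{\mathrm{SRAM}}$), and thus
the throughput is as follows:
$$f_{\mathrm{CONV}} = \left(\frac{N_{\mathrm{COL}}/L}{B_w}\right)\left(\frac{1}{T_{\mathrm{SRAM}}}\right) \tag{5.4}$$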
where the throughput is limited by the memory access. Figure 5.8(a) shows that MATI
provides a speed-up of 1.4× to 3.4× compared to CONV-OPT across the benchmarks for
the single-bank scenario. MATI's speed-up is the smallest for linear regression because it needs
to re-access the same SRAM data for every task, as analog data cannot be stored due to
leakage, whereas CONV stores the data in a local register and reuses it. Since increasing the
number of banks does not impact the relative speed-up (but improves the throughput of all the
architectures proportionally), we report the results for the single-bank case only.
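The throughput model of Eq. (5.4) can be sketched in a few lines of Python. The parameter values are assumptions read off the surrounding text: 256 columns for the 512×256 array, a column-mux ratio L = 4 so that one access returns 64 bits, 8-bit words, and the 2 ns access time. The function names are hypothetical.

```python
def conv_throughput(n_col: int, l_mux: int, b_w: int, t_sram: float) -> float:
    """Words fetched per second by the reference architecture, per Eq. (5.4)."""
    if min(n_col, l_mux, b_w) <= 0 or t_sram <= 0:
        raise ValueError("all parameters must be positive")
    words_per_access = (n_col / l_mux) / b_w   # (N_COL / L) / B_w
    return words_per_access / t_sram           # ... times 1 / T_SRAM


def speed_up(f_mati: float, f_conv: float) -> float:
    """Relative throughput of MATI over the reference architecture."""
    return f_mati / f_conv


# Assumed values: N_COL = 256, L = 4 (64-bit fetch), B_w = 8, T_SRAM = 2 ns.
f_conv = conv_throughput(n_col=256, l_mux=4, b_w=8, t_sram=2e-9)
print(f"{f_conv:.3e} words/s")
```

Under these assumptions each access delivers eight 8-bit words, which makes explicit why the reference design is bound by the SRAM interface rather than by its arithmetic.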
Figure 5.8(b) shows that MATI achieves 2.5×-to-4× energy savings compared to CONV-OPT
in the single-bank case (up to 5.5× in the four-bank case), thereby leading to an energy-delay
product (EDP) improvement ranging from 3.4× to 12.6× compared to CONV-OPT.
MATI's energy savings with respect to CONV-8b are smaller than those of CM in [39] for SVM,
template matching, k-NN, and matched filtering. This is because the self-timed CONV-8b in this
chapter (2 ns/64-bits) is faster than the conventional SRAM in [39] (9 ns/64-bits), enabling
CONV-8b to go into sleep mode sooner, thereby reducing leakage energy. From Fig. 5.9, it
is clear that the key reason for MATI's energy efficiency is that its MR-READ (Class 1)
and BLP/CBLP (Class 2) instructions execute in the low-swing analog domain.
[Figure 5.8 plots. Benchmarks: matched filtering, template matching (L1, L2), linear SVM,
k-NN (L1, L2), PCA, linear regression, DNN. (a) Speed-up: throughput ratios MATI/CONV-8b and
MATI/CONV-OPT. (b) Energy reduction: energy ratios CONV-8b/MATI and CONV-OPT/MATI for the
1-bank and 4-bank cases.]
Figure 5.8: MATI (with SWING = 111) compared to CONV in terms of: (a) speed-up,
and (b) energy savings.
[Figure 5.9 plot. Energy per word (pJ) of CONV-8b and MATI for SVM, template matching (L1),
and template matching (L2), broken down into READ, COMPUTATION, and CTRLer.]
Figure 5.9: Energy breakdown of MATI (with SWING = 111 in single bank) with respect
to CONV-8b.
5.4.2 Comparison with CM
We compare MATI with CM and pipe-CM in order to estimate the overhead due to pro-
grammability for two algorithms: SVM and template matching (L1). The energy and
throughput of CM and pipe-CM are estimated from Table 5.2 and Fig. 5.2. Figure 5.10
shows that both MATI and pipe-CM achieve a speed-up of 3.8× over CM. The reason why
MATI does not suﬀer from a throughput penalty as compared to pipe-CM is due to its use
of DTP assignment. This is clearly seen for template matching where MATI without DTP
leads to a loss of 2× in throughput when compared to pipe-CM. The throughput gain of
pipe-CM was found to result in a 5.5% energy savings over CM due to reduced leakage. In
spite of its operational diversity, MATI's energy overhead compared to pipe-CM is < 0.1%
as the BLP in both activates only one operation, and it leads to a 4.5% area overhead as
compared to CM, as the BCA in both dominates the area.
[Figure 5.10 plot. Speed-up over CM for SVM and template matching; bars: CM, pipe-CM,
MATI (w/o DTP), and MATI, with a 2× annotation.]
Figure 5.10: Speed-up factors of pipe-CM, MATI without DTP, and MATI (with DTP)
over CM [39].
[Figure 5.11 plots. Each panel shows energy per word (pJ) and detection accuracy versus SWING,
with the CONV accuracy marked for reference: (a) 32×32, (b) 32×32, (c) 1024, (d) multiple CONV
accuracies.]
Figure 5.11: Accuracy vs. energy trade-off of MATI enabled via the SWING parameter
for: (a) k-NN, (b) template matching (L1), (c) matched filtering, and (d) DNN. The
horizontal dotted line marks the accuracy of CONV-OPT for reference.
5.4.3 Energy vs. Accuracy Trade-oﬀ
MATI's analog processing provides an interesting energy vs. accuracy trade-off at the
application level. This is achieved via the SWING parameter and the vector size N in
the ISA, as shown in Fig. 5.11. Here, the detection accuracy p_det, defined as p_det =
(# of correct decisions)/(total # of decisions), is plotted together with the energy/word
(total energy / number of words processed per decision) for the template matching, k-NN,
matched filtering, and DNN algorithms. We see that multi-category classification algorithms
such as template matching need a higher SWING value, and hence more energy, to achieve the
same accuracy. Additionally, accuracy improves for the same SWING value as N increases. This
clearly indicates the noise-averaging effect of CBLP, and enables one to reduce energy as a function
of N. Finally, in the case of a DNN, the accuracy is less sensitive to SWING as the number
of hidden layers increases. This trade-off can be exploited to reduce the voltage swing and
obtain energy savings as a function of the network size. In practice, energy-optimal values of
SWING and ADC_PREC can be obtained by employing the training set in a calibration
mode. Potentially, this process can be automated by the compiler via auto-tuning [87].
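The calibration mode described above can be sketched as a small search: run the training set at each SWING setting, record p_det and the energy/word, and keep the lowest-energy setting that still meets an accuracy target. This is a minimal sketch under stated assumptions; the `CalPoint` record, the `pick_swing` helper, and the calibration numbers are all hypothetical, and ADC_PREC could be added as a second search axis in the same way.

```python
from typing import Iterable, NamedTuple, Optional


class CalPoint(NamedTuple):
    swing: int              # SWING code, written in binary as in the text (e.g., 0b111)
    correct: int            # number of correct decisions on the training set
    total: int              # total number of decisions
    energy_per_word: float  # energy/word in pJ


def p_det(correct: int, total: int) -> float:
    """Detection accuracy: (# of correct decisions) / (total # of decisions)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total


def pick_swing(points: Iterable[CalPoint], target: float) -> Optional[CalPoint]:
    """Return the lowest-energy setting whose accuracy meets the target, if any."""
    feasible = [p for p in points if p_det(p.correct, p.total) >= target]
    if not feasible:
        return None
    return min(feasible, key=lambda p: p.energy_per_word)


# Hypothetical calibration results, for illustration only.
points = [
    CalPoint(0b100, 880, 1000, 0.5),
    CalPoint(0b101, 930, 1000, 0.6),
    CalPoint(0b110, 960, 1000, 0.8),
    CalPoint(0b111, 970, 1000, 1.0),
]
best = pick_swing(points, target=0.95)
print(bin(best.swing) if best else "no setting meets the target")
```

Returning `None` when no setting qualifies keeps the infeasible case explicit, which is where an auto-tuner would have to relax the target or raise the ADC precision.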
Figure 5.11 also predicts MATI's accuracy in advanced process technology nodes, where a
drop in BL swing has a similar eﬀect on the accuracy of its analog computations as process
variations. It shows that MATI's pdet degrades gracefully for all benchmarks in spite of a
50% drop in BL swing (SWING = 111 to 100).
5.5 Conclusion
This chapter has proposed MATI, a programmable ISA based on the multi-functional DIMA
called Compute Memory (CM) [39]. Though we successfully mapped nine popular ML
algorithms across computer vision and deep learning applications on MATI, we believe many
more algorithms such as the recurrent neural network, random forest, and sparse distributed
memory can be mapped as well with minor modiﬁcation in the instruction set and supporting
blocks. This work also sets the stage for developing compiler support for MATI and for
extending MATI to address a diverse set of large-scale applications.
Chapter 6
CONCLUSIONS AND FUTURE WORK
Emerging applications in areas such as health care, social networks, smart infrastructure,
surveillance/monitoring, and others leverage the ubiquitous presence of sensing and data storage
to generate massive data volumes. These data sets are being subjected to ML techniques
to extract informative patterns of interest. The implementation of energy-efficient ML in
silicon is challenging, especially as such algorithms require processing of large data
volumes, as reported in recent IC implementations [4, 20, 88]. While much of the work in the
area of energy-efficient ML accelerators has focused on digital architectures, this dissertation
explores a unique alternative, the deep in-memory architecture (DIMA), where the high
energy and latency costs of data movement between processor and memory are addressed
by embedding mixed-signal computations deeply into the periphery of the memory core.
6.1 Dissertation Contributions
This dissertation first identifies the common functional flow across a diverse set of ML
algorithms and maps it to the sequential flow of the four processing stages in DIMA: (1)
functional read (FR) for data access, (2) bitline processing (BLP) for element-wise distance
calculations, (3) cross BLP (CBLP) for aggregation of vector elements, and (4) thresholding for
decision making. This dissertation also presents a system-level rationale to show the DIMA's
robustness despite low-SNR processing caused by tightly pitch-matched analog processing.
This is explained in terms of three principles: delayed decision, non-uniform bit protection,
and aggregation.
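The four-stage flow can be pictured with an idealized, purely digital Python sketch. It captures only the functional ordering (read, element-wise operation, aggregation, threshold) and none of the low-swing analog behavior or noise; the function name `dima_flow` and the three element-wise operations offered here are illustrative assumptions.

```python
from typing import Callable, Dict, Sequence, Tuple

# Element-wise operations standing in for the BLP stage (illustrative choices).
ELEMENT_OPS: Dict[str, Callable[[int, int], int]] = {
    "dot": lambda w, x: w * x,         # inner product term
    "l1": lambda w, x: abs(w - x),     # Manhattan distance term
    "l2": lambda w, x: (w - x) ** 2,   # squared Euclidean distance term
}


def dima_flow(stored_w: Sequence[int], x: Sequence[int],
              op: str, threshold: int) -> Tuple[int, bool]:
    """Idealized functional flow: FR -> BLP -> CBLP -> thresholding."""
    if len(stored_w) != len(x):
        raise ValueError("operands must have the same dimension")
    words = list(stored_w)                                       # (1) FR: data access
    elems = [ELEMENT_OPS[op](w, xi) for w, xi in zip(words, x)]  # (2) BLP: element-wise
    aggregate = sum(elems)                                       # (3) CBLP: aggregation
    return aggregate, aggregate >= threshold                     # (4) thresholding


print(dima_flow([1, 2, 3], [1, 0, 5], "l1", threshold=3))
```

The comparison direction in the last stage is left to the caller's interpretation: for distance metrics a small aggregate indicates a match, while for an inner product a large one does.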
Design guidelines are provided for key parameters, such as the pulse width and amplitude
of the WL enable signal in the FR stage and the capacitor size for the BLP and CBLP stages, based
on an analysis of various noise sources such as process variation, charge injection, thermal
noise, and coupling noise. Design techniques such as sub-ranged read and processing, replica
bitcells, and shielding lines are also provided for further improvement of accuracy, energy, and
throughput.
Energy, delay, and mixed-signal circuit behavioral models are introduced to predict the
energy and delay trends and application level's accuracy as a function of major design
parameters. The HSPICE simulations in a 65 nm process show that the maximum error
of the behavioral models is 4.4% of the dynamic range of CBLP output, but the relative
magnitude of outputs was maintained. These models are expected to be suﬃciently accurate
as the relative vector distance is important in the ML algorithms. The predicted accuracy
of pattern matching in terms of probability of detection shows a maximum prediction error
of 0.066 with vector dimension 256.
The multi-functional DIMA prototype IC was successfully fabricated with a 16 KB SRAM
array in a 65 nm process, achieving up to 31× (56× in the multi-bank scenario) smaller energy-delay
product with 5.8× and 5.4× smaller delay and energy, respectively, in four algorithms:
support vector machine, template matching, k-nearest neighbor, and matched filter. The
random forest (RF) prototype IC also achieves 3.1× energy savings and a 2.2× speed-up, thereby
providing a 6.8× lower EDP at the same accuracy of > 93% compared to a conventional
digital architecture, leading to a throughput of 364 K decisions/s and an energy efficiency of
19.4 nJ/decision for an eight-class traffic sign recognition problem. Measurement results
also show a clear trade-off between the application accuracy and energy consumption, obtained by
changing the BL voltage swing ∆VBL, which is the dominant control knob of DIMA's
energy.
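Because EDP is the product of energy and delay, the improvement factors quoted above multiply. The short Python check below illustrates this, assuming the quoted energy and delay factors refer to the same operating point (the "up to" wording does not guarantee that); `edp_gain` is a hypothetical helper name.

```python
def edp_gain(energy_gain: float, delay_gain: float) -> float:
    """EDP = energy x delay, so the two improvement factors multiply."""
    return energy_gain * delay_gain


# Multi-functional prototype: 5.4x energy and 5.8x delay improvements.
print(round(edp_gain(5.4, 5.8), 1))
# Random forest prototype: 3.1x energy savings and 2.2x speed-up.
print(round(edp_gain(3.1, 2.2), 1))
```

The products land close to the reported 31× and 6.8× EDP figures, which suggests the individual factors are quoted at (or near) the same operating point.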
The DIMA's benefits were also demonstrated for more complex ML algorithms. The DIMA-based
CNN achieves roughly 5× energy savings, mostly from the low-power inner product computation
and the reduced leakage energy due to high throughput. Overall, a 24.5× smaller EDP
is achieved as compared to the conventional system, with a 0.02% larger error rate in a 10-class
handwritten character recognition problem.
This dissertation also co-optimizes the algorithm and DIMA's architecture
rather than simply adopting the DIMA platform. As a result, circuit and system simulations
in a 65 nm CMOS process show that the DIMA-based SDM reduces energy and delay
simultaneously by factors of up to 25× and 12×, respectively, over the conventional SDM
architecture in the auto- and hetero-associative modes with negligible accuracy loss (≤0.4%).
This dissertation also extended DIMA to a programmable ISA called MATI. Employing
silicon-validated energy, delay and behavioral models of deep in-memory components, we
demonstrate that MATI is able to realize nine ML benchmarks while incurring negligible
overhead in energy (< 0.1%), area (4.5%), and throughput over ﬁxed-function DIMA [39].
In this process, MATI is able to simultaneously achieve enhancements in both energy (2.5×
to 5.5×) and throughput (1.4× to 3.4×) for an overall EDP improvement of up to 12.6×
over ﬁxed-function digital architectures.
6.2 Future Work
The DIMA platform has strong future potential, as described below.
6.2.1 Extension of DIMA to Other Memory Technologies
In this dissertation, SRAM has been the focus for demonstrating the benefits of the DIMA
platform, as it is closest to the processor in the memory hierarchy. However, the DIMA
paradigm has strong potential for other memory technologies. This is especially true for
high-density storage systems such as NAND and NOR flash, where the cost of data movement
is significantly high. However, in such storage systems, the pitch-matching constraint is
expected to be more severe due to the smaller memory bitcell dimension (e.g., 4F^2). In
addition, those memory systems rely on current-based sensing schemes, whereas an SRAM
exploits the BL voltage swing. The serially connected NAND array structure presents an
additional challenge, as the FR mechanism in this dissertation assumes a parallel-connected
bitcell array. Therefore, circuit innovations are required for novel FR and BLP stages to
overcome the differences from the SRAM topology. The FR operation along with multi-level-cell
(MLC)-based NAND ﬂash can be a great opportunity to achieve further throughput beneﬁt.
Extensions to emerging memory technologies such as PRAM and RRAM are expected to be
the natural next step as those use the diﬀerence in the threshold voltage and resistance of
bitcells, respectively, similar to the ﬂash memory.
6.2.2 Circuit-Level Techniques for Enhanced Read, Write, and Processing
The DIMA achieves signiﬁcant energy and throughput beneﬁts during the read operations.
This is the optimal choice for the ML inference, where a large number of memory accesses
are required whereas the write operation is infrequent. On the other hand, the ML training
requires not only reading but also writing to update the coeﬃcients. An energy-eﬃcient
write technique is needed to apply the DIMA with on-chip training.
The current FR processes 4 bits per BL and covers 8-bit precision with the sub-ranged
read technique. The sub-ranged read causes significant (e.g., 2×) throughput degradation,
limiting bit precision. A new FR mechanism or sub-ranged read technique is expected to address
the limited bit-precision issue. The separate read and write paths of the 8T SRAM bitcell could
provide an opportunity for such a new FR mechanism.
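To make the sub-ranged read concrete, the Python sketch below splits an 8-bit word into two 4-bit halves, one per 4-bit FR, and recombines them. The 16:1 weighting is simply the binary weight of the upper nibble and is assumed here as the merge rule; the function names are hypothetical, and the real merge happens in the analog domain rather than on integers.

```python
from typing import Tuple


def split_word(word: int) -> Tuple[int, int]:
    """Split an 8-bit unsigned word into (MSB nibble, LSB nibble)."""
    if not 0 <= word <= 0xFF:
        raise ValueError("expected an 8-bit unsigned word")
    return word >> 4, word & 0xF


def merge_subranges(msb_read: float, lsb_read: float) -> float:
    """Recombine two 4-bit reads with an assumed 16:1 weighting."""
    return 16 * msb_read + lsb_read


msb, lsb = split_word(0xB7)
print(msb, lsb, merge_subranges(msb, lsb))
```

Each 8-bit word therefore costs two 4-bit reads, which is one way to see where a throughput penalty of the order quoted above comes from.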
The DIMA assumes that computations require two operands: (1) digital data stored in
the memory and (2) streamed-in digital data in the operand buffer. However, many systems
accept the sensory input as analog data and convert it into digital format through an expensive
analog-to-digital conversion (ADC) process. If the DIMA can accept both analog and digital
inputs, it will open broad opportunities for sensory systems.
Although most ML algorithms require the aggregation step for dimensionality-reduction,
there are several exceptions, such as belief propagation-based algorithms [89]. These al-
gorithms will require cascaded BLP stages, where input and output dynamic ranges and
accumulated noise sources need to be managed by novel analog circuit techniques.
6.2.3 Error Resiliency
The robustness of DIMA heavily relies on the inherent error resiliency from the aggregation,
where random noise sources are eﬀectively averaged out. However, a trade-oﬀ between the
energy and accuracy was demonstrated by the measurement results in Chapter 3. This
indicates that there is a potential to push DIMA's energy savings further by statistical error
compensation techniques. The techniques are particularly required when the number of
elements in the aggregation is not large enough or unreliable emerging device technology is
employed.
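The averaging argument can be illustrated with a short Python sketch. It assumes independent, zero-mean Gaussian noise of equal standard deviation on each of the N aggregated elements, which is an idealization of the actual noise sources; under that assumption the noise on the average shrinks as sigma/sqrt(N). The function names are hypothetical.

```python
import math
import random
import statistics


def averaged_noise_std(sigma: float, n: int) -> float:
    """Std. deviation of the mean of n i.i.d. noise terms, each with std. sigma."""
    if n <= 0:
        raise ValueError("n must be positive")
    return sigma / math.sqrt(n)


def simulate_averaged_noise_std(sigma: float, n: int,
                                trials: int = 2000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity (seeded for repeatability)."""
    rng = random.Random(seed)
    means = [sum(rng.gauss(0.0, sigma) for _ in range(n)) / n
             for _ in range(trials)]
    return statistics.pstdev(means)


for n in (16, 64, 256):
    print(n, averaged_noise_std(1.0, n), round(simulate_averaged_noise_std(1.0, n), 4))
```

The same relation shows why small N is the difficult case: with few elements there is little averaging, so additional error compensation has to make up the difference.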
Employing some redundancy in the stored data is one of the straightforward approaches
to error resiliency. The redundancy can be applied at the bit level by storing
the MSBs in multiple locations. Alternatively, the redundancy can be applied to the
important features after feature extraction by principal component analysis (PCA) or
non-negative matrix factorization (NMF). The DIMA with an on-chip trainer will be an
interesting direction to handle both inter- and intra-chip variations with aggressively
low-SNR operations. It will also be beneficial for managing transient noise sources such as
temperature, input statistics, and supply voltage fluctuation.
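One way to picture bit-level redundancy is the Python sketch below, which protects only the MSBs of a word by taking a per-bit majority over several noisy copies while the LSBs come from a single copy. Majority voting is an illustrative combining rule chosen here, not a method proposed in the text (an analog implementation might average instead), and `read_with_msb_vote` is a hypothetical name.

```python
from typing import Sequence


def read_with_msb_vote(reads: Sequence[int], n_msb: int = 4, width: int = 8) -> int:
    """Majority-vote the top n_msb bits across copies; take LSBs from the first copy."""
    if len(reads) % 2 == 0:
        raise ValueError("use an odd, non-zero number of copies")
    if not 0 < n_msb <= width:
        raise ValueError("n_msb must be in 1..width")
    n_lsb = width - n_msb
    result = reads[0] & ((1 << n_lsb) - 1)      # unprotected LSBs
    for bit in range(n_lsb, width):             # protected MSB positions
        ones = sum((r >> bit) & 1 for r in reads)
        if 2 * ones > len(reads):
            result |= 1 << bit
    return result


# The first copy has a flipped MSB and noisy LSBs; the vote repairs only the MSBs.
noisy = [0b00110110, 0b10110101, 0b10110101]
print(bin(read_with_msb_vote(noisy)))
```

Spending the extra storage only on the MSBs follows the non-uniform bit protection principle: an MSB error shifts the value far more than an LSB error does.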
6.2.4 Architecture
The programmable DIMA, MATI, gives software the flexibility to trade off application accuracy
for latency and energy via the bit precision of the ADC output and the WL voltage level
(VWL) that controls the BL swing, respectively. In this dissertation, energy-optimal values of these
hardware-level parameters can be obtained with the training data set in a calibration stage.
However, we envisage that compiler or auto-tuner techniques can be used to map high-level
application metrics (e.g., accuracy of decisions) to the parameters. The compiler-based
automation techniques can be extended to optimize the sequence of instructions to minimize
the BL discharge and the movement of operands. It will also be an interesting direction
to combine the flash- and SRAM-based DIMA architectures for synergistic use cases. Further
optimized architectures for large neural networks such as AlexNet and long short-term
memory (LSTM) can be considered as well.
REFERENCES
[1] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser,
I. Antonoglou, V. Panneershelvam, M. Lanctot et al., "Mastering the game of
Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484-489,
2016.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional
neural networks," in Advances in Neural Information Processing Systems, 2012,
pp. 1097-1105.
[3] J. Baliga, R. W. Ayre, K. Hinton, and R. S. Tucker, "Green cloud computing: Balancing
energy in processing, storage, and transport," Proceedings of the IEEE, vol. 99, no. 1,
pp. 149-167, 2011.
[4] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable
accelerator for deep convolutional neural networks," IEEE Journal of Solid-State
Circuits, vol. 52, no. 1, pp. 127-138, 2017.
[5] H. Kaul, M. A. Anders, S. K. Mathew, G. Chen, S. K. Satpathy, S. K. Hsu, A. Agarwal,
and R. K. Krishnamurthy, "A 21.5 M-query-vectors/s 3.37 nJ/vector reconfigurable
k-nearest-neighbor accelerator with adaptive precision in 14nm tri-gate CMOS," in 2016
IEEE International Solid-State Circuits Conference (ISSCC), 2016, pp. 260-261.
[6] S. Park, K. Bong, D. Shin, J. Lee, S. Choi, and H.-J. Yoo, "A 1.93 TOPS/W scalable
deep learning/inference processor with tetra-parallel MIMD architecture for big-data
applications," in 2015 IEEE International Solid-State Circuits Conference-(ISSCC) Digest
of Technical Papers, 2015, pp. 1-3.
[7] K. Kim, S. Lee, J.-Y. Kim, M. Kim, and H.-J. Yoo, "A 125 GOPS 583 mW network-on-chip
based parallel processor with bio-inspired visual attention engine," IEEE Journal
of Solid-State Circuits, vol. 44, no. 1, pp. 136-147, 2009.
[8] J.-Y. Kim, M. Kim, S. Lee, J. Oh, K. Kim, and H.-J. Yoo, "A 201.4 GOPS 496 mW
real-time multi-object recognition processor with bio-inspired neural perception engine,"
IEEE Journal of Solid-State Circuits, vol. 45, no. 1, pp. 32-45, 2010.
[9] J. Oh, G. Kim, B.-G. Nam, and H.-J. Yoo, "A 57 mW 12.5 µJ/Epoch embedded mixed-mode
neuro-fuzzy processor for mobile real-time object recognition," IEEE Journal of
Solid-State Circuits, vol. 48, no. 11, pp. 2894-2907, 2013.
[10] M. Price, J. Glass, and A. P. Chandrakasan, "A scalable speech recognizer with
deep-neural-network acoustic models and voice-activated power gating," in Solid-State Circuits
Conference (ISSCC), 2017 IEEE International, 2017, pp. 244-245.
[11] P. N. Whatmough, S. K. Lee, H. Lee, S. Rama, D. Brooks, and G.-Y. Wei, "A 28nm SoC
with a 1.2 GHz 568nJ/prediction sparse deep-neural-network engine with >0.1 timing
error rate tolerance for IoT applications," in Solid-State Circuits Conference (ISSCC),
2017 IEEE International, 2017, pp. 242-243.
[12] B. Moons, R. Uytterhoeven, W. Dehaene, and M. Verhelst, "Envision: A
0.26-to-10TOPS/W subword-parallel dynamic-voltage-accuracy-frequency-scalable convolutional
neural network processor in 28nm FDSOI," in Solid-State Circuits Conference
(ISSCC), 2017 IEEE International, 2017, pp. 246-247.
[13] K. Bong, S. Choi, C. Kim, S. Kang, Y. Kim, and H.-J. Yoo, "A 0.62 mw ultra-low-power
convolutional-neural-network face-recognition processor and a CIS integrated
with always-on haar-like face detector," in Solid-State Circuits Conference (ISSCC),
2017 IEEE International, 2017, pp. 248-249.
[14] R. Koontz, "Dron applications," http://www.kidsdiscover.com/teacherresources/drones-uavs-rescue.
[15] PRWeb, "Stretch sensors," http://www.prweb.com/releases/2014/03/prweb11661093.htm.
[16] IMEC, "EEG headset," https://www.pinterest.com/pin/71705819044534518.
[17] Auburn Fire Department, "FOX NEWS," http://insider.foxnews.com/2015/07/03/watch-drone-helps-deliver-life-jacket-boys-stranded-maine-river.
[18] MC10, "Flexible sensors," https://www.mc10inc.com/our-products/biostamprc.
[19] M. Horowitz, "Computing's energy problem (and what we can do about it)," in IEEE
Int. Solid-State Circuits Conf. (ISSCC), February 2014, pp. 10-14.
[20] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, "Diannao: A
small-footprint high-throughput accelerator for ubiquitous machine-learning," in ACM
Sigplan Notices, vol. 49, no. 4, 2014, pp. 269-284.
[21] J. Backus, "Can programming be liberated from the von Neumann style?: A functional
style and its algebra of programs," Communications of the ACM, vol. 21, no. 8, pp.
613-641, 1978.
[22] A. Firoozshahian, Smart Memories: A Reconfigurable Memory System Architecture.
ProQuest, 2009.
[23] K. Mai, T. Paaske, N. Jayasena, R. Ho, W. J. Dally, and M. Horowitz, "Smart memories:
A modular reconfigurable architecture," in ACM SIGARCH Computer Architecture
News, vol. 28. ACM, 2000, pp. 161-171.
[24] K. Mai, R. Ho, E. Alon, D. Liu, Y. Kim, D. Patil, and M. A. Horowitz, "Architecture
and circuit techniques for a 1.1-GHz 16-kb reconfigurable memory in 0.18-µm CMOS,"
vol. 40, no. 1, pp. 261-275, 2005.
[25] D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis,
R. Thomas, and K. Yelick, "Intelligent ram (IRAM): Chips that remember and compute,"
in Solid-State Circuits Conference, 1997. Digest of Technical Papers. 43rd
ISSCC., 1997 IEEE International. IEEE, 1997, pp. 224-225.
[26] Y.-H. Chen, J. Emer, and V. Sze, "Eyeriss: A spatial architecture for energy-efficient
dataflow for convolutional neural networks," in International Symposium on Computer
Architecture (ISCA), 2016, pp. 367-379.
[27] D. Ernst, N. S. Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler, D. Blaauw, T. Austin,
K. Flautner et al., "RAZOR: A low-power pipeline based on circuit-level timing speculation,"
in Microarchitecture, 2003. MICRO-36. Proceedings. 36th Annual IEEE/ACM
International Symposium on, 2003, pp. 7-18.
[28] J. P. Kulkarni, K. Kim, and K. Roy, "A 160 mv robust Schmitt trigger based subthreshold
SRAM," IEEE Journal of Solid-State Circuits, vol. 42, no. 10, pp. 2303-2313,
2007.
[29] B. Zhai, D. Blaauw, D. Sylvester, and S. Hanson, "A sub-200mv 6T SRAM in 0.13
µm cmos," in Solid-State Circuits Conference, 2007. ISSCC 2007. Digest of Technical
Papers. IEEE International. IEEE, 2007, pp. 332-606.
[30] F. Frustaci, M. Khayatzadeh, D. Blaauw, D. Sylvester, and M. Alioto, "SRAM for
error-tolerant applications with dynamic energy-quality management in 28 nm CMOS,"
IEEE Journal of Solid-State Circuits, vol. 50, no. 5, pp. 1310-1323, 2015.
[31] F. Frustaci, D. Blaauw, D. Sylvester, and M. Alioto, "Approximate SRAMs with dynamic
energy-quality management," IEEE Transactions on Very Large Scale Integration
(VLSI) Systems, vol. 24, no. 6, pp. 2128-2141, 2016.
[32] H. J. Mattausch, T. Gyohten, Y. Soda, and T. Koide, "Compact associative-memory
architecture with fully parallel search capability for the minimum Hamming distance,"
vol. 37, no. 2, pp. 218-227, 2002.
[33] Y. Oike, M. Ikeda, and K. Asada, "A high-speed and low-voltage associative co-processor
with exact Hamming/Manhattan-distance estimation using word-parallel and hierarchical
search architecture," vol. 39, no. 8, pp. 1383-1387, 2004.
[34] R. Genov and G. Cauwenberghs, "Kerneltron: Support vector 'machine' in silicon,"
vol. 14, no. 5, pp. 1426-1434, 2003.
[35] M. Kang, M.-S. Keel, N. R. Shanbhag, S. Eilert, and K. Curewitz, "An energy-efficient
VLSI architecture for pattern recognition via deep embedding of computation in
SRAM," in IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), May 2014, pp. 8326-8330.
[36] M. Kang, S. K. Gonugondla, M.-S. Keel, and N. R. Shanbhag, "An energy-efficient
memory-based high-throughput VLSI architecture for Convolutional Networks," in
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
May 2015.
[37] M. Kang, E. P. Kim, M.-S. Keel, and N. R. Shanbhag, "Energy-efficient and high
throughput sparse distributed memory architecture," in IEEE International Symposium
on Circuits and Systems (ISCAS), June 2015.
[38] M. Kang and N. R. Shanbhag, "In-memory computing architectures for sparse distributed
memory," IEEE Transactions on Biomedical Circuits and Systems, vol. 10,
no. 4, pp. 855-863, 2016.
[39] M. Kang, S. Gonugondla, A. Patil, and N. Shanbhag, "A 481pJ/decision 3.4M decision/s
multifunctional deep in-memory inference processor using standard 6T SRAM array,"
arXiv preprint arXiv:1610.07501, 2016.
[40] J. Zhang, Z. Wang, and N. Verma, "A machine-learning classifier implemented in a
standard 6T SRAM array," in IEEE Symposium on VLSI Circuits (VLSI-Circuits),
2016, pp. 1-2.
[41] J. Zhang, Z. Wang, and N. Verma, "In-memory computation of a machine-learning
classifier in a standard 6T SRAM array," IEEE Journal of Solid-State Circuits, 2017.
[42] R. G. Dreslinski, M. Wieckowski, D. Blaauw, D. Sylvester, and T. Mudge, "Near-threshold
computing: Reclaiming Moore's law through energy efficient integrated circuits,"
Proceedings of the IEEE, vol. 98, no. 2, pp. 253-266, 2010.
[43] R. Hegde and N. R. Shanbhag, "Soft digital signal processing," IEEE Transactions on
Very Large Scale Integration (VLSI) Systems, vol. 9, no. 6, pp. 813-823, 2001.
[44] K. J. Kuhn, "Reducing variation in advanced logic technologies: Approaches to process
and design for manufacturability of nanoscale cmos," in Electron Devices Meeting, 2007.
IEDM 2007. IEEE International. IEEE, 2007, pp. 471-474.
[45] D. Bankman and B. Murmann, "An 8-bit, 16 input, 3.2 pj/op switched-capacitor dot
product circuit in 28-nm fdsoi cmos," in IEEE Asian Solid-State Circuits Conference
(A-SSCC), 2016, pp. 21-24.
[46] S. Assefa, S. Shank, W. Green, M. Khater, E. Kiewra, C. Reinholm, S. Kamlapurkar,
A. Rylyakov, C. Schow, F. Horst et al., "A 90nm CMOS integrated nano-photonics
technology for 25Gbps WDM optical communications applications," in Electron Devices
Meeting (IEDM), 2012 IEEE International, 2012, pp. 33-38.
[47] M. Powell, S.-H. Yang, B. Falsafi, K. Roy, and T. N. Vijaykumar, "Gated-VDD: A
circuit technique to reduce leakage in deep-submicron cache memories," in Int. Symp.
Low Power Electronics and Design. (ISLPED), 2000, pp. 90-95.
[48] H. Qin, Y. Cao, D. Markovic, A. Vladimirescu, and J. Rabaey, "SRAM leakage suppression
by minimizing standby supply voltage," in International Symposium on Quality
Electronic Design (ISQED), 2004, pp. 55-60.
[49] M. Yamaoka et al., "A 300-MHz 25-µA/Mb-leakage on-chip SRAM module featuring
process-variation immunity and low-leakage-active mode for mobile-phone application
processor," vol. 40, no. 1, pp. 186-194, 2005.
[50] A. Papoulis and S. U. Pillai, Probability, Random Variables, and Stochastic Processes.
Tata McGraw-Hill Education, 2002.
[51] C. E. Clark, "The greatest of a finite set of random variables," Operations Research,
vol. 9, no. 2, pp. 145-162, 1961.
[52] H. Jeon, Y.-B. Kim, and M. Choi, "Offset voltage analysis of dynamic latched comparator,"
in IEEE International Midwest Symposium on Circuits and Systems (MWSCAS),
2011, pp. 1-4.
[53] "Center for Biological and Computational Learning (CBCL) at MIT," 2000,
http://cbcl.mit.edu/software-datasets/index.html.
[54] M. Kang, S. Gonugondla, and N. R. Shanbhag, "A 19.4 nJ/decision 364 K decisions/s
in-memory random forest classifier in 6T SRAM array," in IEEE European Solid-State
Circuits Conference (ESSCIRC), 2017.
[55] F. Arnaud, F. Boeuf, F. Salvetti, D. Lenoble, F. Wacquant, C. Regnier, P. Morin,
N. Emonet, E. Denis, J. Oberlin et al., "A functional 0.69 µm^2 embedded 6t-sram bit
cell for 65nm cmos platform."
[56] D. Silver, A. Huang, et al., "Mastering the game of Go with deep neural networks
and tree search," Nature, vol. 529, pp. 484-503, 2016.
[57] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, "Deep learning with
limited numerical precision," in ICML, 2015, pp. 1737-1746.
[58] Z. Zhou, B. Pain, and E. R. Fossum, "CMOS active pixel sensor with on-chip successive
approximation analog-to-digital converter," IEEE Transactions on Electron Devices,
vol. 44, no. 10, pp. 1759-1763, 1997.
[59] S. C. Chung, "Circuits and methods of a self-timed high speed SRAM," Nov. 10 2015,
US Patent 9,183,897.
[60] E. Karl, Y. Wang, Y.-G. Ng, Z. Guo, F. Hamzaoglu, M. Meterelliyoz, J. Keane, U. Bhattacharya,
K. Zhang, K. Mistry et al., "A 4.6 GHz 162 Mb SRAM design in 22 nm tri-gate
CMOS technology with integrated read and write assist circuitry," IEEE Journal
of Solid-State Circuits, vol. 48, no. 1, pp. 150-158, 2013.
[61] Y. LeCun and C. Cortes, "MNIST handwritten digit database," AT&T Labs. Available:
http://yann.lecun.com/exdb/mnist, 2010.
[62] Production Crate, "Gun Shot Sounds," http://soundscrate.com/gun-related.
[63] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5-32, 2001.
[64] J. Park, J. Kwon, J. Oh, S. Lee, J.-Y. Kim, and H.-J. Yoo, "A 92-mw real-time traffic sign
recognition system with robust illumination adaptation and support vector machine,"
IEEE Journal of Solid-State Circuits, vol. 47, no. 11, pp. 2711-2723, 2012.
[65] D. Strigl, K. Kofler, and S. Podlipnig, "Performance and scalability of GPU-based convolutional
neural networks," in IEEE Euromicro International Conference on Parallel,
Distributed and Network-Based Processing (PDP), February 2010, pp. 317-324.
[66] C. Farabet, C. Poulet, J. Y. Han, and Y. LeCun, "CNP: An FPGA-based processor
for convolutional networks," in IEEE International Conference on Field Programmable
Logic and Applications (FPL), August 2009, pp. 32-37.
[67] J. M. Cruz-Albrecht, M. W. Yung, and N. Srinivasa, "Energy-efficient neuron, synapse
and STDP integrated circuits," IEEE Trans. Biomedical Circuits and Systems, vol. 6,
no. 3, pp. 246-256, 2012.
[68] S. Brink, S. Nease, P. Hasler, S. Ramakrishnan, R. Wunderlich, A. Basu, and B. Degnan,
"A learning-enabled neuron array IC based upon transistor channel models of biological
phenomena," IEEE Trans. Biomedical Circuits and Systems, vol. 7, no. 1, pp. 71-81,
2013.
[69] S. Ramakrishnan, R. Wunderlich, J. Hasler, and S. George, "Neuron array with plastic
synapses and programmable dendrites," IEEE Trans. Biomedical Circuits and Systems,
vol. 7, no. 5, pp. 631-642, 2013.
[70] V. Garg, R. Shekhar, and J. G. Harris, "Spiking neuron computation with the time
machine," IEEE Trans. Biomedical Circuits and Systems, vol. 6, no. 2, pp. 142-155,
2012.
[71] P. Kanerva, Sparse Distributed Memory. Cambridge Massachusetts: MIT Press, 1988.
[72] P. J. Denning, Sparse Distributed Memory. Research Institute for Advanced Computer
Science (NASA Ames Research Center), 1989.
[73] E. Lehtonen, J. H. Poikonen, M. Laiho, and P. Kanerva, "Large-scale memristive associative
memories," IEEE Transactions on Very Large Scale Integration (VLSI) Systems,
vol. 22, no. 3, pp. 562-574, 2014.
[74] M. Lindell et al., "Configurable sparse distributed memory hardware implementation,"
in IEEE Int. Symp. Circuits and Systems. (ISCAS), 1991, pp. 3078-3081.
[75] J. Saarinen et al., "VLSI architectures of sparse distributed memory," in IEEE Int.
Symp. Circuits and Systems. (ISCAS), 1991, pp. 3074-3077.
[76] J. D. Keeler et al, Notes on implementation of sparsely distributed memory, in NASA
Research Institute for Advanced Computer Science, August 1986.
[77] S.-I. Chien, I.-C. Kim, and D.-Y. Kim, "Iterative autoassociative memory models for
image recalls and pattern classifications," in IEEE International Joint Conference on
Neural Networks (IJCNN), 1991, pp. 30–35.
[78] I. Kim et al., "High performance PRAM cell scalable to sub-20nm technology with below
4F² cell size, extendable to DRAM applications," in IEEE Symp. VLSI Technology
(VLSIT), 2010, pp. 203–204.
[79] S. Aritome, "Advanced Flash memory technology and trends for file storage application,"
in Int. Electron Devices Meeting (IEDM), 2000, pp. 763–766.
[80] T. Kobayashi, K. Nogami, T. Shirotori, and Y. Fujimoto, "A current-controlled latch
sense amplifier and a static power-saving input buffer for low-power architecture,"
vol. 76, no. 5, pp. 863–867, 1993.
[81] A. Verma and B. Razavi, "Frequency-based measurement of mismatches between small
capacitors," in IEEE Custom Integrated Circuits Conference (CICC), September 2006,
pp. 481–484.
[82] K. Kim, H. Mahmoodi, and K. Roy, "A low-power SRAM using bit-line charge-recycling
technique," in Int. Symp. Low Power Electronics and Design (ISLPED), 2007,
pp. 177–182.
[83] D. Liu, T. Chen, S. Liu, J. Zhou, S. Zhou, O. Temam, X. Feng, X. Zhou, and Y. Chen,
"PuDianNao: A polyvalent machine learning accelerator," in ACM SIGARCH Computer
Architecture News, vol. 43, no. 1. ACM, 2015, pp. 369–381.
[84] C. Farabet, C. Poulet, J. Y. Han, and Y. LeCun, "CNP: An FPGA-based processor for
convolutional networks," in International Conference on Field Programmable Logic and
Applications, 2009, pp. 32–37.
[85] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and
O. Temam, "DaDianNao: A machine-learning supercomputer," in Annual IEEE/ACM
International Symposium on Microarchitecture, Dec 2014, pp. 609–622.
[86] B. Murmann, D. Bankman, E. Chai, D. Miyashita, and L. Yang, "Mixed-signal circuits
for embedded machine-learning applications," in IEEE 49th Asilomar Conference on
Signals, Systems and Computers, 2015, pp. 1341–1345.
[87] C. Cummins, P. Petoumenos, Z. Wang, and H. Leather, "Synthesizing benchmarks for
predictive modeling," in IEEE/ACM International Symposium on Code Generation and
Optimization (CGO), 2017, pp. 86–99.
[88] Y.-H. Chen, T. Krishna, J. Emer, and V. Sze, "Eyeriss: An energy-efficient
reconfigurable accelerator for deep convolutional neural networks," in IEEE International
Solid-State Circuits Conference (ISSCC), 2016, pp. 262–263.
[89] K. P. Murphy, Y. Weiss, and M. I. Jordan, "Loopy belief propagation for approximate
inference: An empirical study," in Proceedings of the Fifteenth Conference on
Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., 1999, pp. 467–475.
