TiM-DNN: Ternary in-Memory accelerator for Deep Neural Networks by Jain, Shubham et al.
TiM-DNN: Ternary in-Memory accelerator for Deep
Neural Networks
Shubham Jain, Sumeet Kumar Gupta, Anand Raghunathan
School of Electrical and Computer Engineering, Purdue University
{jain130,guptask,raghunathan}@purdue.edu
Abstract—The use of lower precision has emerged as a popular
technique to optimize the compute and storage requirements of
complex Deep Neural Networks (DNNs). In the quest for lower
precision, recent studies have shown that ternary DNNs, which
represent weights and activations by signed ternary values, rep-
resent a promising sweet spot, and achieve accuracy close to full-
precision networks on complex tasks such as language modeling
and image classification. We propose TiM-DNN, a programmable,
in-memory accelerator that is specifically designed to execute
ternary DNNs. TiM-DNN supports various ternary representa-
tions including unweighted (-1,0,1), symmetric weighted (-a,0,a),
and asymmetric weighted (-a,0,b) ternary systems. TiM-DNN
is designed using TiM tiles — specialized memory arrays that
perform massively parallel signed vector-matrix multiplications
on ternary values with a single access. TiM tiles are in turn
composed of Ternary Processing Cells (TPCs), new bit-cells
that function as both ternary storage units and signed scalar
multiplication units. We evaluate an implementation of TiM-DNN
in 32nm technology using an architectural simulator calibrated
with SPICE simulations and RTL synthesis. TiM-DNN achieves
a peak performance of 114 TOPs/s, consumes 0.9W power,
and occupies 1.96mm2 chip area, representing a 300X and
388X improvement in TOPS/W and TOPS/mm2, respectively,
compared to a state-of-the-art NVIDIA Tesla V100 GPU. In com-
parison to popular DNN accelerators, TiM-DNN achieves 55.2X-
240X and 160X-291X improvement in TOPS/W and TOPS/mm2,
respectively. We compare TiM-DNN with a well-optimized near-
memory accelerator for ternary DNNs across a suite of state-of-
the-art DNN benchmarks including both deep convolutional and
recurrent neural networks, demonstrating 3.9x-4.7x improvement
in system-level energy and 3.2x-4.2x speedup.
I. INTRODUCTION
The advent of DNNs has drastically advanced the field of
machine learning by enabling super-human accuracies for
many cognitive tasks involved in image, video, and natural
language processing [1]. However, the high computation and
storage costs of DNNs severely limit their ubiquitous adoption
in energy and cost-constrained IoT devices [2].
The use of lower precision to represent the weights and
activations in DNNs is a promising technique for realizing
DNN inference (evaluation of pre-trained DNN models) on
energy-constrained platforms [3]–[14]. Reduced bit-precision
can lower all facets of energy consumption including computation,
memory and interconnects. Current commercial hardware
[15], [16] includes widespread support for 8-bit and 4-bit
fixed point DNN inference, and recent research has continued
the push towards even lower precision [4]–[12].
This work was supported in part by C-BRIC, one of six centers in JUMP, a
Semiconductor Research Corporation (SRC) program sponsored by DARPA.
Recent studies [4]–[12], [17] suggest that among low-
precision networks, ternary DNNs represent a promising
sweet-spot in the tradeoff between efficiency and accuracy.
Figure 1 shows the reported accuracies of various state-of-the-
art binary [4]–[6], ternary [7]–[12], and full-precision (FP32)
networks on complex image classification (ImageNet [18])
and language modeling (PTB [19]) tasks. We observe that the
accuracy degradation of binary DNNs over the FP32 networks
can be considerable (5-13% on image classification, 150-
180 PPW [Perplexity Per Word] on language modeling). In
contrast, ternary DNNs achieve accuracy significantly better
than binary networks (and much closer to FP32 networks).
Motivated by these results, we focus on the design of a
programmable accelerator for realizing various state-of-the-art
ternary DNNs.
Fig. 1: Accuracy comparison of binary, ternary, and
full-precision (FP32) DNNs [4]–[12]
Ternary networks greatly simplify the multiply-and-
accumulate (MAC) operation that constitutes 95-99% of to-
tal DNN computations. Consequently, the amount of energy
and time spent on DNN computations can be drastically
improved by using lower-precision processing elements (the
complexity of a MAC operation has a super-linear relationship
with precision). However, when classical accelerator architectures
(e.g., TPU and GPU) are adopted to realize ternary DNNs, the
on-chip memory becomes the energy and performance bottleneck,
owing to sequential (row-by-row) reads and leakage in un-accessed
rows. In-memory computing [20]–[30] is
an emerging computing paradigm that overcomes memory
bottlenecks by integrating computations within the memory
array itself, enabling much greater parallelism and reducing
the need to transfer data to/from memory. This work explores
in-memory computing in the specific context of ternary DNNs
and demonstrates that it leads to significant improvements in
performance and energy efficiency.
arXiv:1909.06892v2 [cs.LG] 30 Sep 2019
While several efforts have explored in-memory accelerators
in recent years, TiM-DNN differs in significant ways and is
the first to apply in-memory computing (massively parallel
vector-matrix multiplications within the memory array itself)
to ternary DNNs using a new CMOS based bit-cell. Many in-
memory accelerators use non-volatile memory (NVM) tech-
nologies such as PCM and ReRAM [20]–[25] to realize in-
memory dot product operations. While NVMs promise much
higher density and lower leakage than CMOS memories,
they are still an emerging technology with open challenges
such as large-scale manufacturing yield, limited endurance,
high write energy, and errors due to device and circuit-level
non-idealities [31], [32]. Other efforts have explored SRAM-
based in-memory accelerators for binary networks [26]–[30].
However, the restriction to binary networks is a significant
limitation as binary networks known to date incur a large
drop in accuracy as highlighted in Figure 1. Near-memory
accelerators for ternary networks [10], [33] have also been
proposed, but their performance and energy benefits are lim-
ited by the on-chip memory due to sequential (row-by-row)
reads. Extended SRAMs that perform bitwise binary oper-
ations in-memory need to be augmented with near-memory
logic to perform higher precision computations in a bit-serial
manner [34]. However, such an approach suffers from similar
bottlenecks, limiting efficiency. In contrast, we propose TiM-
DNN, a programmable in-memory accelerator that can realize
massively parallel signed ternary vector-matrix multiplications
per array access. TiM-DNN supports various ternary representations
including unweighted (-1,0,1), symmetric weighted (-a,0,a),
and asymmetric weighted (-a,0,b) systems, enabling it
to execute a broad range of state-of-the-art ternary DNNs. This
is motivated by recent efforts [8] that show weighted ternary
systems can achieve improved accuracies.
The building block of TiM-DNN is a new memory cell
called the Ternary Processing Cell (TPC), which functions as
both a ternary storage unit and a scalar ternary multiplication
unit. Using TPCs, we design TiM tiles, which are specialized
memory arrays to execute signed ternary dot-product operations.
TiM-DNN comprises a plurality of TiM tiles arranged
into banks, wherein all tiles compute signed vector-matrix
multiplications in parallel.
In summary, the key contributions of our work are:
• We present TiM-DNN, a programmable in-memory ac-
celerator supporting various ternary representations in-
cluding unweighted (-1,0,1), symmetric weighted (-a,0,a),
and asymmetric weighted (-a,0,b) systems for realizing a
broad range of ternary DNNs.
• We propose a Ternary Processing Cell (TPC) that functions
as both a ternary storage unit and a ternary scalar multiplication
unit, and a TiM tile, a specialized memory
array that realizes signed vector-matrix multiplication operations
with ternary values.
• We develop an architectural simulator for evaluating
TiM-DNN, with array-level timing and energy models
obtained from circuit-level simulations. We evaluate an
implementation of TiM-DNN in 32nm CMOS using a
suite of 5 popular DNNs designed for image classification
and language modeling tasks. A 32-tile instance of TiM-
DNN achieves a peak performance of 114 TOPs/s, con-
sumes 0.9W power, and occupies 1.96mm2 chip area,
representing a 300X improvement in TOPS/W compared
to a state-of-the-art NVIDIA Tesla V100 GPU [15]. In
comparison to low-precision accelerators [33], [34], TiM-
DNN achieves 55.2X-240X improvement in TOPS/W.
TiM-DNN also obtains 3.9x-4.7x improvement in system
energy and 3.2x-4.2x improvement in performance over
a well-optimized near-memory accelerator for ternary
DNNs.
II. RELATED WORK
In recent years, several research efforts have focused on
improving the energy efficiency and performance of DNNs
at various levels of design abstraction. In this section, we
limit our discussion to efforts on in-memory computing for
DNNs [20]–[30], [34]–[38].
TABLE I: Related work summary
Table I classifies prior in-memory computing efforts based
on the memory technology [CMOS and Non-Volatile Memory
(NVMs)] and the targeted precision. One group of efforts [20]–
[25] has focused on in-memory DNN accelerators using
emerging Non-Volatile memory (NVM) technology such as
PCM and ReRAM. Although NVMs promise higher density and
lower leakage than CMOS, they still face several open
challenges such as large-scale manufacturing yield, limited
endurance, high write energy, and errors due to device and
circuit-level non-idealities [31], [32]. Efforts on SRAM-based
in-memory accelerators can be classified into those that target
binary [26]–[30] and high-precision [34]–[36] DNNs. Accel-
erators targeting binary DNNs [26]–[30] can execute mas-
sively parallel vector-matrix multiplication per array access.
However, the restriction to binary networks is a significant
limitation as binary networks known to date incur a large
drop in accuracy as highlighted in Figure 1. Efforts [34]–
[36] that target higher precision (4-8 bit) DNNs require
multiple execution steps (array accesses) to realize signed dot-product
operations, wherein both weights and activations are
signed numbers. For example, Neural cache [34] computes
bitwise Boolean operations in-memory but uses bit-serial near-
memory arithmetic to realize multiplications, requiring several
array accesses per multiplication operation (and many more
to realize dot-products). Apart from in-memory computing
efforts, Table I also details efforts targeting near-memory
accelerators for ternary networks [10], [33]. However, the
efficiency of these near-memory accelerators is limited by the
on-chip memory, as they can enable only one memory row
per access. Further, none of these efforts support asymmetric
weighted (-a,0,b) ternary systems.
In contrast to previous proposals, TiM-DNN is the first spe-
cialized and programmable in-memory accelerator for ternary
DNNs that supports various ternary representations including
unweighted (-1,0,1), symmetric weighted (-a,0,a), and asym-
metric weighted (-a,0,b) ternary systems. TiM-DNN utilizes a
new CMOS based bit-cell (i.e., TPC) and enables multiple
memory rows simultaneously to realize massively parallel
in-memory signed vector-matrix multiplications with ternary
values per memory access, enabling efficient realization of
ternary DNNs. As illustrated in our experimental evalua-
tion, TiM-DNN achieves 3.9x-4.7x improvement in system-
level energy and 3.2x-4.2x speedup over a well-optimized
near-memory accelerator. In comparison to the near-memory
ternary accelerator [33], it achieves 55.2X improvement in
TOPS/W.
III. TIM-DNN ARCHITECTURE
In this section, we present the proposed TiM-DNN accelerator
along with its building blocks, i.e., Ternary Processing Cells
and TiM tiles.
A. Ternary Processing Cell (TPC)
To enable in-memory signed multiplication with ternary
values, we present a new Ternary Processing Cell (TPC) that
operates as both a ternary storage unit and a ternary scalar
multiplication unit. Figure 2 shows the proposed TPC circuit,
which consists of two cross-coupled inverters for storing two
bits (‘A’ and ‘B’), a write wordline (WLW), two source lines
(SL1 and SL2), two read wordlines (WLR1 and WLR2) and
two bitlines (BL and BLB). A TPC supports two operations:
write and scalar ternary multiplication. A write operation is
performed by enabling WLW and driving the source-lines and
the bitlines to either VDD or 0 depending on the data. We can
write both bits simultaneously, with ‘A’ written using BL and
SL2 and ‘B’ written using BLB and SL1. Using both bits ‘A’
and ‘B’, a ternary value (-1,0,1) is inferred based on the storage
encoding shown in Figure 2 (Table on the top). For example,
when A=0 the TPC stores W=0. When A=1 and B=0 (B=1)
the TPC stores W=1 (W=-1).
Fig. 2: Ternary Processing Cell (TPC) circuit and encoding scheme
A scalar multiplication in a TPC is performed between a
ternary input and the stored weight to obtain a ternary output.
The bitlines are precharged to VDD, and subsequently, the
ternary inputs are applied to the read wordlines (WLR1 and
WLR2) based on the input encoding scheme shown in Figure 2
(Table on the bottom). The final bitline voltages (VBL and
VBLB) depend on both the input (I) and the stored weight
(W). The table in Figure 3 details the possible outcomes of
the scalar ternary multiplication (W*I) with the final bitline
voltages and the inferred ternary output (Out). For example,
when W=0 or I=0, the bitlines remain at VDD and the output
is inferred as 0 (W*I=0). When W=I=±1, BL discharges
by a certain voltage, denoted by ∆, and BLB remains at
VDD. This is inferred as Out=1. In contrast, when W=-I=±1,
BLB discharges by ∆ and BL remains at VDD producing
Out=-1. The final bitline voltages are converted to a ternary
output using single-ended sensing at BL and BLB. Figure 3
depicts the output encoding scheme and the results of SPICE
simulation of the scalar multiplication operation with various
possible final bitline voltages. Note that the TPC design uses
separate read and write paths to avoid read disturb failures
during in-memory multiplications.
Fig. 3: Scalar multiplication using a TPC
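The storage encoding and the multiplication outcomes described above can be summarized in a short behavioural sketch. It is an illustration only, not a circuit model: the function names `decode_weight` and `tpc_multiply` are hypothetical, and bitline discharge is expressed in units of ∆ rather than volts.

```python
def decode_weight(a: int, b: int) -> int:
    """Storage encoding from the text: A=0 -> 0; A=1,B=0 -> +1; A=1,B=1 -> -1."""
    if a == 0:
        return 0
    return 1 if b == 0 else -1


def tpc_multiply(w: int, i: int) -> tuple[int, int, int]:
    """Return (BL discharge, BLB discharge, Out); discharges in units of delta."""
    if w == 0 or i == 0:
        return (0, 0, 0)       # both bitlines stay at VDD, Out = 0
    if w == i:
        return (1, 0, 1)       # W = I = +/-1: BL discharges by delta, Out = +1
    return (0, 1, -1)          # W = -I = +/-1: BLB discharges by delta, Out = -1


# Enumerate all nine (W, I) combinations and check Out against W * I.
for w in (-1, 0, 1):
    for i in (-1, 0, 1):
        bl, blb, out = tpc_multiply(w, i)
        assert out == w * i
        print(f"W={w:+d} I={i:+d} -> BL drop={bl}, BLB drop={blb}, Out={out:+d}")
```

At most one of the two bitlines discharges for any input and weight pair, which is what allows single-ended sensing on BL and BLB to recover the sign of the product.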
B. Dot-product computation using TPCs
Next, we extend the idea of realizing a scalar multiplication
using the TPC to a dot-product computation. Figure 4(a) illustrates
the mapping of a dot-product operation
($\sum_{i=1}^{L} \mathrm{Inp}[i] \ast W[i]$) to a column of TPCs with shared bitlines.
To compute, first the bitlines are precharged to VDD, and then the inputs
(Inp) are applied to all TPCs simultaneously. The bitlines (BL
and BLB) function as an analog accumulator, wherein the final
bitline voltages (VBL and VBLB) represent the sum of the
individual TPC outputs. For example, if ‘n’ and ‘k’ of the ‘L’ TPCs
have outputs 1 and -1, respectively, the final bitline voltages
are VBL = VDD − n∆ and VBLB = VDD − k∆. The bitline
voltages are converted using Analog-to-Digital converters
(ADCs) to yield digital values ‘n’ and ‘k’. For the unweighted
encoding where the ternary weights are encoded as (-1,0,1),
the final dot-product is ‘n-k’. Figure 4(b) shows the sensing
circuit required to realize dot-products with the unweighted
(-1,0,1) ternary system.
Fig. 4: Dot-product computation using TPCs: (a) Analog
accumulation using BL and BLB, (b) sensing circuit for
unweighted (-1,0,1) ternary system
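The analog accumulation on a shared column can be sketched as follows. This is a minimal behavioural model, assuming the bitline voltage drops linearly by ∆ per non-zero output; the function name `column_dot_product` is hypothetical, the default `vdd` of 1.0 V is an illustrative value not taken from the paper, and the default `delta` uses the 96 mV average sensing margin reported later in this section.

```python
def column_dot_product(inputs, weights, vdd=1.0, delta=0.096):
    """Behavioural model of one TPC column with shared bitlines.

    vdd is an assumed supply voltage; delta is the per-TPC discharge step.
    Returns (n, k, V_BL, V_BLB, dot_product).
    """
    assert len(inputs) == len(weights)
    products = [i * w for i, w in zip(inputs, weights)]
    n = products.count(1)        # TPCs that discharge BL
    k = products.count(-1)       # TPCs that discharge BLB
    v_bl = vdd - n * delta       # V_BL  = VDD - n*delta
    v_blb = vdd - k * delta      # V_BLB = VDD - k*delta
    return n, k, v_bl, v_blb, n - k


inputs = [1, -1, 0, 1, 1, -1, 0, 1]
weights = [1, 1, -1, 0, -1, -1, 1, 1]
n, k, v_bl, v_blb, dot = column_dot_product(inputs, weights)
assert dot == sum(i * w for i, w in zip(inputs, weights))
print(f"n={n}, k={k}, V_BL={v_bl:.3f} V, V_BLB={v_blb:.3f} V, dot={dot}")
```

In the hardware, `n` and `k` are not counted digitally; they are what the two ADCs recover from the final bitline voltages.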
We can also realize dot-products with a more general ternary
encoding represented by asymmetric weighted (-a,0,b) values.
Support for a more general ternary encoding is motivated by
recent efforts [8] that show weighted ternary systems can
achieve improved accuracies. Figure 5(a) shows the sens-
ing circuit that enables dot-product with asymmetric ternary
weights (−W2, 0,W1) and inputs (−I2, 0, I1). As shown, the
ADC outputs are scaled by the corresponding weights (W1 and
W2), and subsequently, an input scaling factor (Iα) is applied
to yield ‘Iα(W1*n-W2*k)’. In contrast to dot-products with
unweighted values, we require two execution steps to realize
dot-products with the asymmetric ternary system, wherein
each step computes a partial dot-product (pOut). Figure 5(b)
details these two steps using an example. In step 1, we choose
Iα=I1, and apply I1 and I2 as ‘1’ and ‘0’, respectively,
resulting in a partial output (pOut) given by pOut1 = I1(W1*n-W2*k).
In step 2, we choose Iα=-I2, and apply I1 and I2 as
‘0’ and ‘1’, respectively, to yield pOut2 = -I2(W1*n-W2*k).
The final dot-product is given by ‘pOut1+pOut2’.
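The two-step procedure can be checked numerically with a small sketch. The helper names `count_nk` and `asymmetric_dot` and the scale factors used in the example are hypothetical; the sketch assumes that the counts n and k are sensed separately in each step, and it uses integer scale factors so that equality comparisons are exact.

```python
def count_nk(encoded_inputs, encoded_weights):
    """Count a column's +1 outputs (n) and -1 outputs (k) for unweighted codes."""
    prods = [i * w for i, w in zip(encoded_inputs, encoded_weights)]
    return prods.count(1), prods.count(-1)


def asymmetric_dot(inputs, weights, i1, i2, w1, w2):
    """inputs take values in {-i2, 0, i1}; weights take values in {-w2, 0, w1}."""
    # Ternary code held in the TPCs: +1 for W1, -1 for -W2, 0 otherwise.
    w_code = [1 if w == w1 else -1 if w == -w2 else 0 for w in weights]

    # Step 1: inputs equal to I1 are applied as '1', all others as '0'.
    n, k = count_nk([1 if x == i1 else 0 for x in inputs], w_code)
    p_out1 = i1 * (w1 * n - w2 * k)

    # Step 2: inputs equal to -I2 are applied as '1', all others as '0'.
    n, k = count_nk([1 if x == -i2 else 0 for x in inputs], w_code)
    p_out2 = -i2 * (w1 * n - w2 * k)

    return p_out1 + p_out2


# Illustrative scale factors (assumed values, not from the paper).
I1, I2, W1, W2 = 2, 3, 5, 7
inputs = [2, -3, 0, 2, -3, 2]
weights = [5, -7, 5, -7, 5, 0]
result = asymmetric_dot(inputs, weights, I1, I2, W1, W2)
assert result == sum(x * w for x, w in zip(inputs, weights))
print(result)
```

Each step uses only the unweighted column operation; the scaling by W1, W2 and Iα happens after digitization, which is why the same TPC array can serve all three ternary systems.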
Fig. 5: (a) Sensing circuit for asymmetric weighted (-a,0,b)
ternary system, (b) Example dot-product computation with
asymmetric weighted ternary values
Fig. 6: Dot-product circuit simulation
To validate the dot-product operation, we perform a detailed
SPICE simulation to determine the possible final voltages at
BL (VBL) and BLB (VBLB). Figure 6 shows various BL states
(S0 to S10) and the corresponding value of VBL and ‘n’. Note
that the possible values for VBLB (‘k’) and VBL (‘n’) are
identical, as BL and BLB are symmetric. The state Si refers
to the scenario where ‘i’ out of ‘L’ TPCs compute an output
of ‘1’. We observe that from S0 to S7 the average sensing
margin (∆) is 96 mV. The sensing margin decreases to 60-80 mV
for states S8 to S10, and beyond S10 the bitline voltage
(VBL) saturates. Therefore, we can achieve a maximum of 11
BL states (S0 to S10) with sufficiently large sensing margin
required for sensing reliably under process variations [29].
The maximum value of ‘n’ and ‘k’ is thus 10, which in turn
determines the number of TPCs (‘L’) that can be enabled
simultaneously. Setting L = nmax = kmax would be a
conservative choice. However, exploiting the weight and input
sparsity of ternary DNNs [9], [11], [12], wherein 40% or more
of the elements are zeros, and the fact that non-zero outputs
are distributed between ‘1’ and ‘-1’, we choose a design with
nmax = 8, and L = 16. Our experiments indicate that this
choice has no effect on the final DNN accuracy compared to
the conservative case. In this paper, we also evaluate the impact
of process variations on the dot-product operations realized
using TPCs, and provide the experimental results on variations
in Section V-F.
Fig. 7: Ternary in-Memory processing tile
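The intuition behind choosing nmax = 8 with L = 16 can be illustrated with a back-of-the-envelope estimate. The sketch below is not the analysis used in the paper: it assumes that inputs and weights are zero independently with probability 0.4 and that a non-zero product is equally likely to be +1 or -1, and the helper name `prob_count_exceeds` is hypothetical.

```python
from math import comb


def prob_count_exceeds(n_max: int, rows: int, p: float) -> float:
    """P(X > n_max) for X ~ Binomial(rows, p)."""
    return sum(comb(rows, x) * p**x * (1 - p)**(rows - x)
               for x in range(n_max + 1, rows + 1))


# Assumptions (not from the paper): operands are independently zero with
# probability 0.4, and a non-zero product is +1 or -1 with equal chance.
p_nonzero = (1 - 0.4) * (1 - 0.4)   # both input and weight are non-zero
p_plus_one = p_nonzero / 2          # this TPC discharges BL

overflow = prob_count_exceeds(n_max=8, rows=16, p=p_plus_one)
print(f"P(n > 8) per column access ~ {overflow:.2e}")
```

Under these assumptions, a column rarely produces more than eight outputs of the same sign, which is consistent with the observation that the choice does not affect final accuracy. Real activation and weight distributions are not independent, so the estimate is only indicative.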
C. TiM tile
We now present the TiM tile, i.e., a specialized memory
array designed using TPCs to realize massively parallel vector-
matrix multiplications with ternary values. Figure 7 details
the tile design, which consists of a 2D array of TPCs, a
row decoder and write wordline driver, a block decoder, Read
Wordline Drivers (RWDs), column drivers, a sample and hold
(S/H) unit, a column mux, Peripheral Compute Units (PCUs),
and scale factor registers. The TPC array contains ‘L*K*N’
TPCs, arranged in ‘K’ blocks and ‘N’ columns, where each
block contains ‘L’ rows. As shown in the Figure, TPCs in the
same row (column) share wordlines (bitlines and source-lines).
The tile supports two major functions, (i) programming, i.e.,
row-by-row write operations, and (ii) a vector-matrix multipli-
cation operation. A write operation is performed by activating
a write wordline (WLW) using the row decoder and driving
the bitlines and source-lines. During a write operation, ‘N’
ternary words (TWs) are written in parallel. In contrast to
the row-wise write operation, a vector-matrix multiplication
operation is realized at the block granularity, wherein ‘N’ dot-
product operations each of vector length ‘L’ are executed in
parallel. The block decoder selects a block for the vector-
matrix multiplication, and RWDs apply the ternary inputs.
During the vector-matrix multiplication, TPCs in the same row
share the ternary input (Inp), and TPCs in the same column
produce partial sums for the same output. As discussed in
Section III-B, accumulation is performed in the analog domain
using the bitlines (BL and BLB). In one access, a TiM tile can
compute the vector-matrix product Inp·W, where Inp is a
vector of length L and W is a matrix of dimension LxN stored
in TPCs. The accumulated outputs at each column are stored
using a sample and hold (S/H) unit and get digitized using
PCUs. To attain higher area efficiency, we utilize ‘M’ PCUs
per tile (‘M’ < ‘N’) by matching the bandwidth of the PCUs
to the bandwidth of the TPC array and operating the PCUs
and TPC array as a two-stage pipeline. Next, we discuss the
TiM tile peripherals in detail.
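The two tile functions, row-wise programming and block-granular vector-matrix multiplication, can be sketched behaviourally as below. The class `TiMTile` and its method names are hypothetical, only the unweighted (-1,0,1) encoding is modelled, and the S/H unit, column mux, and PCU pipelining are omitted.

```python
class TiMTile:
    """Behavioural sketch: K blocks, each holding an L x N ternary matrix."""

    def __init__(self, K: int, L: int, N: int):
        self.K, self.L, self.N = K, L, N
        self.blocks = [[[0] * N for _ in range(L)] for _ in range(K)]

    def write_row(self, block: int, row: int, words: list[int]) -> None:
        """Row-wise programming: N ternary words written in parallel."""
        assert len(words) == self.N and all(w in (-1, 0, 1) for w in words)
        self.blocks[block][row] = list(words)

    def vmm(self, block: int, inp: list[int]) -> list[int]:
        """One block access: N dot-products, each of vector length L."""
        assert len(inp) == self.L
        W = self.blocks[block]
        out = []
        for col in range(self.N):
            prods = [inp[r] * W[r][col] for r in range(self.L)]
            n, k = prods.count(1), prods.count(-1)   # what the two ADCs report
            out.append(n - k)
        return out


tile = TiMTile(K=2, L=4, N=3)
rows = [[1, 0, -1], [0, 1, 1], [-1, -1, 0], [1, 1, 1],     # block 0
        [0, 0, 1], [1, -1, 0], [-1, 0, 0], [0, 1, -1]]     # block 1
for idx, words in enumerate(rows):
    tile.write_row(idx // tile.L, idx % tile.L, words)

inp = [1, -1, 0, 1, -1, 1, 1, 0]     # length K*L, split across the two blocks
psum = [0] * tile.N
for b in range(tile.K):
    partial = tile.vmm(b, inp[b * tile.L:(b + 1) * tile.L])
    psum = [p + q for p, q in zip(psum, partial)]   # reduce partial sums
print(psum)
```

An input vector longer than L is split across blocks, and each block access contributes one partial sum per column; the per-column accumulation in the loop stands in for the partial-sum reduction that the peripherals perform.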
Read Wordline Driver (RWD). Figure 7 shows the RWD
logic that takes a ternary vector (Inp) and block enable (bEN)
signal as inputs and drives all ‘L’ read wordlines (WLR1 and
WLR2) of a block. The block decoder generates the bEN
signal based on the block address that is an input to the TiM
tile. WLR1 and WLR2 are activated using the input encoding
scheme shown in Figure 2 (Table on the bottom).
Peripheral Compute Unit (PCU). Figure 7 shows the logic
for a PCU, which consists of two ADCs and a few small
arithmetic units (adders and multipliers). The primary function
of PCUs is to convert the bitline voltages to digital values
using ADCs. However, PCUs also enable other key functions
such as partial sum reduction, and weight (input) scaling
for weighted ternary encoding (-W2,0,W1) and (-I2,0,I1).
Although the PCU can be simplified if W2=W1=1 and/or
I2=I1=1, in this work, we target a programmable TiM tile
that can support various state-of-the-art ternary DNNs. To
further generalize, we use a shifter to support DNNs with
ternary weights and higher precision activations [9], [12].
The activations are evaluated bit-serially using multiple TiM
accesses. Each access uses an input bit, and we shift the
computed partial sum based on the input bit significance using
the shifter. TiM tiles have scale factor registers (shown in
Figure 7) to store the weight and the activation scale factors
that vary across layers within a network.
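The bit-serial evaluation of higher-precision activations can be sketched as follows. This is a minimal illustration that assumes unsigned fixed-point activations (the text does not specify the number format); the function names `ternary_dot` and `bit_serial_dot` are hypothetical.

```python
def ternary_dot(bits, weights):
    """One TiM access: binary inputs (0/1) against ternary weights."""
    prods = [b * w for b, w in zip(bits, weights)]
    return prods.count(1) - prods.count(-1)


def bit_serial_dot(activations, weights, num_bits):
    """Unsigned `num_bits`-bit activations, one bit plane per TiM access."""
    acc = 0
    for bit in range(num_bits):
        plane = [(a >> bit) & 1 for a in activations]
        acc += ternary_dot(plane, weights) << bit   # shift by bit significance
    return acc


activations = [3, 0, 2, 1]          # 2-bit unsigned values
weights = [1, -1, -1, 1]            # ternary weights
result = bit_serial_dot(activations, weights, num_bits=2)
assert result == sum(a * w for a, w in zip(activations, weights))
print(result)
```

An activation precision of b bits therefore costs b array accesses per dot-product, with the shifter and accumulator in the PCU combining the per-bit partial sums.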
D. TiM-DNN accelerator architecture
Figure 8 shows the proposed TiM-DNN accelerator, which has
a hierarchical organization with multiple banks, wherein each
bank comprises several TiM tiles, an activation buffer, a
partial sum (Psum) buffer, a global Reduce Unit (RU), a Spe-
cial Function Unit (SFU), an instruction memory (Inst Mem),
and a Scheduler. The compute time and energy in Ternary
DNNs are heavily dominated by vector-matrix multiplications
which are realized using TiM tiles. Other DNN functions,
viz., ReLU, pooling, normalization, Tanh and Sigmoid are
performed by the SFU. The partial sums produced by different
TiM tiles are reduced using the RU, whereas the partial sums
produced by separate blocks within a tile are reduced using
PCUs (as discussed in Section III-C). TiM-DNN has a small
instruction memory and a Scheduler that reads instructions and
orchestrates operations inside a bank. TiM-DNN also contains
activation and Psum buffers to store activations and partial
sums, respectively.
Fig. 8: TiM-DNN accelerator architecture
Fig. 9: TiM-DNN mapping: Example
Mapping. DNNs can be mapped to TiM-DNN both temporally
and spatially. The networks that fit on TiM-DNN entirely are
mapped spatially, wherein the weight matrix of each convo-
lution (Conv) and fully-connected (FC) layer is partitioned
and mapped to dedicated (one or more) TiM tiles, and the
network executes in a pipelined fashion. In contrast, networks
that cannot fit on TiM-DNN at once are executed using the
temporal mapping strategy, wherein we execute Conv and
FC layers sequentially over time using all TiM tiles. The
weight matrix (W) of each Conv/FC layer can be either
smaller or larger than the total weight capacity (TWC) of TiM-
DNN. Figure 9 illustrates the two scenarios using an example
workload (vector-matrix multiplication) that is executed on
two separate TiM-DNN instances differing in the number of
TiM tiles. As shown, when (W < TWC) the weight matrix
partitions (W1 & W2) are replicated and loaded to multiple
tiles, and each TiM tile computes on input vectors in parallel.
In contrast, when (W > TWC), the operations are executed
sequentially using multiple steps.
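The two mapping scenarios can be expressed as a simple capacity check. The sketch below is a simplification that compares only total word counts and ignores how a matrix is actually partitioned across tiles; the helper `mapping_plan` is hypothetical, and the tile count and tile size are those of the instance evaluated in Section IV.

```python
from math import ceil

TILES = 32
WORDS_PER_TILE = 256 * 256           # one ternary word per TPC
TWC = TILES * WORDS_PER_TILE         # total weight capacity, in ternary words


def mapping_plan(layer_words: int, twc: int = TWC) -> dict:
    """Hypothetical helper: classify a layer against the total weight capacity."""
    if layer_words <= twc:
        # Layer fits: spare capacity can hold replicas that operate on
        # different input vectors in parallel.
        return {"steps": 1, "replicas": twc // layer_words}
    # Layer does not fit: tiles are reprogrammed and reused over several steps.
    return {"steps": ceil(layer_words / twc), "replicas": 1}


print(TWC)                              # ternary words available on chip
print(mapping_plan(TWC // 4))           # small layer: replicated
print(mapping_plan(3 * TWC + 1))        # large layer: sequential steps
```

Replication trades unused capacity for throughput, whereas the sequential case adds programming (write) cost each time the tiles are reloaded.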
IV. EXPERIMENTAL METHODOLOGY
In this section, we present our experimental methodology for
evaluating TiM-DNN.
TiM tile modeling. We perform detailed SPICE simulations
to estimate the tile-level energy and latency for the write
and vector-matrix multiplication operations. The simulations
are performed using 32nm bulk CMOS technology and PTM
models. We use 3-bit flash ADCs to convert bitline voltages
to digital values. To estimate the area and latency of digital
logic both within the tiles (PCUs and decoders) and outside
the tiles (SFU and RU), we synthesized RTL implementa-
tions using Synopsys Design Compiler and estimated power
consumption using Synopsys Power Compiler. We performed
the TPC layout (Figure 10) to estimate its area, which is
about 720F² (where F is the minimum feature size). We also
performed variation analysis to estimate error rates due to
incorrect sensing by considering variations in transistor VT
(σ/µ=5%) [39].
Fig. 10: Ternary Processing Cell (TPC) layout
System-level simulation. We developed an architectural sim-
ulator to estimate application-level energy and performance
benefits of TiM-DNN. The simulator maps various DNN
operations, viz., vector-matrix multiplications, pooling, ReLU,
etc. to TiM-DNN components and produces execution traces
consisting of off-chip accesses, write and in-memory oper-
ations in TiM tiles, buffer reads and writes, and RU and
SFU operations. Using these traces and the timing and energy
models from circuit simulation and synthesis, the simulator
computes the application-level energy and performance.
TiM-DNN parameters. Table II details the micro-
architectural parameters for the instance of TiM-DNN
used in our evaluation, which contains 32 TiM tiles, with
each tile having 256x256 TPCs. The SFU consists of 64
ReLU units, 8 vector processing elements (vPE) each with
4 lanes, 20 special function processing elements (SPEs),
and 32 Quantization Units (QU). SPEs compute special
functions such as Tanh and Sigmoid. The output activations
are quantized to ternary values using QUs. The latency of the
dot-product operation is 2.3 ns. TiM-DNN can achieve a peak
performance of 114 TOPs/sec, consumes ∼0.9 W power, and
occupies ∼1.96 mm2 chip area.
TABLE II: TiM-DNN micro-architectural parameters
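The quoted peak performance can be reproduced from the stated parameters with simple arithmetic. The sketch below shows one plausible accounting, not necessarily the one used by the authors: it assumes that every tile completes one 16-row access across all 256 columns every 2.3 ns and that each multiply-and-accumulate counts as two operations.

```python
TILES = 32
COLUMNS = 256          # dot-products produced per tile access
ROWS_PER_ACCESS = 16   # L, rows enabled simultaneously
OPS_PER_MAC = 2        # assumption: one multiply plus one add
LATENCY_S = 2.3e-9     # dot-product latency
POWER_W = 0.9
AREA_MM2 = 1.96

ops_per_access = TILES * COLUMNS * ROWS_PER_ACCESS * OPS_PER_MAC
peak_tops = ops_per_access / LATENCY_S / 1e12

print(f"peak ~ {peak_tops:.0f} TOPs/s")
print(f"efficiency ~ {peak_tops / POWER_W:.0f} TOPS/W, "
      f"{peak_tops / AREA_MM2:.0f} TOPS/mm2")
```

Under these assumptions the result rounds to the reported 114 TOPs/s, and dividing by the power and area figures gives the processing efficiencies used in the comparisons that follow.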
Fig. 11: Near-memory compute unit for the baseline design
Baseline. The processing efficiency (TOPS/W) of TiM-DNN is
300X better than NVIDIA’s state-of-the-art Volta V100 GPU [15].
This is to be expected, since the GPU is not specialized for
ternary DNNs. In comparison to previous near-memory ternary
accelerators [33], TiM-DNN achieves 55.2X improvement in TOPS/W.
To perform a fairer comparison and to report the benefits
exclusively due to in-memory computations enabled by the
proposed TPC, we design a well-optimized near-memory
ternary DNN accelerator. This baseline accelerator differs
from TiM-DNN in only one aspect — tiles consist of regular
SRAM arrays (256x512) with 6T bit-cells and near-memory
compute (NMC) units (shown in Figure 11), instead of
the TiM tiles. Note that to store a ternary word using the
SRAM array, we require two 6T bit-cells. The baseline tiles
are smaller than TiM tiles by 0.52x; therefore, we use two
baseline designs: (i) an iso-area baseline with 60 baseline
tiles, whose overall accelerator area is the same as that of TiM-DNN, and
(ii) an iso-capacity baseline with the same weight storage
capacity (2 Mega ternary words) as TiM-DNN. We note that
the baseline is well-optimized, and our iso-area baseline can
achieve 21.9 TOPs/sec, reflecting an improvement of 17.6X
in TOPs/sec over the near-memory accelerator for ternary DNNs
proposed in [33].
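The capacity figures for the baseline can be cross-checked with a few lines of arithmetic. The iso-capacity tile count and the throughput implied for [33] are derived here from the numbers in the text and are not stated explicitly in the paper; all variable names are illustrative.

```python
TIM_TILES = 32
TPCS_PER_TILE = 256 * 256              # one ternary word per TPC
SRAM_CELLS_PER_TILE = 256 * 512        # 6T bit-cells per baseline tile
CELLS_PER_TERNARY_WORD = 2             # two 6T bit-cells store one ternary word

tim_capacity = TIM_TILES * TPCS_PER_TILE
baseline_words_per_tile = SRAM_CELLS_PER_TILE // CELLS_PER_TERNARY_WORD

# Derived (not quoted): baseline tiles needed to match TiM-DNN's capacity.
iso_capacity_tiles = tim_capacity // baseline_words_per_tile

# Derived (not quoted): throughput of [33] implied by 21.9 TOPs/sec and 17.6X.
implied_ref33_tops = 21.9 / 17.6

print(tim_capacity, baseline_words_per_tile, iso_capacity_tiles)
print(f"{implied_ref33_tops:.2f} TOPs/sec")
```

A 256x512 SRAM tile and a 256x256 TiM tile hold the same number of ternary words, so the iso-capacity comparison isolates the effect of in-memory computation from that of storage capacity.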
DNN Benchmarks. We evaluate the system-level energy and
performance benefits of TiM-DNN using a suite of DNN
benchmarks. Table III details our benchmark applications.
We use state-of-the-art convolutional neural networks (CNN),
viz., AlexNet, ResNet-34, and Inception to perform image
classification on ImageNet. We also evaluate popular recur-
rent neural networks (RNN) such as LSTM and GRU that
perform the language modeling task on the Penn Treebank (PTB)
dataset [19]. Table III also details the activation precision and
accuracy of these ternary networks.
TABLE III: DNN benchmarks
V. RESULTS
In this section, we present various results that quantify the
improvements obtained by TiM-DNN. We also compare TiM-
DNN with other state-of-the-art DNN accelerators.
TABLE IV: Comparison with other DNN accelerators
A. Comparison with prior DNN accelerators
We first quantify the advantages of TiM-DNN over prior
DNN accelerators using processing efficiencies (TOPS/W and
TOPS/mm2) as our metric. Table IV details three prior DNN
accelerators: (i) Nvidia Tesla V100 [15], a state-of-the-art
GPU, (ii) Neural-Cache [34], a design using bitwise
in-memory operations and bit-serial near-memory arithmetic
to realize dot-products, and (iii) BRein [33], a near-memory
accelerator for ternary DNNs. As shown, TiM-DNN achieves
substantial improvements in both TOPS/W and TOPS/mm2.
GPUs, near-memory accelerators [33], and binary in-memory
accelerators [34] are less efficient than TiM-DNN as their
efficiency is still limited by the on-chip memory bandwidth,
wherein they can simultaneously access one [15], [33] or at
most two [34] memory rows. In contrast, TiM-DNN offers
additional parallelism by simultaneously accessing ‘L’ (L=16)
memory rows to compute in-memory vector-matrix multiplications.
B. Analysis of performance benefits
We analyze the performance benefits of TiM-DNN using
our two baselines (Iso-capacity and Iso-area near-memory
accelerators). Figure 12 shows the two major components of
the normalized inference time which are MAC-Ops (vector-
matrix multiplications) and Non-MAC-Ops (other DNN op-
erations) for TiM-DNN (TiM) and the baselines. Overall,
we achieve 5.1x-7.7x speedup over the Iso-capacity baseline
and 3.2x-4.2x speedup over the Iso-area baseline across our
benchmark applications. The speedups depend on the fraction
of application runtime spent on MAC-Ops, with DNNs having
higher MAC-Ops times attaining superior speedups. This is
expected as the performance benefits of TiM-DNN over the
baselines derive from accelerating MAC-Ops using in-memory
computations. Iso-area (baseline2) is faster than Iso-capacity
(baseline1) due to the higher level of parallelism available
from the additional baseline tiles. The 32-tile instance of TiM-DNN
achieves 4827, 952, 1834, 2×10^6, and 1.9×10^6 inferences/sec
for AlexNet, ResNet-34, Inception, LSTM, and GRU,
respectively. Our RNN benchmarks (LSTM and GRU) fit on
TiM-DNN entirely, leading to better inference performance
than CNNs.
Fig. 12: Performance benefits of TiM-DNN
C. Analysis of energy benefits
We now analyze the application level energy benefits of
TiM-DNN over the superior of the two baselines (Baseline2).
Figure 13 shows major energy components for TiM-DNN
and Baseline2, which are programming (writes to TiM tiles),
DRAM accesses, reads (writes) from (to) activation and Psum
buffers, operations in reduce units and special function units
(RU+SFU Ops), and MAC-Ops. As shown, TiM reduces the
MAC-Ops energy substantially and achieves 3.9x-4.7x energy
improvements across our DNN benchmarks. The primary
cause for this energy reduction is that TiM-DNN computes
on 16 rows simultaneously per array access.
Fig. 13: Energy benefits of TiM-DNN
D. Kernel-level benefits
To provide more insights on the application-level benefits,
we compare the TiM tile and the baseline tile at the kernel-
level. We consider a primitive DNN kernel, i.e., a vector-
matrix computation (Out = Inp*W, where Inp is a 1x16 vector
and W is a 16x256 matrix), and map it to both TiM and
baseline tiles. We use two variants of TiM tile, (i) TiM-8 and
(ii) TiM-16, wherein we simultaneously activate 8 wordlines
and 16 wordlines, respectively. Using the baseline tile, the
vector-matrix multiplication operation requires row-by-row
sequential reads, resulting in 16 SRAM accesses. In contrast,
TiM-16 and TiM-8 require 1 and 2 accesses, respectively.
Figure 14 shows that the TiM-8 and TiM-16 designs achieve
speedups of 6x and 11.8x, respectively, over the baseline design.
Note that the benefits are lower than 8x and 16x, respectively,
as SRAM accesses are faster than TiM-8 and TiM-16 accesses.
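The access counts above, and the gap between the ideal and reported speedups, can be worked through as follows. This is a rough sketch: it assumes the whole gap is explained by a longer per-access latency in the TiM tile, which ignores any other overheads. The helper names are hypothetical.

```python
import math

ROWS = 16  # rows of W in the 16x256 kernel from the text


def accesses(rows: int, wordlines_per_access: int) -> int:
    """Array accesses needed to cover all rows of the weight matrix."""
    return math.ceil(rows / wordlines_per_access)


def implied_latency_ratio(ideal_speedup: float, reported_speedup: float) -> float:
    """TiM access latency relative to an SRAM access, if latency alone
    explains the gap between ideal and reported speedup (an assumption)."""
    return ideal_speedup / reported_speedup


baseline_accesses = accesses(ROWS, 1)  # row-by-row reads: 16 accesses

# Reported speedups (6x, 11.8x) are taken from the text.
for name, wordlines, reported in (("TiM-8", 8, 6.0), ("TiM-16", 16, 11.8)):
    n = accesses(ROWS, wordlines)
    ideal = baseline_accesses / n
    ratio = implied_latency_ratio(ideal, reported)
    print(f"{name}: {n} access(es), ideal {ideal:.0f}x, "
          f"implied TiM/SRAM access latency ratio {ratio:.2f}")
```

Under this assumption, both variants imply a TiM access that is roughly a third slower than a conventional SRAM read.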
Fig. 14: Kernel-level benefits of TiM tile
Next, we compare the energy consumption in TiM-8, TiM-
16, and baseline designs for the above kernel computation.
In TiM-8 and TiM-16, the bit-lines are discharged twice and
once, respectively, whereas, in the baseline design the bit-
lines discharge multiple (16*2) times. Therefore, TiM tiles
achieve substantial energy benefits over the baseline design.
The additional factor ‘2’ in (16*2) arises as the SRAM array
uses two 6T bit-cells for storing a ternary word. However,
the energy benefits of TiM-8 and TiM-16 are not 16x and
32x, respectively, as TiM tiles discharge the bitlines by a
larger amount (multiple ∆s). Further, the amount by which
the bitlines get discharged in TiM tiles depends on the number
of non-zero scalar outputs. For example, in TiM-8, if 50% of
the TPC outputs in a column are zeros, the bitline discharges
by 4∆, whereas if 75% are zeros, the bitline discharges by
2∆. Thus, the energy benefits over the baseline design are
a function of the output sparsity (fraction of outputs that are
zero). Figure 14 shows the energy benefits of TiM-8 and TiM-
16 designs over the baseline design at various output sparsity
levels.
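The relationship between output sparsity and bitline discharge can be sketched as below. The function `discharge_units` is a hypothetical helper that follows the simplification used in the text: every non-zero TPC output in a column contributes one ∆ of discharge per access. It does not model how the discharge splits between BL and BLB, and it does not convert discharge into energy.

```python
def discharge_units(active_rows: int, output_sparsity: float) -> float:
    """Bitline discharge for one column and one access, in multiples of delta.

    active_rows:     wordlines activated together (8 for TiM-8, 16 for TiM-16).
    output_sparsity: fraction of TPC outputs in the column that are zero.
    """
    if not 0.0 <= output_sparsity <= 1.0:
        raise ValueError("output_sparsity must be in [0, 1]")
    # Only non-zero outputs discharge the bitline.
    return active_rows * (1.0 - output_sparsity)


for sparsity in (0.50, 0.75):
    units = discharge_units(8, sparsity)
    print(f"TiM-8, {sparsity:.0%} zeros: discharge of {units:g} delta")
```

For TiM-8 this gives 4∆ at 50% sparsity and 2∆ at 75% sparsity, so a sparser output column draws less charge from the bitline.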
E. TiM-DNN area breakdown
We now discuss the area breakdown of various components
in TiM-DNN. Figure 15 shows the area breakdown of the TiM-
DNN accelerator, a TiM tile, and a baseline tile. The major
area consumer in TiM-DNN is the TiM tile. In the TiM and
baseline tiles, area mostly goes into the core-array that consists
of TPCs and 6T bit-cells, respectively. Further, as discussed
in section IV, TiM tiles are 1.89x larger than the baseline tile
at iso-capacity. Therefore, we use the iso-area baseline with
60 tiles and compare it with TiM-DNN having 32 TiM tiles.
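The iso-area tile count follows from the area ratio. The sketch below assumes the count is obtained by scaling the number of TiM tiles by the tile-area ratio and keeping whole tiles only; it considers tile area alone, and the function name is hypothetical.

```python
import math


def iso_area_baseline_tiles(tim_tiles: int, area_ratio: float) -> int:
    """Whole baseline tiles that fit in the area of `tim_tiles` TiM tiles."""
    return math.floor(tim_tiles * area_ratio)


# 32 TiM tiles, each 1.89x the area of a baseline tile: 32 * 1.89 = 60.48
print(iso_area_baseline_tiles(32, 1.89))  # 60
```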
Fig. 15: TiM-DNN area breakdown
Fig. 16: Histogram of the bit-line voltages (VBL/VBLB)
under process variations
F. Impact of process variations
Finally, we study the impact of process variations on
the computations (i.e., ternary vector-matrix multiplications)
performed using TiM-DNN. To that end, we first perform
Monte-Carlo circuit simulation of ternary dot-product oper-
ations executed in TiM tiles with nmax = 8 and L = 16
to determine the sensing errors under random variations. We
consider variations (σ/µ = 5%) [39] in the threshold voltage
(VT) of all transistors in every TPC. We evaluate
1000 samples for every possible BL/BLB state (S0 to S8) and
determine the spread in the final bitline voltages (VBL/VBLB).
Figure 16 shows the histogram of the obtained VBL voltages of
all possible states across these random samples. As mentioned
in section III-B, the state Si represents n = i, where ‘n’ is the
ADC Output. We can observe in the figure that some of the
neighboring histograms slightly overlap, while the others do
not. For example, the histograms S7 and S8 overlap but S1 and
S2 do not. The overlapping areas in the figure represent the
samples that will result in sensing errors (SEs). However, the
overlapping areas are very small, indicating that the probability
of the sensing error (PSE) is extremely low. Further, the
sensing errors depend on ‘n’, and we represent this dependency
as the conditional sensing error probability [PSE(SE/n)]. It is
also worth mentioning that the error magnitude is always ±1,
as only the adjacent histograms overlap.
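The sampling procedure described above can be mimicked with a toy model. This is not the SPICE-level Monte-Carlo simulation used in the evaluation: it assumes each discharging cell contributes an independent Gaussian amount with mean ∆, and it applies the variation to the discharge directly instead of to VT. The values `delta = 1.0` and `sigma_frac = 0.05`, and the function names, are hypothetical.

```python
import random

MAX_STATE = 8  # states S0..S8, as in the text (nmax = 8)


def sense(discharge: float, delta: float) -> int:
    """Map a bitline discharge to the nearest state index, clamped to S0..S8."""
    return min(MAX_STATE, max(0, round(discharge / delta)))


def estimate_p_se(n: int, samples: int = 1000, delta: float = 1.0,
                  sigma_frac: float = 0.05, seed: int = 0) -> float:
    """Fraction of samples in which state S_n is sensed as a different state.

    Toy assumption: each of the n discharging cells contributes
    delta * N(1, sigma_frac^2), independently of the others.
    """
    rng = random.Random(seed)
    errors = 0
    for _ in range(samples):
        discharge = sum(rng.gauss(delta, sigma_frac * delta) for _ in range(n))
        if sense(discharge, delta) != n:
            errors += 1
    return errors / samples


for n in range(MAX_STATE + 1):
    print(f"S{n}: estimated sensing error rate {estimate_p_se(n):.4f}")
```

In this toy model the spread of the total discharge grows with n, so higher states sit closer to the decision boundaries of their neighbours. That matches the qualitative observation that the S7 and S8 histograms overlap while S1 and S2 do not.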
$$P_E = \sum_{n=0}^{8} P_{SE}(SE/n) \cdot P_n \qquad (1)$$
Fig. 17: Error probability during vector-matrix
multiplications
Equation 1 details the probability (PE) of error in the
ternary vector-matrix multiplications executed using TiM tiles,
where PSE(SE/n) and Pn are the conditional sensing error
probability and the occurrence probability of the state Sn
(ADC-Out = n), respectively. Figure 17 shows the values of
PSE(SE/n), Pn, and their product (PSE(SE/n)*Pn) for
each n. PSE(SE/n) is obtained using the Monte-Carlo simu-
lation (described above), and Pn is computed using the traces
of the partial sums obtained from sample ternary DNNs [9],
[11]. As shown in Figure 17, Pn is maximum at n=1 and
drastically decreases with higher values of n. In contrast,
PSE(SE/n) shows an opposite trend, wherein the probability
of sensing error is higher for larger n. Therefore, we find the
product PSE(SE/n)*Pn to be quite small across all values of
n. In our evaluation, the PE is found to be 1.5×10^-4, reflecting
an extremely low probability of error. In other words, we have
roughly 2 errors of magnitude (±1) for every 10K ternary
vector matrix multiplications executed using TiM-DNN. In our
experiments, we found that PE = 1.5×10^-4 has no impact on
the application level accuracy. We note that this is due to the
low probability and magnitude of error as well as the ability
of DNNs to tolerate errors in their computations [3].
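Equation 1 and the resulting error count can be worked through in a few lines. In this sketch the per-state arrays are placeholders chosen only to follow the described trends (Pn peaking at n = 1, PSE(SE/n) rising with n). They are not the values plotted in Figure 17, and the function names are hypothetical. Only PE = 1.5×10^-4 and the 10K operation count come from the text.

```python
def error_probability(p_se_given_n, p_n):
    """Equation 1: sum over states of P_SE(SE/n) * P_n."""
    if len(p_se_given_n) != len(p_n):
        raise ValueError("both sequences must cover the same states")
    return sum(p_se * p for p_se, p in zip(p_se_given_n, p_n))


def expected_errors(p_e: float, num_ops: int) -> float:
    """Expected number of erroneous operations out of num_ops."""
    return p_e * num_ops


# Placeholder distributions for n = 0..8 (NOT the paper's measured values).
p_n_placeholder = [0.30, 0.35, 0.18, 0.09, 0.04, 0.02, 0.01, 0.006, 0.004]
p_se_placeholder = [0.0, 0.0, 0.0, 1e-5, 1e-4, 5e-4, 1e-3, 5e-3, 1e-2]
print(f"placeholder P_E = {error_probability(p_se_placeholder, p_n_placeholder):.2e}")

# Reported value from the text: P_E = 1.5e-4 over 10K operations.
print(f"expected errors per 10K ops: {expected_errors(1.5e-4, 10_000):.2f}")
```

With the reported PE, the expectation is about 1.5 errors per 10K operations, which the text rounds to roughly 2.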
VI. CONCLUSION
Ternary DNNs are extremely promising due to their ability
to achieve accuracy similar to full-precision networks on
complex machine learning tasks, while enabling DNN infer-
ence at low energy. In this work, we present TiM-DNN, an
in-memory accelerator for executing state-of-the-art ternary
DNNs. TiM-DNN is a programmable accelerator designed using
TiM tiles, i.e., specialized memory arrays for realizing
massively parallel signed vector-matrix multiplications with
ternary values. TiM tiles consist of a new Ternary Processing
Cell (TPC) that functions as both a ternary storage unit and
a scalar multiplication unit. We evaluate an embodiment of
TiM-DNN with 32 TiM tiles and demonstrate that TiM-DNN
achieves significant energy and performance improvements
over a well-optimized near-memory accelerator baseline.
REFERENCES
[1] C. Metz. Google, Facebook and Microsoft are remaking
themselves around AI. https://wired.com/2016/11/google-facebook-
microsoft-remaking-around-ai/ . Online. Accessed Sept. 17, 2017.
[2] S. Venkataramani, K. Roy, and A. Raghunathan. Efficient embedded
learning for IoT devices. In Proc. ASP-DAC, pages 308–311, Jan 2016.
[3] Swagath Venkataramani, Ashish Ranjan, Kaushik Roy, and Anand
Raghunathan. AxNN: Energy-efficient neuromorphic systems using
approximate computing. In Proceedings of the 2014 International
Symposium on Low Power Electronics and Design, ISLPED ’14, pages
27–32, New York, NY, USA, 2014. ACM.
[4] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali
Farhadi. XNOR-Net: ImageNet Classification Using Binary Convolu-
tional Neural Networks. CoRR, abs/1603.05279, 2016.
[5] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Bina-
ryConnect: Training Deep Neural Networks with binary weights during
propagations. CoRR, abs/1511.00363, 2015.
[6] Shuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu, and
Yuheng Zou. DoReFa-Net: Training Low Bitwidth Convolutional Neural
Networks with Low Bitwidth Gradients. CoRR, abs/1606.06160, 2016.
[7] Z. Lin, M. Courbariaux, R. Memisevic, Y. Bengio. Neural Networks
with Few Multiplications. CoRR, abs, 2015.
[8] Chenzhuo Zhu, Song Han, Huizi Mao, and William J. Dally. Trained
ternary quantization. CoRR, abs/1612.01064, 2016.
[9] Asit K. Mishra, Eriko Nurvitadhi, Jeffrey J. Cook, and Debbie Marr.
WRPN: wide reduced-precision networks. CoRR, abs/1709.01134, 2017.
[10] Hande Alemdar, Nicholas Caldwell, Vincent Leroy, Adrien Prost-
Boucle, and Frédéric Pétrot. Ternary neural networks for resource-
efficient AI applications. CoRR, abs/1609.00222, 2016.
[11] Peiqi Wang, Xinfeng Xie, Lei Deng, Guoqi Li, Dongsheng Wang, and
Yuan Xie. HitNet: Hybrid ternary recurrent neural network. In Advances
in Neural Information Processing Systems 31, pages 604–614. Curran
Associates, Inc., 2018.
[12] Naveen Mellempudi, Abhisek Kundu, Dheevatsa Mudigere, Dipankar
Das, Bharat Kaul, and Pradeep Dubey. Ternary neural networks with
fine-grained quantization. CoRR, abs/1705.01462, 2017.
[13] Qinyao He, He Wen, Shuchang Zhou, Yuxin Wu, Cong Yao, Xinyu
Zhou, and Yuheng Zou. Effective quantization methods for recurrent
neural networks. CoRR, abs/1611.10176, 2016.
[14] Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen
Chuang, Vijayalakshmi Srinivasan, and Kailash Gopalakrishnan. PACT:
parameterized clipping activation for quantized neural networks. CoRR,
abs/1805.06085, 2018.
[15] NVIDIA Tesla V100 Tensor Core GPU. https://www.nvidia.com/en-
us/data-center/tesla-v100/. Online. Accessed March 15, 2019.
[16] Google Edge TPU. https://cloud.google.com/edge-tpu/ . Online. Ac-
cessed March 15, 2019.
[17] S. Jain, S. Venkataramani, V. Srinivasan, J. Choi, P. Chuang, and
L. Chang. Compensated-DNN: Energy efficient low-precision deep
neural networks by compensating quantization errors. In 2018 55th
ACM/ESDA/IEEE Design Automation Conference (DAC), pages 1–6,
June 2018.
[18] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev
Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla,
Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large
Scale Visual Recognition Challenge. International Journal of Computer
Vision (IJCV), 115(3):211–252, 2015.
[19] Ann Taylor, Mitchell Marcus, and Beatrice Santorini. The Penn Tree-
bank: An Overview, pages 5–22. Springer Netherlands, Dordrecht, 2003.
[20] X. Sun, S. Yin, X. Peng, R. Liu, J. Seo, and S. Yu. XNOR-RRAM:
A scalable and parallel resistive synaptic architecture for binary neural
networks. In 2018 Design, Automation Test in Europe Conference
Exhibition (DATE), pages 1423–1428, March 2018.
[21] X. Sun, X. Peng, P. Chen, R. Liu, J. Seo, and S. Yu. Fully parallel
RRAM synaptic array for implementing binary neural network with (+1,
-1) weights and (+1, 0) neurons. In 2018 23rd Asia and South Pacific
Design Automation Conference (ASP-DAC), pages 574–579, Jan 2018.
[22] X. Liu, M. Mao, B. Liu, H. Li, Y. Chen, B. Li, Yu Wang, Hao Jiang,
M. Barnell, Qing Wu, and Jianhua Yang. RENO: A high-efficient
reconfigurable neuromorphic computing accelerator design. In 2015
52nd ACM/EDAC/IEEE Design Automation Conference (DAC), pages
1–6, June 2015.
[23] Ping Chi, Shuangchen Li, Cong Xu, Tao Zhang, Jishen Zhao, Yongpan
Liu, Yu Wang, and Yuan Xie. PRIME: A Novel Processing-in-memory
Architecture for Neural Network Computation in ReRAM-based Main
Memory. In Proceedings of the 43rd International Symposium on
Computer Architecture, ISCA ’16, pages 27–39, Piscataway, NJ, USA,
2016. IEEE Press.
[24] Shankar Ganesh Ramasubramanian, Rangharajan Venkatesan, Mrigank
Sharad, Kaushik Roy, and Anand Raghunathan. Spindle: Spintronic deep
learning engine for large-scale neuromorphic computing. In Proceedings
of the 2014 international symposium on Low power electronics and
design, pages 15–20. ACM, 2014.
[25] Aayush Ankit, Izzat El Hajj, Sai Rahul Chalamalasetti, Geoffrey Ndu,
Martin Foltin, R. Stanley Williams, Paolo Faraboschi, Wen-mei Hwu,
John Paul Strachan, Kaushik Roy, and Dejan Milojicic. PUMA: A
programmable ultra-efficient memristor-based accelerator for machine
learning inference. In International Conference on Architectural Support
for Programming Languages and Operating Systems (ASPLOS), 2019.
[26] J. Zhang, Z. Wang, and N. Verma. In-Memory Computation of a
Machine-Learning Classifier in a Standard 6T SRAM Array. IEEE
Journal of Solid-State Circuits, 52(4):915–924, April 2017.
[27] A. Biswas and A. P. Chandrakasan. Conv-RAM: An energy-efficient
SRAM with embedded convolution computation for low-power CNN-
based machine learning applications. In 2018 IEEE International Solid
- State Circuits Conference - (ISSCC), pages 488–490, Feb 2018.
[28] Rui Liu, Xiaochen Peng, Xiaoyu Sun, Win-San Khwa, Xin Si, Jia-Jing
Chen, Jia-Fang Li, Meng-Fan Chang, and Shimeng Yu. Parallelizing
SRAM arrays with customized bit-cell for binary neural networks. In
Proceedings of the 55th Annual Design Automation Conference, pages
21:1–21:6, New York, NY, USA, 2018. ACM.
[29] Z. Jiang, S. Yin, M. Seok, and J. Seo. XNOR-SRAM: In-Memory
Computing SRAM Macro for Binary/Ternary Deep Neural Networks.
In 2018 IEEE Symposium on VLSI Technology, pages 173–174, June
2018.
[30] Amogh Agrawal, Akhilesh Jaiswal, Bing Han, Gopalakrishnan Srini-
vasan, and Kaushik Roy. Xcel-RAM: Accelerating Binary Neu-
ral Networks in High-Throughput SRAM Compute Arrays. CoRR,
abs/1807.00343, 2018.
[31] L. Xia, B. Li, T. Tang, P. Gu, X. Yin, W. Huangfu, P. Y. Chen, S. Yu,
Y. Cao, Y. Wang, Y. Xie, and H. Yang. MNSIM: Simulation platform
for memristor-based neuromorphic computing system. In 2016 Design,
Automation Test in Europe Conference Exhibition (DATE), pages 469–
474, March 2016.
[32] Shubham Jain, Abhronil Sengupta, Kaushik Roy, and Anand Raghu-
nathan. RxNN: A Framework for Evaluating Deep Neural Networks on
Resistive Crossbars. CoRR, abs/1809.00072, 2018.
[33] K. Ando, K. Ueyoshi, K. Orimo, H. Yonekawa, S. Sato, H. Nakahara,
M. Ikebe, T. Asai, S. Takamaeda-Yamazaki, T. Kuroda, and M. Mo-
tomura. Brein memory: A 13-layer 4.2 k neuron/0.8 m synapse bi-
nary/ternary reconfigurable in-memory deep neural network accelerator
in 65 nm CMOS. In 2017 Symposium on VLSI Circuits, pages C24–C25,
June 2017.
[34] Charles Eckert, Xiaowei Wang, Jingcheng Wang, Arun Subramaniyan,
Ravi Iyer, Dennis Sylvester, David Blaauw, and Reetuparna Das. Neural
cache: Bit-serial in-cache acceleration of deep neural networks. In
Proceedings of the 45th Annual International Symposium on Computer
Architecture, ISCA ’18, pages 383–396, Piscataway, NJ, USA, 2018.
IEEE Press.
[35] Akhilesh Jaiswal, Indranil Chakraborty, Amogh Agrawal, and Kaushik
Roy. 8T SRAM cell as a multi-bit dot product engine for beyond von-
Neumann computing. CoRR, abs/1802.08601, 2018.
[36] Mingu Kang, Sujan Gonugondla, Min-Sun Keel, and Naresh
R. Shanbhag. An energy-efficient memory-based high-throughput VLSI
architecture for convolutional networks. 2015:1037–1041, 08 2015.
[37] S. Jain, A. Ranjan, K. Roy, and A. Raghunathan. Computing in Memory
With Spin-Transfer Torque Magnetic RAM. IEEE Transactions on Very
Large Scale Integration (VLSI) Systems, 26(3):470–483, March 2018.
[38] D. Kim, J. Kung, S. Chai, S. Yalamanchili, and S. Mukhopadhyay.
Neurocube: A Programmable Digital Neuromorphic Architecture with
High-Density 3D Memory. In 2016 ACM/IEEE 43rd Annual Interna-
tional Symposium on Computer Architecture (ISCA), pages 380–392,
June 2016.
[39] K. J. Kuhn, M. D. Giles, D. Becher, P. Kolar, A. Kornfeld, R. Kotlyar,
S. T. Ma, A. Maheshwari, and S. Mudanai. Process technology variation.
IEEE Transactions on Electron Devices, 58(8):2197–2208, Aug 2011.
Shubham Jain is currently a PhD student in the
School of Electrical and Computer Engineering,
Purdue University. His primary research interests
include exploring the circuit and architectural tech-
niques for emerging post-CMOS devices, in-memory
computing, approximate computing, and energy-
efficient hardware architecture for deep learning. He
received a B.Tech (Hons.) degree in Electronics and Electrical
Communication Engineering from the Indian
Institute of Technology, Kharagpur, India, in 2012.
Previously, he worked as a design engineer in the
Bangalore Design Center, Qualcomm, Bangalore, India from 2012 to 2014. He
also worked as a summer intern at IBM T.J Watson Research Center, Yorktown
Heights, in 2017 and 2018. He has received the Mitacs Globalink scholarship
from Mitacs, in 2011, the Andrews Fellowship from Purdue University, in
2014, and the A. Richard Newton Young Student Fellowship from DAC in
2015. His research has received the best technical paper award in DAC 2018,
and a best-in session award in TECHCON 2016.
Sumeet Kumar Gupta received his Ph.D. degree from the School of Electrical
and Computer Engineering, Purdue University, West
Lafayette IN in 2012. He is currently an Assistant
Professor of Electrical and Computer Engineering at
Purdue University. Prior to this, he was an Assistant
Professor of Electrical Engineering at The Pennsylvania
State University from 2014 to 2017 and a
Senior Engineer at Qualcomm Inc. in San Diego
CA from 2012 to 2014, where he developed circuit
design techniques and methodologies for analysis
and benchmarking of standard cells. His research interests include low
power variation-aware VLSI circuit design, neuromorphic computing, memory
design, and in-memory computing, nano-electronics and spintronics, device-
circuit co-design and nano-scale device modeling. He has published over 90
articles in refereed journals and conferences. He was the recipient of DARPA
Young Faculty Award in 2016, an Early Career Professorship by Penn State
in 2014, the 6th TSMC Outstanding Student Research Bronze Award in 2012,
Magoon Award and the Outstanding Teaching Assistant Award from Purdue
University in 2007 and Intel Ph.D. Fellowship in 2009.
Anand Raghunathan is a Professor of Electrical
and Computer Engineering and Chair of the VLSI
area at Purdue University, where he directs research
in the Integrated Systems Laboratory. His current
areas of research include domain-specific architec-
ture, system-on-chip design, computing with post-
CMOS devices, and heterogeneous parallel com-
puting. Previously, he was a Senior Research Staff
Member at NEC Laboratories America, where he led
projects on system-on-chip architecture and design
methodology. He has also held the Gopalakrishnan
Visiting Chair in the Department of Computer Science and Engineering at the
Indian Institute of Technology, Madras.
Prof. Raghunathan has co-authored a book, eight book chapters, and over
200 refereed journal and conference papers, and holds 21 U.S. patents. His
publications received eight best paper awards and five best paper nominations.
He received a Patent of the Year Award and two Technology Commer-
cialization Awards from NEC, and was chosen among the MIT TR35 (top
35 innovators under 35 years across various disciplines of science and
technology) in 2006.
Prof. Raghunathan has been a member of the technical program and
organizing committees of several leading conferences and workshops, chaired
premier IEEE/ACM conferences (CASES, ISLPED, VTS, and VLSI Design),
and served on the editorial boards of various IEEE and ACM journals in
his areas of interest. He received the IEEE Meritorious Service Award and
Outstanding Service Award. He is a Fellow of the IEEE and Golden Core
Member of the IEEE Computer Society. Prof. Raghunathan received the B.
Tech. degree from the Indian Institute of Technology, Madras, and the M.A.
and Ph.D. degrees from Princeton University.
