TraNNsformer: Neural network transformation for memristive crossbar
  based neuromorphic system design by Ankit, Aayush et al.
TraNNsformer: Neural Network Transformation for
Memristive Crossbar based Neuromorphic System Design
Aayush Ankit, Abhronil Sengupta, Kaushik Roy
School of Electrical and Computer Engineering, Purdue University
{aankit, asengup, kaushik}@purdue.edu
Abstract—Implementation of Neuromorphic Systems using
post Complementary Metal-Oxide-Semiconductor (CMOS) tech-
nology based Memristive Crossbar Array (MCA) has emerged as
a promising solution to enable low-power acceleration of neural
networks. However, the recent trend to design Deep Neural
Networks (DNNs) for achieving human-like cognitive abilities
poses significant challenges towards the scalable design of neu-
romorphic systems (due to the increase in computation/storage
demands). Network pruning [7] is a powerful technique to
remove redundant connections for designing optimally connected
(maximally sparse) DNNs. However, such pruning techniques
induce irregular connections that are incompatible with the crossbar
structure. Consequently, they produce DNNs with highly inefficient
hardware realizations (in terms of area and energy). In this work,
we propose TraNNsformer - an integrated training framework
that transforms DNNs to enable their efficient realization on
MCA-based systems. TraNNsformer first prunes the connectivity
matrix while forming clusters with the remaining connections.
Subsequently, it retrains the network to fine tune the connections
and reinforce the clusters. This is done iteratively to transform
the original connectivity into an optimally pruned and maximally
clustered mapping. We evaluated the proposed framework by
transforming different Multi-Layer Perceptron (MLP) based
Spiking Neural Networks (SNNs) on a wide range of datasets
(MNIST, SVHN and CIFAR10) and executing them on MCA-
based systems to analyze the area and energy benefits. Without
accuracy loss, TraNNsformer reduces the area (energy) consump-
tion by 28% - 55% (49% - 67%) with respect to the original
network. Compared to network pruning, TraNNsformer achieves
28% - 49% (15% - 29%) area (energy) savings. Furthermore,
TraNNsformer is a technology-aware framework that allows
mapping a given DNN to any MCA size permissible by the
memristive technology for reliable operations.
Index Terms—Neuromorphic Computing, Energy-Efficiency,
Memristive Crossbars, Sparsity, Computer-aided Design
I. INTRODUCTION AND RELATED WORK
The quest to reverse-engineer the powerful cognitive capa-
bilities of the human brain inspired the research in artificial
neural networks. Although the precise structure and functioning
of the human brain are not yet clearly understood,
there is sufficient evidence that suggests the hierarchical
organization of neurons (cells) in the brain [5]. Deep Neural
Networks (DNNs), inspired from this hierarchy of neurons
and synapses in the human brain, have achieved outstanding
classification accuracies across myriad cognitive applications
including computer vision [12], speech recognition and
natural language processing [14]. This success has led to their
ubiquitous presence in recognition applications across a
wide variety of computational platforms that we use in our
day-to-day life. For example, Siri, Google Now and Cortana are
intelligent personal assistants developed by Apple, Google and
Microsoft, respectively, and run DNNs to recognize external
inputs (image, voice etc.).
The work was supported in part by the Center for Spintronic Materials,
Interfaces, and Novel Architectures (C-SPIN), a MARCO and DARPA sponsored
StarNet center, by the Semiconductor Research Corporation, the National
Science Foundation, Intel Corporation and by the Vannevar Bush Faculty
Fellowship.
The tremendous improvement in DNN performance has
been possible due to the surge in the scale of the DNNs. LeCun
et al. classified handwritten digits in 1998 with less than
1M synapses [13]. Krizhevsky et al. proposed AlexNet and
won the ImageNet challenge in 2012 by using 650k neurons
and 60M synapses [12]. In 2014, Simonyan et al. developed
VGGNet using 138M synapses [23]. Karpathy et al. proposed
a DNN to convert image to natural language employing 130M
Convolutional Neural Network (CNN) synapses and 100M
Recurrent Neural Network (RNN) synapses [10].
DNNs have powerful cognitive abilities but involve data-
intensive computations leading to high power and memory
bandwidth requirements. Larger DNNs require larger memory
size to fit the model (weights) as well as large number of data
movements between memory and the computation core. Con-
sequently, their energy and resource requirements impede their
deployment in low-power and resource constrained platforms.
Furthermore, their energy consumption is orders of magnitude
greater than that of the human brain. For example, AlexNet requires
2-4 GOPS (60M synaptic weights) of compute power (memory
storage) per inference. These power and memory
bottlenecks posed by DNN acceleration on von-Neumann
machines motivated the research on Neuromorphic Computing
(NC) systems.
NC systems are based on device and circuit realizations
intended to mimic the functionality of neurons and synapses in
the brain. However, their CMOS implementations have been
shown to suffer from inefficiencies that stem from inherent
mismatch between the NC building blocks (synapse and
neuron) and CMOS primitives (instructions, Boolean logic).
Consequently, the CMOS realization of synapse functionality
requires dozens of transistors to mimic a single synapse [1]. To
this effect, device, circuit and architecture level neuromorphic
designs have been explored using emerging memristive tech-
nologies [2], [8], [21]. Memristors are programmable resistors
and can encode the synaptic weights of the DNN. An MCA is
a crossbar with memristive devices at its cross-points, which
receives voltage inputs (at its rows) and produces an output
current (at its columns) that is the weighted summation of the
encoded weights at that column and the input voltage. This
is a direct consequence of Kirchhoff’s current law: the current
contributed to a column by any cross-point is the product
of the conductance at that cross-point and the voltage across
it, and these currents add along the column. Thus, an MCA is an
analog computation unit that performs
highly area and energy efficient inner-product operations [19].
The MCA outputs are interfaced with neurons to produce
the neuron output. Additionally, an MCA stores the weights,
thereby enabling in-memory computing, which enhances the
energy-efficiency by circumventing the energy-hungry data
movements between the memory and the computation core.
Consequently, MCAs have been aggressively harnessed for
energy-efficient DNN acceleration [2], [4], [22].
arXiv:1708.07949v2 [cs.ET] 4 Mar 2018
Despite the success in efficient NC system designs, the
increasing scale of DNNs impedes their utility in resource-constrained
and low power systems. This is because the DNN
topology (number of layers and neurons in each layer) is
fixed before the training process starts, thereby removing the
avenues of network structure optimization. Previous research
has shown network pruning as an efficient technique for
dynamically learning a DNN’s structure (connections) during
the training process [7]. This produces highly sparse DNN
structures, which significantly reduce the storage (memory),
and computations required by the DNN, thereby making them
suitable for lower power realizations. However, such algorith-
mic approaches are not coherent with memristive technology
due to the rigid structure of an MCA. MCAs physically map
the synapses (connections) in the DNN as well as act as the
computation units. Pruning a synapse mapped onto an MCA
merely makes the cross-point unused. This is because each
cross-point physically serves as a possible connection channel
between a specific input neuron (mapped onto the row) and an
output neuron (mapped onto the column). Hence, despite being
energy-efficient inner-product engines, the structural rigidity
of an MCA degrades its utility for accelerating sparser DNNs.
Consequently, realizations of such sparse DNNs onto MCA
based architectures would result in highly area-inefficient
designs. Typical MCA based architectures use peripherals
namely buffers, communication and control logic to integrate
the MCAs in order to facilitate acceleration of DNNs with
varying topologies. Increasing the sparsity deteriorates the
crossbar utilization, which results in a peripheral dominated
energy profile. As a result, network sparsity does not translate
to commensurate area and energy savings in an MCA based
architecture. In this work, we propose TraNNsformer - an inte-
grated training framework for dynamically learning clustered
connections during training. This is motivated by the observa-
tion that pruning a DNN at the MCA granularity, instead of
a synapse granularity, can preserve the benefits of algorithmic
pruning and network sparsity at the hardware level, in MCA
based neuromorphic systems. Our approach efficiently prunes
the DNN by dynamically making pruning decisions from
accuracy perspective (removing unnecessary connections to
maximize sparsity) as well as clustering perspective (removing
unclustered synapses to maximize MCA utilization) thereby
producing trained DNNs that can efficiently utilize the benefits
of post-CMOS memristive technology. Further, our proposed
transformation is technology agnostic as it enables mapping
of connectivity structures using any MCA size permitted by
the memristive technology for reliable operations.
Fig. 1: (a) A two-layered MLP based Neural Network (b) Neural
Network mapped to Memristive Crossbar Array (MCA). In (b),
Vi = Xi, Gi,j = Wi,j and Ii = ΣXjWi,j.
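The crossbar mapping of Fig. 1(b) can be illustrated numerically. The following is a minimal sketch, assuming ideal devices (no wire resistance or device variation); the weight and input values are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical 3-input, 2-neuron layer (values are illustrative only).
# W[j, i] is the synapse from input j (crossbar row) to neuron i (crossbar column).
W = np.array([[0.2, 0.5],
              [0.1, 0.4],
              [0.7, 0.3]])
X = np.array([1.0, 0.0, 1.0])

G = W  # conductances encode the weights (G = W)
V = X  # row voltages encode the inputs (V = X)

# Each cross-point contributes V[j] * G[j, i]; the column wire sums these
# contributions, so the column currents are a vector-matrix product.
I = V @ G
print(I)
```

Each column current equals the weighted sum that the corresponding neuron would receive in the software network, which is why a single crossbar read replaces a full row of multiply-accumulate operations.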
This work focuses on efficient transformations for Fully
Connected (FC) layers. The majority of image processing and
computer vision applications run CNNs, which are comprised
of several convolution layers followed by a few FC layers.
However, more than 96% of the synapses are in the FC
layers [6]. Convolution layers are inherently sparse as each
output neuron receives inputs from a fixed receptive field
(not all the input neurons) equal to the kernel size. Further,
this sparsity has a definite structure as these connections are
known beforehand thereby allowing their efficient realizations
on MCA structures. Additionally, data reuse (weight sharing)
in convolution layers allows MCA reuse thereby allowing a
sub-linear scaling of the neuromorphic system with respect
to the DNN size. In contrast, the FC layers have no data
reuse. Although data batching helps to mitigate this while
training, it is unsuitable for real-time applications with latency
constraints [6]. Further, pruning the FC layers
results in an irregular removal of connections [3], as it
is primarily driven by accuracy constraints. FC layers are
also widely used in different types of DNNs namely MLP,
RNN, Long Short Term Memory (LSTM) [24] and SNN [5].
Hence, efficient transformation on FC layers to remove the
unnecessary connections along with preserving a clustered
connectivity structure is pivotal to efficient realizations of
DNNs on MCA based architectures.
Prior work on learning structured connectivity in DNNs
during the training process has focused on transformations
for CMOS based architectures (Graphics Processing Unit)
[3], [25]. While post-CMOS based MCA architectures were
explored in [26], that work focused on clustering the synapses after
the training process finishes. Our work differs from
prior work as it proposes an integrated training framework to
maximize the benefits from post-CMOS MCA based systems
by designing DNN transformations, which conform to MCA’s
structural rigidity. Additionally, we also show that offline
clustering (clustering after training) is unable to preserve the
benefits of sparsity at the hardware level.
In summary, this work makes the following contributions:
1) We present a Size-Constrained Iterative Clustering
(SCIC) algorithm to enable efficient mapping of FC
based connectivity matrices on MCA sizes permissible
by the underlying technology.
2) We propose TraNNsformer - an Integrated Training
Framework harnessing the SCIC algorithm to enable
energy and area efficient design of DNNs for MCA
based architectures.
3) We evaluate the proposed methodology on a wide
range of benchmarks namely Digit Recognition, House
Number Recognition and Object Classification using
different MLP based SNNs.
4) We analyze the resulting energy, area benefits of
TraNNsformer for post-CMOS crossbar based architectures
and CMOS based general-purpose architectures.
Fig. 2: (a) Logical Flow Diagram of TraNNsformer Framework. The original DNN architecture during training undergoes clustering
to form regions that can be mapped onto MCAs with high utilization factors, while pruning the connections that do not contribute to
cluster formation. (b) Toy example to illustrate the impact of Network pruning and TraNNsformer on a DNN connectivity matrix.
Network pruning leads to irregular sparsity that cannot be mapped directly onto MCAs. TraNNsformer forms smaller clusters that
can be mapped onto MCAs. Note that 1/0 only represents a connection being present and not the actual value of the weight.
II. NEURAL NETWORK BASICS
Neural Networks are a class of machine learning algorithms
that are comprised of multiple layers of neurons (activations)
interconnected with synapses (weights). MLPs are a class of
neural networks with fully connected topology i.e. each neuron
in a layer receives inputs from all the neurons in the previous
layer. A neuron receives inputs that are modulated based on the
synapse connecting the input and output neuron. Subsequently,
it performs a non-linear computation on the received inputs
to produce the output, which is sent to the neurons in the
successive layers. Fig. 1(a) shows a two-layered MLP topology
being mapped onto an MCA (shown in Fig. 1(b)).
SNNs are regarded as the third generation of neural networks.
SNNs require the input to be encoded as spike trains and involve
spike-based (0/1) information transfer between neurons.
At each time instant, spikes are propagated through the
layers of the network while the neurons accumulate the spikes
over time, which eventually causes a neuron to fire (spike).
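The accumulate-and-fire behaviour described above can be sketched as follows. This is a minimal illustration, not the neuron model used in the paper: the threshold value `v_th`, the reset-to-zero rule and the input values are all assumptions made for the example.

```python
def integrate_and_fire(weighted_inputs, v_th=1.0):
    """Accumulate weighted input per time step; emit 1 when the membrane
    potential reaches v_th, then reset (reset-to-zero is an assumption)."""
    v_mem = 0.0
    spikes = []
    for x in weighted_inputs:
        v_mem += x
        if v_mem >= v_th:
            spikes.append(1)
            v_mem = 0.0
        else:
            spikes.append(0)
    return spikes

# Hypothetical weighted inputs arriving over five time steps.
spikes = integrate_and_fire([0.5, 0.25, 0.5, 0.25, 0.25])
print(spikes)
```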
III. TRANNSFORMER FRAMEWORK
In this section, we discuss in detail the TraNNsformer
framework (shown in Fig. 2(a)), its effect on DNN sparsity
(shown in Fig. 2(b)) and the resulting benefits on two types of
architectures: (1) MCA based architectures and (2) CMOS based
general-purpose architectures. Subsection III-A describes the Size
Constrained Iterative Clustering (SCIC) algorithm, which converts a
DNN’s connectivity structure into a set of high utilization
clusters that can be mapped onto MCAs. Subsection III-B details
TraNNsformer, the Integrated Training approach
built on SCIC to transform the DNN connectivity matrix into
an optimally sparse and maximally clustered structure during
the training process. Subsection III-C discusses the benefits of
TraNNsformer on area and energy consumption for MCA
based architectures. Subsection III-D discusses the impact of
TraNNsformer for CMOS based general-purpose architectures.
A. Size Constrained Iterative Clustering (SCIC)
Size constrained Iterative Clustering is an adaptation of the
Spectral Clustering algorithm [17] that generates clusters from
a connectivity matrix. Connectivity matrix (C) is a (0, 1)-
matrix that represents the morphology of a layer in the DNN
such that a value Cij being one corresponds to a non-zero
synapse between the “ith” input neuron and “jth” output neuron.
Utilization factor is the fraction of used (mapped) cross-points
in an MCA. A zero value in a cluster would result in an unused
cross-point (if the cluster is mapped to MCA).
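To make the utilization factor concrete, the sketch below counts the MCAs needed for the three toy connectivity matrices of Fig. 2(b) and reports the utilization of each. It assumes a naive fixed-grid tiling of the matrix into crossbar-sized blocks, which is a simplification introduced here for illustration and not the mapping procedure of the paper; the helper name `mca_tiles` is hypothetical.

```python
import numpy as np

def mca_tiles(C, size):
    """Tile C into size x size blocks and return the utilization factor of
    every block that holds at least one synapse (i.e. needs an MCA)."""
    C = np.asarray(C)
    utils = []
    for r in range(0, C.shape[0], size):
        for c in range(0, C.shape[1], size):
            tile = C[r:r + size, c:c + size]
            if tile.any():
                utils.append(float(tile.sum()) / (size * size))
    return utils

original = np.ones((4, 4), dtype=int)
pruned = np.array([[0, 1, 0, 1],
                   [1, 0, 1, 0],
                   [0, 1, 0, 1],
                   [1, 0, 1, 0]])
clustered = np.array([[1, 1, 0, 0],
                      [1, 1, 0, 0],
                      [0, 0, 1, 1],
                      [0, 0, 1, 1]])

results = {name: mca_tiles(C, 2) for name, C in
           [("original", original), ("pruned", pruned), ("clustered", clustered)]}
for name, utils in results.items():
    print(name, "MCAs:", len(utils), "utilization:", utils)
```

Under this tiling the irregularly pruned matrix still occupies four 2×2 MCAs at half utilization, whereas the clustered matrix with the same number of synapses needs only two fully utilized MCAs.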
Algorithm 1 Spectral Clustering (SC)
Input: Similarity matrix S ∈ R^(n×n) for the graph (a row in S
corresponds to a graph node), K clusters to construct
1: Compute the degree matrix: D
2: Compute the normalized Laplacian matrix: L
3: Perform an eigenvalue decomposition: L = UΣU^T
4: Extract the K columns of U corresponding to the K
   smallest eigenvalues to form: Ũ
5: Cluster the row vectors of Ũ using the K-means algorithm
Output: K clusters - a row vector of Ũ corresponds to a row of S
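The steps of Algorithm 1 can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions: the symmetric normalized Laplacian I - D^(-1/2) S D^(-1/2) is used, and K-means is a plain Lloyd iteration with a deterministic farthest-first initialization. Both choices, the function name and the example graph are assumptions, not details taken from the paper.

```python
import numpy as np

def spectral_clustering(S, k, n_iter=20):
    """Cluster the nodes of a graph with symmetric similarity matrix S into k groups."""
    S = np.asarray(S, dtype=float)
    # Step 1: degree matrix, stored as the vector of row sums.
    d = S.sum(axis=1)
    # Step 2: normalized Laplacian L = I - D^(-1/2) S D^(-1/2).
    d_inv_sqrt = np.zeros_like(d)
    nz = d > 0
    d_inv_sqrt[nz] = d[nz] ** -0.5
    L = np.eye(len(S)) - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    # Step 3: eigendecomposition (eigh returns eigenvalues in ascending order).
    _, U = np.linalg.eigh(L)
    # Step 4: keep the k eigenvectors with the smallest eigenvalues.
    U_tilde = U[:, :k]
    # Step 5: K-means on the rows of U_tilde (farthest-first initialization).
    centers = [U_tilde[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((U_tilde - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(U_tilde[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = ((U_tilde[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = U_tilde[labels == j].mean(axis=0)
    return labels

# Hypothetical graph: two triangles joined by one weak edge (nodes 2 and 3).
S = np.zeros((6, 6))
for a, b in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    S[a, b] = S[b, a] = 1.0
S[2, 3] = S[3, 2] = 0.1
labels = spectral_clustering(S, 2)
print(labels)
```

The cluster indices themselves are arbitrary; only the grouping of nodes is meaningful.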
Spectral Clustering (SC) is a graph clustering algorithm
that partitions the graph nodes into disjoint sets (clusters) such that intra
(inter) cluster associativity is maximized (minimized). We
adopt spectral clustering to cluster the connectivity matrix
for a DNN layer where the input and output neurons are the
graph nodes and a synapse corresponds to a graph edge. As
shown in Algorithm 1, a symmetric matrix (L) is defined on
the graph (connectivity matrix) using the adjacency matrix
(S) and degree matrix (D). Subsequently, it undergoes an
eigenvalue decomposition followed by dimensionality reduc-
tion. Finally, the remaining row vectors are clustered using
the K-means algorithm. The intra-associativity maximization
produces clusters that can be mapped to MCAs with high
utilization factor. Furthermore, inter-associativity minimization
enables low overhead or high throughput (number of neuron
outputs computed per cycle) integration of MCA currents for
generating a neuron output. This is because the synapses
corresponding to a particular output neuron can be spread
across multiple clusters (mapped across multiple MCAs).
The crossbar output currents from multiple MCAs
are then integrated onto the output neuron in a time-multiplexed
fashion (shown in Fig. 3) to generate the final neuron output.
Hence, minimizing the inter-cluster associativity results in a
commensurate decrease in the inter-MCA interaction, thereby
maximizing the throughput.
Fig. 3: (a) A feed-forward neural network with neuron fan-in of
4 (b) Mapping the 4 fan-in neurons using 2×2 MCAs. The 4×2
feed-forward network is split across two 2×2 MCAs whose outputs are
integrated into the neuron over two time-steps, t1 and t2
(time-multiplexed operation of degree 2).
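The time-multiplexed integration of Fig. 3 can be checked with a short sketch: a 4×2 layer is split across two 2×2 MCAs, and the partial column currents are accumulated over two time-steps. The weights and inputs below are hypothetical, and ideal analog behaviour is assumed.

```python
import numpy as np

# Hypothetical 4-input, 2-neuron layer (fan-in of 4).
W = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0],
              [7.0, 8.0]])
x = np.array([1.0, 0.0, 1.0, 1.0])

# Split the layer across two 2x2 MCAs (inputs 1-2 and inputs 3-4).
mca_a, mca_b = W[:2], W[2:]

acc = np.zeros(2)
acc += x[:2] @ mca_a  # time-step t1: first MCA output integrated into the neurons
acc += x[2:] @ mca_b  # time-step t2: second MCA output integrated into the neurons

print(acc, x @ W)
```

The accumulated value matches the single-shot product, but it costs one time-step per MCA that a neuron's fan-in is spread over, which is why fewer inter-cluster connections translate to higher throughput.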
SCIC is an iterative algorithm that minimizes the number
of unclustered synapses while ensuring cluster generation
with higher utilization factors. As shown in Algorithm 2,
each iteration of SCIC algorithm runs SC on the remaining
connectivity matrix until the reduction in unclustered synapses
ceases. Further, in each iteration, we greedily select high
quality clusters (clusters that map to MCAs with high uti-
lization factor) that meet a specific quality threshold. Other
clusters are ignored and merged with the existing connectivity
matrix to explore new clustering avenues in the subsequent
iterations. The threshold subsequently decays when the formation
of new clusters slows down. Hence, the greedy and
iterative approaches synergistically ensure efficient clustering
to produce high quality clusters. As mentioned before, MCA
sizes are limited by the memristive technology to ensure their
reliable operation. SCIC iteratively breaks down the existing
clusters into sub-clusters until they meet the MCA
size specified by the given technology. Thus, SCIC enables
a technology-aware mapping of connectivity matrices onto
MCAs.
Algorithm 2 Size Constrained Iterative Clustering (SCIC)
Input: Connectivity matrix C ∈ R^(m×n) (m and n are the number of input
and output neurons respectively), crossbar_size, base_util_factor,
min_util_factor
1: Initialize: cluster_set = C, num_clusters = k
2: while Δcluster_set ≠ 0 do
3:   for all cluster in cluster_set do
4:     if sizeof(cluster) > crossbar_size then
5:       clusters_temp ← ∅
6:       util_factor = base_util_factor
7:       while clusters_temp ≡ ∅ do
8:         if util_factor < min_util_factor then
9:           break
10:        else
11:          decay(util_factor)
12:        C̃ = connectivity matrix of cluster
13:        construct similarity matrix S from C̃
14:        clusters_temp = spectral_clustering(S, k)
15:      cluster_set ← cluster_set + clusters_temp
16:  update C to reflect unclustered synapses
Output: cluster_set
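The greedy, threshold-decaying selection of high quality clusters can be sketched in isolation. The sketch below is a simplification of SCIC, not the full algorithm: candidate clusters are supplied by the caller instead of being regenerated by spectral clustering at each step, the threshold decays by a fixed subtractive step, and the names `utilization` and `greedy_select` as well as all numeric values are hypothetical.

```python
import numpy as np

def utilization(C, rows, cols):
    """Fraction of non-zero cross-points if C[rows, cols] were mapped to one MCA."""
    return C[np.ix_(rows, cols)].mean()

def greedy_select(C, candidates, base_util=0.9, min_util=0.5, decay=0.1):
    """Accept candidates meeting the current quality threshold; relax the
    threshold step by step until something is accepted or min_util is passed."""
    util_factor = base_util
    while util_factor >= min_util:
        accepted = [c for c in candidates if utilization(C, *c) >= util_factor]
        if accepted:
            return accepted, util_factor
        util_factor -= decay
    return [], util_factor

C = np.array([[1, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1]])
# Each candidate is (row indices, column indices) of a would-be cluster.
candidates = [([0, 1], [0, 1]), ([2, 3], [2, 3])]
accepted, threshold = greedy_select(C, candidates)
print(accepted, round(threshold, 2))
```

Here the first candidate (utilization 0.75) is accepted once the threshold has decayed to about 0.7, while the second (utilization 0.5) is left for later iterations, mirroring how rejected clusters are returned to the connectivity matrix.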
B. Integrated Training Approach
As shown in Algorithm 3, each training iteration (back-
propagation) is accompanied with pruning followed by SCIC.
Pruning removes the synapses that do not affect the accuracy
of the DNN. The result of pruning is encoded as a “prune map”
which is a (0, 1)-matrix (similar to the connectivity matrix)
where “0s” represent a pruned synapse. “0s” in the prune map
correspond to Accuracy Don’t Cares (ADCs). Subsequently,
clusters produced in SCIC are used to form a “cluster map”
that denotes if a synapse in the connectivity matrix is part of
a previously formed cluster. “0s” in the cluster map represent
Clustering Don’t Cares (CDCs). The union of cluster map and
prune map is masked from being pruned in the subsequent
iterations. The synapses, which belong to both the ADC and
CDC set, are aggressively pruned to maximize pruning without
affecting the cluster quality. The subsequent training iteration
tries to recover the accuracy loss incurred due to pruning.
Although SCIC generates high quality clusters, it leaves a
large fraction of synapses unclustered. This results in large
number of MCAs with low utilization factors being mapped
to the unclustered synapses, thereby diminishing the benefits
of SCIC. Consequently, a training algorithm based on offline
clustering i.e. clustering the synapses at the end of the training
process will suffer from inefficiencies resulting from higher
fraction of unclustered synapses. However, it is interesting to
note that subsequent pruning of the synapses belonging to the
intersection of ADC and CDC set removes several unclustered
synapses. Furthermore, this pruning exposes new avenues for
SCIC to generate high quality clusters from the remaining
unclustered synapses. TraNNsformer is inspired from this
observation to synergistically combine the benefits of network
pruning and SCIC to make the DNNs as sparse as possible
while ensuring the clustered structure to produce MCA mappings
with high utilization factors. Thus, TraNNsformer allows
the DNN structure to be learned dynamically in a clustered way
during the training process in order to produce an optimized
network for MCA based architectures.
It is worth noting that TraNNsformer favors cluster forma-
tion in the beginning to minimize the fraction of unclustered
synapses. Once the unclustered synapses have been signif-
icantly reduced, TraNNsformer initiates cluster pruning (as
shown in Algorithm 3). Cluster pruning incrementally prunes
an entire cluster based on a combined score that quanti-
fies cluster quality and the cluster’s contribution to output
accuracy. Although cluster pruning may lead to accuracy
degradation, subsequent training iteration ensures a graceful
recovery of the lost accuracy. It is worth noting that cluster
pruning does not affect the overall MCA utilization as it
entirely removes the mapped MCA. Thus, cluster pruning
allows achieving higher network sparsity (similar to network
pruning) while ensuring the clustered structure of the DNN’s
connectivity matrix.
Algorithm 3 TraNNsformer
1: Train the connectivity for an epoch
2: if num_unclustered_synapses < threshold then
3:   if training_error < training_error_previous then
4:     cluster_prune()
5: else
6:   if training_error < training_error_previous then
7:     Prune and update prune_map
8:   Run SCIC on unclustered synapses and update cluster_map
9:   connectivity ← (prune_map ∪ cluster_map)
10: Go to 1 if convergence is not reached
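The way the prune map and the cluster map combine into the surviving connectivity can be sketched with boolean masks. The map contents and weight values below are hypothetical; only the union rule and the removal of synapses that are both ADCs and CDCs come from the text.

```python
import numpy as np

# 1 = synapse needed for accuracy, 0 = Accuracy Don't Care (ADC).
prune_map = np.array([[1, 0, 0],
                      [0, 1, 0]])
# 1 = synapse belongs to a formed cluster, 0 = Clustering Don't Care (CDC).
cluster_map = np.array([[1, 1, 0],
                        [0, 0, 0]])

# Union of the two maps: these synapses are protected from pruning.
connectivity = prune_map | cluster_map

# Synapses that are both ADC and CDC fall outside the union and are removed.
weights = np.array([[1, 2, 3],
                    [4, 5, 6]])
masked_weights = weights * connectivity
print(connectivity)
print(masked_weights)
```

Note that the synapse at row 0, column 1 survives even though it is an ADC, because it contributes to a cluster and therefore costs no additional MCA.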
C. Impact on Crossbar based architecture
Crossbar-based architectures for DNN acceleration are com-
prised of computation cores each of which consists of MCAs
and peripherals associated with an MCA namely buffers,
communication and control logic [2]. Hence, the number of
cores (num_core) has a linear dependence on the number of
MCAs (num_mca) as shown in eqn. 1 (where “k” is a micro-architecture
dependent constant). TraNNsformer enables technology-aware
optimization to learn an optimally clustered
network structure such that a learnt cluster can be mapped onto
an MCA with high utilization factor. Consequently, it ensures
that the network sparsity efficiently translates to reduction in
the number of MCAs required to map the transformed DNN
with respect to the original DNN. This results in a commensu-
rate reduction in the number of cores thereby leading to area
savings.
\[ \mathrm{num\_core} = \frac{\mathrm{num\_mca}}{k} \tag{1} \]
The energy profile for an MCA based architecture is comprised
of MCA energy (includes neuron energy) and peripheral
energy components. Hence, the total energy consumption for
a single inference of DNN execution can be defined as shown
in eqn. 2.
\[ \text{Total Energy} = \sum_{i=0}^{\text{all cores}} \text{MCA Energy} + \text{Peripheral Energy} \tag{2} \]
An increase in DNN sparsity results in a corresponding
decrease in the total MCA energy component irrespective of
the connectivity structure. However, the total peripheral energy
component depends on the number of MCAs (num_mca) being
used. As discussed before, network pruning does not lead
to significant reductions in the number of MCAs due to the
irregular nature of sparsity pattern, thereby not affecting the
total peripheral energy. This reduces the overall energy benefits
that can be obtained by highly sparse DNN connectivity struc-
tures. Additionally, the total energy profile becomes peripheral
energy dominated, which would prevent harnessing the energy
benefits from further efficient memristive technologies. On
the contrary, TraNNsformer helps to obtain commensurate
reductions in MCA energy as well as the peripheral energy
component, which in turn results in significant savings in total
energy consumption. Consequently, the energy profile shows a
favorable distribution between the MCA and peripheral energy
components.
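The argument above can be made concrete with a toy energy model built on eqns. 1 and 2. Everything numeric here is hypothetical. The model further assumes that MCA energy scales with the number of remaining synapses, that peripheral energy is a fixed cost per core, and that the core count of eqn. 1 is rounded up; none of these modelling choices are stated in the paper.

```python
import math

def num_cores(num_mca, k):
    # Eqn. 1; rounding up is an added assumption so that a partially
    # filled core still counts as one core.
    return math.ceil(num_mca / k)

def total_energy(num_synapses, num_mca, k, e_synapse, e_peripheral_core):
    mca_energy = num_synapses * e_synapse  # follows sparsity only
    peripheral_energy = num_cores(num_mca, k) * e_peripheral_core  # follows MCA count
    return mca_energy + peripheral_energy

# Toy case from Fig. 2(b): both networks keep 8 synapses, but irregular
# pruning needs 4 MCAs while the clustered structure needs 2.
pruned_energy = total_energy(8, 4, k=1, e_synapse=1.0, e_peripheral_core=10.0)
clustered_energy = total_energy(8, 2, k=1, e_synapse=1.0, e_peripheral_core=10.0)
print(pruned_energy, clustered_energy)
```

With equal sparsity, the entire difference between the two totals comes from the peripheral term, which is the effect the section describes.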
D. Impact on CMOS based general-purpose architecture
Typical digital CMOS based general-purpose architectures
for DNN execution consist of a memory unit to store the
weights (synapses) and a computation core to perform neu-
ronal computations using the weights fetched from the mem-
ory [6]. The total energy consumption per DNN inference in
such cases can be described as eqn. 3, where “n” is the number
of synapses, Computation corresponds to the neuron energy
expended per computation (multiplication and accumulation),
Memory Access corresponds to the energy spent for fetching
a weight from memory and Leakage represents the overall
leakage energy consumed per inference.
\[ \text{Energy} = \sum_{i=0}^{n} (\text{Computation} + \text{Memory Access}) + \text{Leakage} \tag{3} \]
Network pruning reduces the number of synapses (“n”)
thereby reducing the number of computations and memory
accesses required for DNN computation. However, this does
not lead to significant reduction in the memory size, as the
zero weights are an instrumental part of the DNN topology
information. Consequently, DNN sparsity does not lead to
significant energy savings as both memory access and leakage
energies are strong functions of memory size. Additionally,
energy profiles of FC layers are memory dominated which
further reduces the overall energy savings. In contrast to
network pruning, TraNNsformer helps to reduce the number
of synapses as well as the memory storage requirements for
remaining synapses. As shown in Fig. 2(b), such a transformed
DNN execution resembles execution of several clusters, where,
each cluster’s execution is identical to a smaller DNN structure
itself (except for the final activation computation). Eventually,
the partial products generated from these smaller DNNs will
need to be synchronized (combined) together to complete
an inference of transformed DNN execution. Although this
synchronization is associated with an overhead, the overall
saving from smaller memory size outweighs the overhead.
Thus, TraNNsformer achieves better energy efficiency compared
to the original as well as pruned DNN connectivities
on CMOS based general-purpose architectures. Note
that other techniques based on network compression have also
been studied to remove zero weight storage and optimize the
memory size requirements. However, significant modifications
are required at the hardware level to accelerate such com-
pressed DNNs [6]. On the contrary, TraNNsformer helps to
utilize the benefits from a reduced memory size on the original
accelerator (uncompressed DNNs) itself.
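Eqn. 3 can be evaluated for the three cases discussed in this subsection. The sketch below treats the sum as running over n synapses with constant per-synapse energies, and every number is a hypothetical placeholder; the only point carried over from the text is that memory access and leakage energies depend on memory size, which pruning alone leaves unchanged. The synchronization overhead mentioned above is not modelled.

```python
def cmos_energy(n_synapses, e_compute, e_mem_access, e_leakage):
    """Eqn. 3 with constant per-synapse computation and memory access energies."""
    return n_synapses * (e_compute + e_mem_access) + e_leakage

# Hypothetical values in arbitrary units.
dense = cmos_energy(16, e_compute=1.0, e_mem_access=5.0, e_leakage=20.0)
# Pruning halves n, but the memory (hence access and leakage cost) is unchanged.
pruned = cmos_energy(8, e_compute=1.0, e_mem_access=5.0, e_leakage=20.0)
# Clustering halves n and also allows a smaller memory (assumed cheaper).
clustered = cmos_energy(8, e_compute=1.0, e_mem_access=3.0, e_leakage=12.0)
print(dense, pruned, clustered)
```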
IV. EXPERIMENTAL METHODOLOGY
The TraNNsformer framework was implemented in MATLAB by
utilizing the relevant components from the MATLAB DeepLearn
Toolbox [18]. We analyzed the algorithmic benefits on SNNs
with MLP (fully-connected) topologies. To analyze the scalability
of the proposed framework, we used a range of applications
with different complexity, namely Digit Recognition
(MNIST dataset [13]), House Number Recognition (SVHN
dataset [16]) and Object Classification (CIFAR10 dataset [11]).
Different network architectures commensurate with the dataset
complexity were chosen for each application (shown in Fig.
4) to achieve high classification accuracy. Although we analyze
our results for SNN based benchmarks, the algorithmic benefits
would be similar for Artificial Neural Networks (ANNs).
This is because TraNNsformer works on optimizing the connectivity
structure between layers in a DNN, which is similar
for both SNN and ANN. Note that the ANN and SNN
(used in our case) differ only in the way inputs are transmitted
between layers.
Application                 Dataset    Layers  Neurons  Synapses
Digit Recognition           MNIST      4       3194     2392800
House Number Recognition    SVHN       5       4634     4120800
Object Classification       CIFAR-10   6       5834     5560800
Fig. 4: MLP based SNN benchmarks
[Figure: four panels plotting Fraction of Clusters (y-axis) against Cluster Quality (x-axis) for TraNNsformer and Offline Clustering]
Fig. 5: Comparison of crossbar utilization (cluster quality) between Offline Clustering and TraNNsformer
We used the architecture proposed in [2] for studying the
system-level benefits of TraNNsformer on post-CMOS MCA
based architectures. For the memristive devices, we used a
resistance range of “20KΩ – 200KΩ” with 16 levels (4 bits) for
weight-discretization, that is typical of memristive technolo-
gies such as Phase Change Memory (PCM), Ag-Si [20]. We
considered an operating voltage of “Vdd/2” for the MCA as it
is interfaced with CMOS neurons [9]. To analyze the system
level benefits on CMOS based general-purpose architectures,
we use the energy numbers for arithmetic operations in a 45nm
CMOS process shown in [7]. The memory for weight storage
was modeled using CACTI [15].
TraNNsformer is targeted to improve the area and energy
consumption of DNNs during inference phase but has higher
training effort (in terms of time and energy consumption) than
network pruning. However, typical DNNs are trained very
infrequently but used for testing/inference for much longer
times.
V. RESULTS
In this section, we present the results of various experiments
that demonstrate the benefits of TraNNsformer (at algorithm
and system level) for DNN acceleration on post-CMOS MCA
based systems. Note that we have used normalized values
to report the area and energy benefits. This is because the
benefits from our proposed framework are orthogonal to the
benefits obtained from any particular choice/design of MCA
based architecture.
Additionally, we evaluate the benefits obtained on
CMOS based systems to demonstrate the effectiveness of the
proposed framework in translating the benefits of DNN
sparsity into reduced memory storage requirements.
A. Algorithm level analysis
Fig. 5 shows the fractional distribution of clusters formed
with respect to cluster quality for two different training
approaches, namely TraNNsformer and offline clustering, on the
SVHN dataset. In both approaches, the DNNs are trained to
achieve iso-accuracies. Cluster quality represents the number
of non-zero synapses present in a cluster. Hence, a cluster with
higher cluster quality maps to an MCA with a higher utilization
factor. As mentioned before, higher MCA utilization helps to
achieve higher area as well as energy efficiency on MCA
based architectures. Fig. 5 represents the MCA utilization
for a DNN with 70% average sparsity (across all layers)
under offline clustering. MCA utilization for a DNN trained
with network pruning (pruned DNN) is almost uniform across
all MCAs that are required for mapping. Consequently, the
pruned DNN maps to MCAs with a 0.3 utilization factor. As
shown in Fig. 5, applying offline clustering to the pruned DNN
significantly improves the MCA utilization across
all the layers. It can also be seen that TraNNsformer further
improves the MCA utilization by a significant factor across
all layers in comparison to the offline clustering approach.
This underscores the fact that dynamic cluster formation
during the training process yields more highly structured
DNN connectivity than either pruning or offline
clustering.
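To make the link between sparsity and MCA utilization concrete, the sketch below tiles a binary connectivity matrix into fixed-size crossbars and reports the utilization of each tile, i.e. the fraction of cross-points holding a non-zero synapse. The function name `mca_utilizations`, the 64x64 tile size, the 256x256 layer and the randomly pruned test matrix are all assumptions chosen for illustration; this is not the mapping procedure used in the paper.

```python
import random


def mca_utilizations(mask, mca_size):
    """Utilization (non-zero fraction) of each mca_size x mca_size tile.

    `mask` is a list of rows of 0/1 entries. Edge tiles are treated as
    zero-padded, so they are still divided by the full crossbar area.
    """
    rows, cols = len(mask), len(mask[0])
    utils = []
    for r0 in range(0, rows, mca_size):
        for c0 in range(0, cols, mca_size):
            nonzero = sum(
                mask[r][c]
                for r in range(r0, min(r0 + mca_size, rows))
                for c in range(c0, min(c0 + mca_size, cols))
            )
            utils.append(nonzero / (mca_size * mca_size))
    return utils


# Unstructured pruning at 70% sparsity, modeled (as an assumption) by
# keeping each synapse independently with probability 0.3.
rng = random.Random(0)
pruned = [[1 if rng.random() < 0.3 else 0 for _ in range(256)]
          for _ in range(256)]
utils = mca_utilizations(pruned, mca_size=64)
mean_util = sum(utils) / len(utils)
print(len(utils), round(mean_util, 2))
```

With uniformly random pruning, every tile's utilization sits close to one minus the sparsity, which is consistent with the near-uniform utilization described for the pruned DNN. Clustering aims to move the surviving synapses into fewer, denser tiles.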
Fig. 6 shows that the fraction of unclustered synapses
remaining after the training process in the offline clustering
approach is significantly higher than in TraNNsformer across all the
layers (shown by markers in Fig. 6) of the DNN. A large number
of unclustered synapses is undesirable, as their mapping results
in a large number of poorly utilized MCAs. Additionally, the
number of MCAs required for mapping unclustered synapses
is much higher than the number of MCAs mapped to the
clustered synapses, thereby diminishing the overall benefits
obtained from clustering. In contrast, TraNNsformer based DNN
training leads to a much smaller fraction of unclustered synapses
across all the layers in the DNN (except the last layer). Consequently,
the number of MCAs with low utilization factors
is insignificant with respect to the number of MCAs that are
mapped to the clustered synapses.
[Fig. 6 plot. x-axis: Layers of Neural Network; y-axis: Unclustered Synapse Fraction (0 to 0.8); legend: TraNNsformer, Offline Clustering]
Fig. 6: Comparison of fraction of unclustered synapses between
Offline Clustering and TraNNsformer (the data points correspond
to the layers of DNN). Note that the last fully connected layer
consists of a small fraction of synapses (<1%), thereby having
insignificant effect on overall unclustered synapse comparison.
Both a lower fraction of unclustered synapses and a higher
cluster quality are equally important factors in reducing the
number of MCAs required for mapping. A DNN with low
cluster quality would map the clustered synapses across a large
number of MCAs, whereas a DNN with a higher fraction of
unclustered synapses consumes a large number of MCAs to map
the unclustered synapses. TraNNsformer therefore optimizes both
factors concurrently to reduce the total number of MCAs
required. Similar results for cluster quality
distribution and fraction of unclustered synapses were obtained
for MNIST and CIFAR10 as well, underscoring the
scalability of the proposed training framework with
respect to DNN size (number of layers, number of neurons in
each layer).
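The interplay of the two factors can be illustrated with a simple two-population estimate: clustered synapses fill MCAs at one average utilization, and unclustered synapses fill MCAs at a (typically much lower) utilization. The model, the function name `mcas_required` and all numeric inputs below are hypothetical simplifications introduced here, not the mapping algorithm or data of the paper.

```python
import math


def mcas_required(total_synapses, unclustered_fraction,
                  cluster_utilization, unclustered_utilization, mca_size):
    """Estimate (MCAs for clustered, MCAs for unclustered) synapses.

    Assumption: each population fills its MCAs at a single average
    utilization, so count = synapses / (utilization * cross-points).
    """
    cells = mca_size * mca_size
    unclustered = total_synapses * unclustered_fraction
    clustered = total_synapses - unclustered
    n_clustered = math.ceil(clustered / (cluster_utilization * cells))
    n_unclustered = math.ceil(unclustered / (unclustered_utilization * cells))
    return n_clustered, n_unclustered


# Hypothetical layer with 40960 non-zero synapses mapped to 64x64 MCAs.
dense_clusters = mcas_required(40960, 0.0, 1.0, 0.5, 64)
poor_clusters = mcas_required(40960, 0.0, 0.5, 0.5, 64)
many_unclustered = mcas_required(40960, 0.5, 1.0, 0.5, 64)
print(dense_clusters, poor_clusters, many_unclustered)
```

In this toy model, either halving the cluster utilization or leaving half of the synapses unclustered increases the total MCA count, which is why both quantities need to be improved together.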
B. Area and energy comparisons on MCA based architecture
Fig. 7 shows the area consumption on the MCA based architecture
for DNNs with iso-accuracies trained using four
different approaches, namely 1) Original, i.e. typical backpropagation,
2) Pruning (network pruning), 3) Offline clustering,
and 4) TraNNsformer. The area consumption has been
normalized with respect to the “Original” area consumption
for each dataset. It can be seen that TraNNsformer achieves
area savings of 28% – 55% (39% on average) compared to
the original DNN across all datasets. Furthermore, TraNNs-
former also achieves significant area reductions of 28% –
49% (37% on average) with respect to the pruned DNNs
across all datasets. This underscores the effectiveness of the
proposed TraNNsformer framework in preserving the benefits
of DNN sparsity at the hardware level. As mentioned before, a
DNN trained with network pruning has irregular (unstructured)
[Fig. 7 plot. Bar chart of Normalized Area (0 to 1.4) for MNIST, SVHN and CIFAR10; legend: Original, Pruning, Offline Clustering, TraNNsformer]
Fig. 7: Comparison of area consumption on MCA based archi-
tecture for different DNN training approaches
sparsity, which results in the DNN being mapped across a large
number of MCAs with low utilization factors. Consequently,
the benefits of network sparsity do not improve the DNN’s
area efficiency. It can be seen that the area consumption for
DNNs trained with offline clustering is significantly higher
than in the TraNNsformer case across all benchmarks. This
underlines the importance of a clustering-driven DNN training
approach for achieving efficient implementation on
MCA based systems. It is also worth noting that the area
consumption in the offline clustering case is irregular, i.e. lower
than or comparable to network pruning in some cases (CIFAR10)
while being higher in other cases (MNIST and SVHN). This is
because the larger fraction of unclustered synapses remaining
after clustering in SVHN gets mapped to a large number of
MCAs with very low utilization factors, thereby worsening the area
consumption.
Fig. 8 shows the energy consumption per classification
obtained for DNNs trained using the four training approaches.
The energy consumption has been normalized with respect
to the “Original” energy consumption for each dataset. It
can be seen that TraNNsformer achieves significant energy
improvements of 49% – 67% (56% on average) across all
datasets. Furthermore, it also achieves 15% – 29% (20% on
average) energy reduction with respect to the pruned DNNs
across all datasets. As mentioned before, pruning decreases
only the MCA energy component while having minimal effect
on the peripheral energy component. However, TraNNsformer
based network sparsity translates to commensurate savings for
both the MCA and peripheral energy components, thereby
leading to greater energy savings. It can also be seen that
[Fig. 8 plot. Bar chart of Normalized Energy (0 to 1.2) for MNIST, SVHN and CIFAR10; legend: Original, Pruning, Offline Clustering, TraNNsformer]
Fig. 8: Comparison of energy consumption on MCA based
architecture for different DNN training approaches
the offline clustering approach has higher energy consumption
than the pruned DNNs on MNIST and SVHN, owing to the higher
fraction of unclustered synapses.
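The distinction between the MCA and peripheral energy components can be sketched with a toy model in which each active MCA pays a fixed peripheral cost and each non-zero synapse pays a crossbar cost. The function `inference_energy`, both cost constants and all MCA and synapse counts below are hypothetical, unitless placeholders chosen only to show the trend; they are not measurements or results from the paper.

```python
def inference_energy(num_mcas, num_synapses,
                     e_peripheral=1.0, e_synapse=0.001):
    """Toy energy model (an assumption, not the paper's model).

    Peripheral energy scales with the number of MCAs in use; crossbar
    energy scales with the number of non-zero synapses.
    """
    return num_mcas * e_peripheral + num_synapses * e_synapse


# Hypothetical counts. Unstructured pruning removes synapses but keeps
# every MCA in use; clustering also reduces the number of MCAs.
original = inference_energy(num_mcas=100, num_synapses=400_000)
pruned = inference_energy(num_mcas=100, num_synapses=120_000)
clustered = inference_energy(num_mcas=40, num_synapses=120_000)
print(pruned / original, clustered / original)
```

Under these assumptions the pruned network saves only the crossbar term, while the clustered network saves on both terms, mirroring the qualitative argument made above.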
C. Energy comparison on CMOS based general-purpose architectures
[Fig. 9 plot. Bar chart of Normalized Energy (0 to 1.2) for MNIST, SVHN and CIFAR10; legend: Original, Prune, TraNNsformer]
Fig. 9: Comparison of energy consumption on CMOS based
general-purpose architecture for different DNN training ap-
proaches
Fig. 9 shows the energy consumption per classification of
DNNs trained using three different training approaches: 1)
Original, i.e. typical backpropagation, 2) Pruning (network
pruning), and 3) TraNNsformer. The energy consumption has
been normalized with respect to the “Original” energy consumption
for each dataset. It can be seen that both network
pruning and TraNNsformer achieve significant energy
reductions in comparison to the original DNN. However,
TraNNsformer is 20% – 57% (37% on average) more energy
efficient than the pruned DNN. This efficiency stems
from the significant reduction in memory storage requirement
for TraNNsformer based DNNs. A reduction in memory size
translates to commensurate savings in the memory energy components
(both access and leakage). Additionally, as mentioned
before, the energy profiles of FC layers are dominated by
memory energy, so the memory energy savings
translate to significant savings in overall energy consumption.
VI. CONCLUSIONS
The intrinsic compatibility of post-CMOS technologies with
biological primitives has ushered in the use of Memristive
Crossbar Arrays (MCAs) in neuromorphic systems in order
to achieve low-power acceleration of Deep Neural Networks
(DNNs). However, DNNs have multiple static (known before
training) connectivity patterns (for instance, CNN, MLP, RNN)
which are primarily application dependent. Further, techniques
that optimize a DNN by obtaining highly sparse connectivity
(network pruning) add a high degree of dynamic variability
(not known before training) to the connectivity pattern.
This variability in connectivity pattern requires hardware-aware
mapping algorithms to maximize the area and
energy benefits for MCA based systems. While rule based
mapping techniques can address the static variability, dynamic
connectivity patterns are much more challenging to
map owing to their large degree of variability. Additionally,
an inefficient mapping algorithm prevents the algorithmic
benefits of sparsity from being preserved at the hardware level. In
this work, we proposed TraNNsformer, an integrated training
framework that learns connectivity structures which can be
efficiently mapped to MCAs while preserving the algorithmic
benefits of network sparsity. We also developed a technology-aware
clustering approach to produce efficient mappings for
any MCA size permitted by the technology for reliable
operation. Furthermore, we showed that TraNNsformer also
leads to energy reductions in CMOS based general-purpose
architectures, thereby demonstrating the generality of the proposed
framework across different architecture styles. Our results on
a range of recognition applications suggest that TraNNsformer
is a promising framework to implement DNNs, providing
a scalable solution to designing large-scale neuromorphic
systems.
REFERENCES
[1] F. Akopyan et al. Truenorth: Design and tool flow of a 65 mw 1 million
neuron programmable neurosynaptic chip. IEEE TCAD, 2015.
[2] A. Ankit et al. Resparc: A reconfigurable and energy-efficient architec-
ture with memristive crossbars for deep spiking neural networks. ACM
DAC, 2017.
[3] S. Anwar et al. Structured pruning of deep convolutional neural
networks. arXiv preprint arXiv:1512.08571, 2015.
[4] P. Chi et al. Prime: A novel processing-in-memory architecture for
neural network computation in reram-based main memory. ACM ISCA,
2016.
[5] P. U. Diehl et al. Fast-classifying, high-accuracy spiking deep networks
through weight and threshold balancing. IEEE IJCNN, 2015.
[6] S. Han et al. Eie: efficient inference engine on compressed deep neural
network. ACM ISCA, 2016.
[7] S. Han et al. Learning both weights and connections for efficient neural
network. NIPS, 2015.
[8] S. H. Jo et al. Nanoscale memristor device as synapse in neuromorphic
systems. Nano letters, 2010.
[9] A. Joubert et al. Hardware spiking neurons design: Analog or digital?
IEEE IJCNN, 2012.
[10] A. Karpathy et al. Deep visual-semantic alignments for generating image
descriptions. CVPR, 2015.
[11] A. Krizhevsky et al. Learning multiple layers of features from tiny
images. 2009.
[12] A. Krizhevsky et al. Imagenet classification with deep convolutional
neural networks. NIPS, 2012.
[13] Y. LeCun et al. Gradient-based learning applied to document recogni-
tion. Proceedings of the IEEE, 1998.
[14] T. Mikolov et al. Recurrent neural network based language model.
Interspeech, 2010.
[15] N. Muralimanohar et al. Optimizing nuca organizations and wiring
alternatives for large caches with cacti 6.0. IEEE MICRO, 2007.
[16] Y. Netzer et al. Reading digits in natural images with unsupervised
feature learning. 2011.
[17] A. Y. Ng et al. On spectral clustering: Analysis and an algorithm. NIPS,
2001.
[18] R. B. Palm. Prediction as a candidate for learning deep hierarchical
models of data. Technical University of Denmark, 5, 2012.
[19] M. Prezioso et al. Training and operation of an integrated neuromorphic
network based on metal-oxide memristors. Nature, 2015.
[20] B. Rajendran et al. Specifications of nanoscale devices and circuits for
neuromorphic computational systems. IEEE Transactions on Electron
Devices, 2013.
[21] A. Sengupta et al. Proposal for an all-spin artificial neural network:
Emulating neural and synaptic functionalities through domain wall
motion in ferromagnets. IEEE TBioCAS, 2016.
[22] A. Shafiee et al. Isaac: A convolutional neural network accelerator with
in-situ analog arithmetic in crossbars. ACM ISCA, 2016.
[23] K. Simonyan et al. Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556, 2014.
[24] M. Sundermeyer et al. Lstm neural networks for language modeling.
Interspeech, 2012.
[25] W. Wen et al. Learning structured sparsity in deep neural networks.
NIPS, 2016.
[26] W. Wen et al. An eda framework for large scale hybrid neuromorphic
computing systems. ACM DAC, 2015.
