Group Scissor: Scaling Neuromorphic Computing Design to Large Neural
  Networks by Wang, Yandan et al.
Yandan Wang
University of Pittsburgh
yaw46@pitt.edu
Wei Wen
University of Pittsburgh
wew57@pitt.edu
Beiye Liu
University of Pittsburgh
bel34@pitt.edu
Donald Chiarulli
University of Pittsburgh
don@pitt.edu
Hai (Helen) Li
University of Pittsburgh
hal66@pitt.edu
ABSTRACT
The synapse crossbar is an elementary structure in neuromorphic
computing systems (NCS). However, the limited size
of crossbars and heavy routing congestion impede the NCS
implementations of big neural networks. In this paper, we
propose a two-step framework (namely, group scissor) to
scale NCS designs to big neural networks. The first step
is rank clipping, which integrates low-rank approximation
into the training to reduce total crossbar area. The second
step is group connection deletion, which structurally prunes
connections to reduce routing congestion between crossbars.
Tested on convolutional neural networks of LeNet on MNIST
database and ConvNet on CIFAR-10 database, our experi-
ments show significant reduction of crossbar area and rout-
ing area in NCS designs. Without accuracy loss, rank clip-
ping reduces total crossbar area to 13.62% and 51.81% in the
NCS designs of LeNet and ConvNet, respectively. Following
rank clipping, group connection deletion further reduces the
routing area of LeNet and ConvNet to 8.1% and 52.06%,
respectively.
1. INTRODUCTION
The record-breaking classification performance of deep neu-
ral networks (DNNs) [1] in recent years has stimulated the
fast-growing research on hardware design of neuromorphic
computing systems (NCS) [2][3][4][5][6][7]. NCS utilize de-
vices and circuit components to mimic the behaviors of neu-
ral networks to perform intelligent tasks, such as image clas-
sification, speech recognition and natural language process-
ing. Circuit-level and architecture-level NCS designs using
emerging memristor devices [8] and traditional CMOS tech-
nologies [3] are being explored.
In software applications, the depth of DNNs has rapidly grown
from several layers to hundreds or even thousands of layers [9].
However, the scale of NCS hardware designs falls far behind.
A major issue that obstructs the scaling-up of NCS to big neural
networks is the limited size of the synaptic connection arrays
(e.g., crossbars) in hardware implementations, which in turn
results in heavy wire congestion (e.g., in the routing between
crossbars). Taking memristor-based NCS as an example, under the
impact of IR-drop and process variations, both reading and writing
reliability are severely degraded when the size of a
memristor-based crossbar exceeds 64 × 64 [10][11]. A similar limit
can be observed in conventional CMOS-based designs. For example,
the IBM TrueNorth chip, a pioneer in NCS design, limits
the size of its neurosynaptic crossbars to 256 × 256 [3].
Interconnecting multiple crossbars is therefore inevitable when
implementing modern big neural networks. The increasing scale of
neural networks can quickly exhaust the available synapse
crossbars and worsen the wire congestion [12][13].
To solve those problems, [13] mapped logically-connected
cores to physically-adjacent cores to reduce spike communications.
However, it only optimized the placement of cores
and cannot reduce the core consumption. Existing NCS
optimization based on traditional sparse neural networks can
alleviate the wire congestion [12]. However, such approaches
usually separate the software sparsification from the hardware
deployment, which makes the optimization more challenging.
Unlike previous work, we propose a two-step framework, group
scissor, to overcome the above issues and scale NCS designs to big
neural networks. The first step is rank clipping, which integrates
low-rank approximation into the training process of neural
networks. It aims to reduce the dimensions of connection arrays in
a group-wise way and thereby reduce the consumption of synapse
crossbars in NCS. The second step, group connection deletion,
structurally deletes (prunes) groups of connections. This approach
learns hardware-friendly sparse neural networks so that the routing
wires between crossbars can be deleted directly, with a
controllably low hardware cost. Unlike [12], which evaluated
NCS with Hopfield networks on a less challenging database,
we evaluate group scissor with state-of-the-art convolutional
neural networks, LeNet and ConvNet, on the MNIST
and CIFAR-10 databases. Our experiments show that, without
accuracy loss, rank clipping reduces the total crossbar area
to 13.62% and 51.81% in LeNet and ConvNet, respectively, and
group connection deletion reduces the routing area to 8.1%
and 52.06%, respectively.
Figure 1: The NCS designs for (a) a small convolutional layer, and (b) a big layer.
[Figure omitted. Panel (a) labels: input, filter 1 ... filter n, convolutional layer. Panel (b) labels: input, output.]
arXiv:1702.03443v2 [cs.NE] 17 Jun 2017
2. PRELIMINARY
Figure 1(a) illustrates an implementation of a convolutional
layer in neural networks using memristor-based crossbars
(MBC), where the memristors (a.k.a. synapses) in each column
encode the weights of one filter [14]. The implementation
of a fully-connected layer uses a similar structure,
but each column realizes the connections to one output neuron.
As mentioned above, the sizes of crossbars are limited,
so implementing big neural networks requires a large number
of interconnections between crossbars.
Figure 1(b) depicts a circuit-level implementation of a big
layer by tiling and interconnecting MBC [12]. As the scale
of modern neural networks grows, the high crossbar area
occupation and heavy routing congestion become critical is-
sues and seriously obstruct the scalability of the hardware
implementation.
3. THE GROUP SCISSOR FRAMEWORK
In this work, we propose the Group Scissor framework
to improve the scalability of neuromorphic computing de-
sign. The framework contains two steps: rank clipping for
crossbar area occupation reductions and group connection
deletion for routing congestion reduction. The details of the
proposed design will be described in this section. Moreover,
the estimations of circuit area and routing wires for MBC-
based neuromorphic design are formulated.
3.1 Rank Clipping
As discussed above, the high crossbar area occupation and
heavy routing congestion are the major obstacles in realiz-
ing big neural networks. In order to overcome these issues,
we propose to utilize low-rank approximation (LRA) to re-
duce the dimensions of weight (connection) matrices in big
neural networks. Low-rank approximation is a mathematical
technique that uses the product of smaller matrices
with reduced rank to approximate a given large matrix.
Specifically, an original weight matrix $W \in \mathbb{R}^{N \times M}$ can be
approximated as
$$W \approx U \cdot V^T = \tilde{W}, \tag{1}$$
where $U \in \mathbb{R}^{N \times K}$, $V^T \in \mathbb{R}^{K \times M}$, and $K$ is the rank of the
approximation. When $K \ll M$, $U$ and $V$ are reduced to
skinny matrices. The total crossbar area occupation can be
reduced when the rank $K$ satisfies
$$K < \frac{NM}{N + M}. \tag{2}$$
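Condition (2) follows from comparing cell counts: the two factors need NK + KM crossbar cells, against NM for the original matrix. A minimal Python sketch of that check is given below. The function names and the 512 × 256 layer size are hypothetical and chosen only for illustration; they do not come from the paper.

```python
def lra_area_ratio(n: int, m: int, k: int) -> float:
    """Cells needed by U (n x k) plus V^T (k x m), relative to W (n x m)."""
    return (n * k + k * m) / (n * m)


def max_rank_with_saving(n: int, m: int) -> int:
    """Largest integer K satisfying the strict inequality K < N*M / (N + M)."""
    # K * (N + M) < N * M  is equivalent to  K * (N + M) <= N * M - 1 for integers.
    return (n * m - 1) // (n + m)


if __name__ == "__main__":
    n, m = 512, 256  # hypothetical layer size, for illustration only
    print(max_rank_with_saving(n, m))   # 170
    print(lra_area_ratio(n, m, 64))     # 0.375
```

Any rank above the bound makes the factorized form larger than the original matrix, so it is only worth storing U and V when the clipped rank ends up well below it.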
There are various LRA techniques. Without losing gener-
ality, commonly used principal components analysis (PCA)
[15] and singular value decomposition (SVD) [11] are adopted
as the representatives in this work.
The PCA approach is formulated in Algorithm 1. The
essence of PCA is a linear projection from a high-dimensional
space ($w_n \in \mathbb{R}^M$) to a lower-dimensional subspace ($u_n \in \mathbb{R}^K$, $K \ll M$)
to minimize the reconstruction error of $W$,
where $w_n$ and $u_n$ are the $n$-th rows of $W$ and $U$, respectively,
and $V$ is the basis of the subspace. The reconstruction error
is
$$e_K = \frac{\|W - \tilde{W}\|^2}{\|W\|^2} = \frac{\sum_{m=K+1}^{M} \lambda_m}{\sum_{m=1}^{M} \lambda_m}, \tag{3}$$
where $\|\cdot\|$ is the Euclidean norm, namely Euclidean distance.
Algorithm 1: Principal Components Analysis (PCA)
Input : $N \times M$ matrix $W$, and rank $K$
1 Get mean of rows $w_n$ $\forall n \in [1...N]$: $\mu = \frac{1}{N} \sum_{n=1}^{N} w_n$;
2 Centralize the data: replace each $w_n$ with $w_n - \mu$;
3 Calculate the $M \times M$ covariance matrix: $C = \frac{W^T W}{N-1}$;
4 Calculate the eigenvectors $v_m$ and eigenvalues $\lambda_m$ of
  covariance matrix $C$: $C v_m = \lambda_m v_m$, $\forall m \in [1...M]$;
5 Project to subspace: $N \times K$ matrix $U = WV$, where
  $V = [v_1, ..., v_K]$ is an $M \times K$ matrix and $v_1, ..., v_K$ are the
  eigenvectors corresponding to the largest $K$ eigenvalues;
Output: $N \times M$ approximation matrix $\tilde{W} = U \cdot V^T$
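The steps of Algorithm 1 can be sketched in NumPy as follows. This is an illustrative sketch rather than the authors' implementation: the function name `pca_lra` is hypothetical, the row mean is returned so that a caller can add it back, and the error of Eq. (3) is evaluated with the Frobenius norm on the centered matrix, which is an assumption about how the norm is applied.

```python
import numpy as np


def pca_lra(W, K):
    """PCA-based low-rank approximation of an N x M matrix (sketch of Algorithm 1)."""
    N, M = W.shape
    mu = W.mean(axis=0)                   # step 1: mean of the rows
    Wc = W - mu                           # step 2: centralize the data
    C = Wc.T @ Wc / (N - 1)               # step 3: M x M covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)  # step 4: eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]     # reorder so the largest eigenvalue comes first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    V = eigvecs[:, :K]                    # M x K basis of the subspace
    U = Wc @ V                            # step 5: N x K projection
    err = eigvals[K:].sum() / eigvals.sum()  # Eq. (3): share of discarded eigenvalues
    return U, V, mu, err


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(20, 8))
    U, V, mu, err = pca_lra(W, 3)
    Wc = W - mu
    direct = np.linalg.norm(Wc - U @ V.T) ** 2 / np.linalg.norm(Wc) ** 2
    print(np.isclose(err, direct))
```

The usage lines compare the eigenvalue form of Eq. (3) with the directly measured relative error of the centered matrix; the two agree because the discarded eigenvalues account for exactly the energy outside the retained subspace.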
Although LRA can approximately reconstruct the origi-
nal weights, small perturbation of weights can deteriorate
the classification accuracy. Table 1 compares the perfor-
mance of the original baseline design (“Original”) and the
low-rank neural networks which are directly decomposed by
PCA (“Direct LRA”). The accuracy drops rapidly after ap-
plying “Direct LRA”. Fine-tuning (retraining) the low-rank
neural networks can recover accuracy, but the optimal ranks
in all layers are unknown. More importantly, it is very
time-consuming to explore the entire design space by decomposing
and retraining a wide variety of neural networks. We
propose LRA-based rank clipping, which not only retains
the accuracy but also automatically converges to the
optimal low ranks in all layers. The low ranks in
Table 1 are in fact the ones obtained by our rank clipping method.
The key idea of rank clipping is illustrated in Figure 2.
Rather than applying LRA directly after training, we integrate LRA
into the training process and carefully clip some ranks with
small reconstruction errors after a fixed number of training
iterations, say, S iterations. The gentle clipping induces
small reconstruction errors and thus only slightly affects the
classification accuracy. As such, the accuracy can be recovered
by the following S iterations. Alternating between clipping and
training not only avoids irremediable accuracy degradation
but also enables neural networks to gradually converge to
the optimal ranks for all layers.
Algorithm 2 describes the detailed operation of rank
clipping. The tolerable clipping error ε is the maximum
reconstruction error allowed after each rank clipping. A gentle
clipping can be enabled by setting a small ε, e.g., 0.01.
Rank clipping starts with a full-rank LRA without reconstruction
error, and iteratively examines whether the low-dimensional
U can be further projected to a lower-rank subspace with
a reconstruction error of at most ε. Note that PCA is used as
the representative of LRA in Algorithm 2. However, other
LRA methods like SVD can also be used. The only modification
is to replace the approximation of the weight matrix by
Figure 2: Rank clipping for crossbar area occupation reduction.
[Figure omitted: W ≈ U × V^T, with rank components marked ① = Clipped, ② = Clipping, ③ = Remained.]
Table 1: Accuracy and ranks
                                                      Rank K
Database   Net           Method          Accuracy   conv1†  conv2  conv3  fc1†  fc2
MNIST      LeNet [16]    Original‡       99.15%     20      50     –      500   10
MNIST      LeNet [16]    Direct LRA      96.44%     5       12     –      36    10
MNIST      LeNet [16]    Rank clipping   99.14%     5       12     –      36    10
CIFAR-10   ConvNet [1]   Original        82.01%     32      32     64     10    –
CIFAR-10   ConvNet [1]   Direct LRA      43.29%     12      19     22     10    –
CIFAR-10   ConvNet [1]   Rank clipping   82.09%     12      19     22     10    –
† conv1 is the first convolutional layer, fc1 is the first fully-connected layer, and so forth.
‡ The corresponding rank indicates the number of filters in convolutional layers or the number of output
neurons in fully-connected layers.
Algorithm 2: Rank Clipping
Input : Trained original neural network, tolerable clipping
        error $\varepsilon$, maximum training iteration $I$, clipping
        step $S$
1 for each layer $l$ do
2     PCA of weight matrix $W_l = U_l \cdot V_l^T$ with full rank $K_l = M_l$;
3 end
4 while $i = 1$; $i < I$; $i = i + S$ do
5     for each layer $l$ do
6         PCA of $U_l = \hat{U}_l \cdot \hat{V}_l^T$ using the minimum rank $\hat{K}$
          which satisfies $e_{\hat{K}} \le \varepsilon$;
7         if $\hat{K} < K_l$ then
8             $K_l = \hat{K}$; $U_l = \hat{U}_l$; $V_l^T = \hat{V}_l^T \cdot V_l^T$
9         else
10            continue;
11        end
12    end
13    Train the neural network for $S$ iterations;
14 end
Output: Clipped low-rank neural network with
        approximation $W_l = U_l \cdot V_l^T$ for each layer $l$
the other LRA method.
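A single clipping step for one layer (lines 6 to 8 of Algorithm 2) might look like the NumPy sketch below. Two simplifications are assumptions of this sketch and not something the paper specifies: the mean-centering step of Algorithm 1 is skipped so that the layer stays a plain product U · V^T, and the training between clipping steps is left out entirely. The names `min_rank_for_error` and `clip_rank` are hypothetical.

```python
import numpy as np


def min_rank_for_error(eigvals, eps):
    """Smallest K >= 1 whose discarded eigenvalue share is at most eps.

    eigvals must be non-negative and sorted in descending order.
    """
    total = eigvals.sum()
    # tail[k] is the eigenvalue mass discarded when the top k components are kept.
    tail = np.concatenate([np.cumsum(eigvals[::-1])[::-1], [0.0]])
    for k in range(1, len(eigvals) + 1):
        if tail[k] <= eps * total:
            return k
    return len(eigvals)


def clip_rank(U, Vt, eps):
    """One clipping step for a layer stored as W ~= U @ Vt."""
    K = U.shape[1]
    # Uncentered PCA of U via the eigen-decomposition of the K x K matrix U^T U.
    eigvals, eigvecs = np.linalg.eigh(U.T @ U)
    order = np.argsort(eigvals)[::-1]
    eigvals = np.clip(eigvals[order], 0.0, None)  # guard against tiny negative values
    eigvecs = eigvecs[:, order]
    K_hat = min_rank_for_error(eigvals, eps)
    if K_hat < K:
        V_hat = eigvecs[:, :K_hat]  # K x K_hat basis of the lower-rank subspace
        U = U @ V_hat               # new U_l, shape N x K_hat
        Vt = V_hat.T @ Vt           # new V_l^T, shape K_hat x M
    return U, Vt


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    W = rng.normal(size=(30, 3)) @ rng.normal(size=(3, 10))  # synthetic rank-3 matrix
    U, Vt = clip_rank(W.copy(), np.eye(10), eps=0.01)        # trivial full-rank start
    print(U.shape, Vt.shape)
```

In a full training loop this step would be called for every layer once per S iterations, with ordinary training of U and V^T in between.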
Figure 3 plots the trends of rank reduction and accuracy
retention during the rank clipping of LeNet in Table 1 using
PCA. Rank clipping is examined every S = 500 (denoted as
5e2 in the x-axis title) iterations with ε = 0.03. In the figure,
the rank ratio is defined as the remaining rank over the full rank,
i.e., K/M. The figure demonstrates that ranks are rapidly
clipped in the early iterations and converge to optimal
low ranks. During the entire process, the accuracy
changes are limited to small fluctuations.
As shown in Figure 3 and Table 1, the proposed rank
Figure 3: Rank ratio of each layer and accuracy during training with rank clipping.
[Plot omitted. x-axis: training iteration (5e2), 0 to 60. y-axes: rank ratio (0 to 1) and accuracy (0.8 to 1). Series: conv1, conv2, fc1, Accuracy.]
Figure 4: The group connection deletion.
[Diagram omitted: an array of P × Q crossbars implementing a matrix with N inputs and K outputs, with one row group and one column group marked.]
clipping successfully reduces the ranks in both convolutional
layers and fully-connected layers without accuracy loss. The
crossbar area occupation of the entire LeNet (ConvNet) is
reduced to 13.62% (51.81%). When SVD is applied instead of PCA,
the whole crossbar area is reduced to 32.97%
(55.64%) for LeNet (ConvNet), which indicates that SVD is
inferior to PCA in this setting. Therefore, we mainly conduct
experiments using the PCA approach. Note that the last layers of LeNet
and ConvNet are not clipped because the rank (M = 10) is
already very small and there is little room for improvement.
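The reported totals can be cross-checked with a few lines of Python. The clipped ranks are taken from Table 1, but the layer shapes below are assumptions: the paper does not list them, and they are inferred here from 5 × 5 filters and from the crossbar sizes in Table 3 (25/500/800/500 inputs for LeNet and 75/800/800/1024 inputs for ConvNet). Under those assumed shapes, the cell-count ratio matches the percentages quoted above.

```python
# (name, N inputs, M outputs, clipped rank K or None when the layer is not clipped)
# The N values are assumed shapes, not stated in the paper.
LENET = [
    ("conv1", 25, 20, 5),
    ("conv2", 500, 50, 12),
    ("fc1", 800, 500, 36),
    ("fc2", 500, 10, None),
]
CONVNET = [
    ("conv1", 75, 32, 12),
    ("conv2", 800, 32, 19),
    ("conv3", 800, 64, 22),
    ("fc1", 1024, 10, None),
]


def crossbar_area_ratio(layers):
    """Total cells after rank clipping divided by total cells of the original network."""
    original = sum(n * m for _, n, m, _ in layers)
    clipped = sum(n * m if k is None else n * k + k * m for _, n, m, k in layers)
    return clipped / original


if __name__ == "__main__":
    print(round(100 * crossbar_area_ratio(LENET), 2))    # 13.62
    print(round(100 * crossbar_area_ratio(CONVNET), 2))  # 51.81
```

The unclipped classifier layer is counted at its full size in both numerator and denominator, consistent with the note that the last layer is left untouched.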
3.2 Group Connection Deletion
Rank clipping can reduce the total number of required
crossbars, but a large number of crossbars will still be
necessary to implement modern big neural networks. The second
step of our group scissor framework, group connection
deletion, aims at removing interconnections between synapse
crossbars so as to reduce the circuit-level routing congestion
and architecture-level inter-core communication for NCS.
Figure 4 gives the basic idea of group connection deletion.
An array of MBCs is interconnected to implement a large
weight matrix $U \in \mathbb{R}^{N \times K}$. Suppose the elementary synapse
crossbar has $P$ inputs and $Q$ outputs ($P \ll N$, $Q \ll K$); then a
$\lceil N/P \rceil \times \lceil K/Q \rceil$ array of crossbars must be interconnected
to implement $U$, as illustrated in Figure 4. The implementation of
the other matrix $V$ follows the same method. As memristors
can be densely manufactured in the crossbar and the
area of each memristor cell is at the feature-size level, the routing
wires dominate the circuit area [12]. If all connections in a row
group in Figure 4 have zero weights, those connections are
removable and we can delete (prune)
the wire routing to the input of this row group. Similarly,
the wire routing from the output of a column group can be
deleted when the connections in that column group are all zeros.
Our group connection deletion method actively deletes those
groups of connections during the learning of neural networks,
while maintaining the classification accuracy at
a similar level.
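The tiling and the notion of a removable group can be illustrated with a short NumPy sketch. It assumes one particular reading of Figure 4: a row group is the set of weights sharing one input row inside a single crossbar, and a column group is the set of weights sharing one output column inside a single crossbar. The function name `count_zero_groups` is hypothetical.

```python
import numpy as np


def count_zero_groups(U, P, Q):
    """Tile an N x K matrix with P x Q crossbars and count all-zero groups.

    Returns the crossbar array shape (ceil(N/P), ceil(K/Q)) together with the
    number of all-zero row groups and all-zero column groups.
    """
    N, K = U.shape
    tile_rows = -(-N // P)  # ceil(N / P)
    tile_cols = -(-K // Q)  # ceil(K / Q)
    zero_rows = zero_cols = 0
    for a in range(tile_rows):
        for b in range(tile_cols):
            tile = U[a * P:(a + 1) * P, b * Q:(b + 1) * Q]
            zero_rows += int((~tile.any(axis=1)).sum())  # input wires that can be pruned
            zero_cols += int((~tile.any(axis=0)).sum())  # output wires that can be pruned
    return (tile_rows, tile_cols), zero_rows, zero_cols


if __name__ == "__main__":
    U = np.ones((4, 4))
    U[0, 0:2] = 0.0   # one all-zero row group in the top-left crossbar
    U[2:4, 3] = 0.0   # one all-zero column group in the bottom-right crossbar
    print(count_zero_groups(U, P=2, Q=2))  # ((2, 2), 1, 1)
```

A single nonzero weight anywhere in a group keeps its wire alive, which is why scattered sparsity removes far fewer wires than group-wise sparsity.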
We harness group Lasso regularization to delete groups
of connections. Group Lasso is an efficient regularization
in the study of structured sparsity learning [17][18]. With
group Lasso regularization on each group of weights, a high
percentage of groups can be regularized to all-zeros. In
our group connection deletion method, weights are split into
row groups and column groups as illustrated in Figure 4,
and group Lasso regularization is enforced on each group.
Mathematically, the minimization function for training a neural
network with group Lasso can be formulated as:
$$E(W) = E_D(W) + \lambda \cdot \left( \sum_{g=1}^{G^{(r)}} \|W_g^{(r)}\| + \sum_{g=1}^{G^{(c)}} \|W_g^{(c)}\| \right), \tag{4}$$
where $W$ is the set of weights in the whole neural network and
$E_D(W)$ is the original minimization function when training
traditional neural networks. $G^{(r)}$ and $G^{(c)}$ respectively
denote the number of row groups and column groups, and
$W_g^{(r)}$ and $W_g^{(c)}$ are the sets of weights in the $g$-th row group
and column group, respectively. Moreover,
$$\bigcup_{g=1}^{G^{(r)}} W_g^{(r)} = \bigcup_{g=1}^{G^{(c)}} W_g^{(c)} = W. \tag{5}$$
$\lambda$ is the hyper-parameter to control the trade-off between
classification accuracy and routing congestion reduction. A
larger $\lambda$ can result in lower accuracy but larger reduction of
routing wires. During the back-propagation training with
Eq. (4), each weight $w$ will be updated as
$$w \leftarrow w - \eta \left( \frac{\partial E_D(W)}{\partial w} + \frac{\lambda w}{\|W_i^{(r)}\|} + \frac{\lambda w}{\|W_j^{(c)}\|} \right), \tag{6}$$
where $\eta$ is the learning rate, $i \in [1...G^{(r)}]$, $j \in [1...G^{(c)}]$,
$w \in W_i^{(r)}$ and $w \in W_j^{(c)}$.
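The regularizer of Eq. (4) and its contribution to the update of Eq. (6) can be sketched for a single tiled matrix as follows. The grouping (rows and columns inside each P × Q crossbar) is the same assumed reading of Figure 4 as elsewhere in these sketches, the function names are hypothetical, and the small `tiny` guard for all-zero groups, where the gradient is undefined, is a choice of this sketch rather than something the paper describes.

```python
import numpy as np


def group_lasso_penalty_and_grad(U, P, Q, tiny=1e-12):
    """Sum of row-group and column-group norms of U, and its gradient w.r.t. U."""
    N, K = U.shape
    penalty = 0.0
    grad = np.zeros_like(U, dtype=float)
    for a in range(0, N, P):
        for b in range(0, K, Q):
            tile = U[a:a + P, b:b + Q]
            row_norms = np.linalg.norm(tile, axis=1, keepdims=True)  # one norm per row group
            col_norms = np.linalg.norm(tile, axis=0, keepdims=True)  # one norm per column group
            penalty += row_norms.sum() + col_norms.sum()
            # w / ||W_i^(r)|| + w / ||W_j^(c)||; all-zero groups contribute 0.
            grad[a:a + P, b:b + Q] = (tile / np.maximum(row_norms, tiny)
                                      + tile / np.maximum(col_norms, tiny))
    return penalty, grad


def sgd_step(U, grad_data, P, Q, lam, eta):
    """One update of Eq. (6); grad_data is dE_D/dU supplied by back-propagation."""
    _, grad_reg = group_lasso_penalty_and_grad(U, P, Q)
    return U - eta * (grad_data + lam * grad_reg)


if __name__ == "__main__":
    rng = np.random.default_rng(2)
    U = rng.normal(size=(6, 4))
    penalty, grad = group_lasso_penalty_and_grad(U, P=3, Q=2)
    print(penalty > 0, grad.shape)
```

Because every weight belongs to exactly one row group and one column group, each entry of the gradient receives exactly two regularization terms, mirroring the two fractions in Eq. (6).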
With group connection deletion, we disconnect all the
zero-weighted connections and prune all the routing wires
connecting to all-zero row groups or column groups. After
deletion, we fine-tune (retrain) the structurally-sparse neu-
ral networks to improve accuracy.
Figure 5 plots the trends of deleted routing wires (i.e.,
all-zero row/column groups) and the classification accuracy
versus the iterations of group connection deletion. The
deletion process starts with the low-rank LeNet in Table 1 that
was already compressed by rank clipping. In Figure 5, we
only apply deletion to the matrices U and V whose dimensions
exceed the largest size of MBC. More design details are
presented in Section 3.3 and Section 4. Even for low-rank
neural networks, our method can delete the routing wires
dramatically, e.g., 93.9% of the interconnection wires are removed
in the crossbar array of fc1_v. Fine-tuning the deleted neural
networks attains the baseline accuracy (99.1%).
Note that compared with our method, it is more chal-
lenging to use traditional sparse neural networks to reduce
the routing wires. This is because its sparse weights are
Figure 5: The percentage of deleted routing wires and accuracy during group connection deletion. fc1_u and fc1_v are the low-rank matrices U and V of fc1 after rank clipping, and so forth.
[Plot omitted. x-axis: training iterations (5e2), 0 to 60. y-axes: % deleted routing wires (0 to 100) and accuracy (0.8 to 1). Series: conv2_u, fc1_u, fc1_v, fc2_u, Accuracy.]
randomly distributed in the crossbar arrays and the corre-
sponding routing wire must be preserved as long as there
is one nonzero weight existing in the row group or column
group.
3.3 Area Estimation
In this section, we formulate the area estimation method
adopted for hardware evaluation in this work.
MBC area estimation: The use of MBC in NCS design
has been extensively studied. As a critical component
in such a system, MBCs occupy a significant proportion of
the whole design area. Each MBC is an ultra-high-density
cross-point structure formed by a set of memristors and wires. The
area of a memristor cell in an MBC is $4F^2$ under the
state-of-the-art technology [11], where F is the minimum feature size.
Because of technology limitations, a feasible implementation
only uses MBCs that are not larger than
64×64 [10]. To ensure system reliability and robustness,
our standard library therefore contains only MBCs with dimensions
within 64×64. The connections of large weight matrices
in neural networks are distributed
across many MBCs from the library, as demonstrated in Figure 1.
Routing area estimation: Assume that the metal width
is $W_m$, the distance between two metals is $W_d$, and the
length of the $i$-th wire between crossbars is $L_i$. The total routing
area occupied by the wires can be roughly formulated as
$$A_r = (W_m + W_d) \sum_{i=1}^{N_w} L_i. \tag{7}$$
Here $N_w$ is the total wire count including electrostatic shielding
wires. Suppose the average wire length is linearly proportional
to $N_w$; the routing area is then estimated as
$$A_r = \alpha N_w^2, \tag{8}$$
where $\alpha$ is a scalar.
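The step from Eq. (7) to Eq. (8) can be spelled out as below. The symbols $\bar{L}$ (average wire length), $\beta$ (the proportionality constant) and $\rho$ (the fraction of wires that remain) are introduced here only for illustration and do not appear in the paper.

```latex
\bar{L} = \frac{1}{N_w} \sum_{i=1}^{N_w} L_i = \beta N_w
\quad\Longrightarrow\quad
A_r = (W_m + W_d)\, N_w \bar{L} = (W_m + W_d)\, \beta\, N_w^2 = \alpha N_w^2,
\qquad \alpha = (W_m + W_d)\, \beta .

\text{If only a fraction } \rho \text{ of the wires remains: }
\frac{A_r'}{A_r} = \frac{\alpha (\rho N_w)^2}{\alpha N_w^2} = \rho^2 .
```

Under this model, the remaining routing area scales with the square of the remaining wire fraction, so halving the wire count would cut the estimated routing area to a quarter.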
4. EXPERIMENT
In this section, experiments are conducted to evaluate the
effectiveness of proposed rank clipping and group connection
deletion methods. All the experiments conducted in this
paper are based on the NCS implemented by MBC. The
related experiment parameters on memristor and MBC are
summarized in Table 2. We mainly implement two neural
networks – LeNet on MNIST and ConvNet on CIFAR-10.
The detailed network structures are summarized in Table 1.
4.1 MBC Area Reduction
Table 2: Experiment Parameters
Parameter                             Value
Memristor cell area                   4F^2
Maximum crossbar size                 64×64
Wire length between two memristors    2F
In our experiments, we clip all the convolutional and
fully-connected layers except the last classifier layer. The original
rank of the last layer is determined by the number of classes,
so further reduction is meaningless. The rank clipping
method compresses each large weight matrix into two skinny
matrices by reducing the rank. Figure 6 shows the final
remaining ranks with respect to the accuracy and the tolerable
clipping error ε for the convolutional layers in LeNet. Here the
original ranks of conv1 and conv2 are 20 and 50, respectively, as
denoted by the upper markers on the stems. For each layer, the
rank decreases as ε increases, and finally reaches a very
small value, while the corresponding accuracy
is well maintained. We observe similar results in fc1.
More specifically, the layer-wise ranks of conv1, conv2 and fc1
are reduced to 5, 12 and 36 without accuracy loss, and to 4, 6
and 6 with merely 1% loss.
Figure 7(a) and (b) respectively plot the percentage of
remaining MBC area with respect to the classification error
for LeNet and ConvNet. Routing area is excluded in this
evaluation. The area of each layer is the sum of the areas of
U and V. The total area includes the area of the last classifier
layer, i.e., fc2 in LeNet or fc1 in ConvNet. For both networks,
the layer-wise areas of both convolutional layers and
fully-connected layers shrink rapidly with small accuracy loss.
In summary, rank clipping can reduce the total crossbar
area of LeNet to 13.62% without any accuracy
loss. The crossbar area can be further reduced to 3.78% with
merely 1% accuracy loss. For the more challenging ConvNet,
no accuracy loss is observed when the crossbar area is
reduced to 51.81%, and under an accuracy loss of 1% the
total crossbar area can be reduced to 38.14%.
4.2 Routing Area Reduction
To evaluate the routing congestion alleviated by group
connection deletion method, we use the number of routing
wires and remained routing area of Eq. (8) as our metrics.
Although the estimation of routing area in the real circuit
can be more complex, the real routing area reduction in the
Figure 6: The remaining ranks in convolutional layers of LeNet. fc1 is omitted for better visualization as its original rank 500 is out of chart.
[3-D stem plot omitted. Axes: rank (0 to 50), accuracy (0.9 to 1), and a third axis from 0 to 0.2. Series: conv1, conv2.]
Figure 7: The MBC area for (a) LeNet and (b) ConvNet, after applying the rank clipping.
[Plots omitted. y-axis: crossbar area (0% to 100%). x-axis: error, 0.8% to 2.6% in (a) and 17.5% to 20.5% in (b). Series: conv1, conv2, fc1, total in (a); conv1, conv2, conv3, total in (b).]
Table 3: MBC sizes and remained routing wires in big layers.
Net       Type      conv1_u   conv2_u   conv3_u   fc1_u     fc1_v     fc_last
LeNet     sizes     – †       50×12     –         50×36     36×50     50×10
LeNet     % wires   –         47.5      –         24.8      6.7       18.0
ConvNet   sizes     25×12     50×19     50×22     –         –         64×10
ConvNet   % wires   83.3      40.5      74.4      –         –         81.9
† The weight matrix can be implemented by one crossbar.
conv1_v, conv2_v and conv3_v are omitted for the same reason.
hardware must be positively correlated to our results.
As mentioned in Section 3.3, our standard library
contains all types of memristor crossbars with dimensions
constrained within 64 × 64. When implementing an N × K
weight matrix U, the MBC sizes are selected based on the
following criteria: (1) implement U in a single N × K MBC when
N ≤ 64 and K ≤ 64; (2) implement U by an array of MBCs
when N > 64 or K > 64, using the largest available MBC size
P × Q such that N and K are divisible by P and Q, respectively.
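The selection criteria above can be sketched as a small Python helper. It interprets "largest available size" as choosing P and Q independently, each being the largest divisor not exceeding 64; that interpretation, the function names, and the example matrix shapes (inferred, since the paper does not list them) are assumptions of this sketch. With those shapes the helper returns the crossbar sizes listed in Table 3.

```python
MAX_DIM = 64  # maximum crossbar dimension (Table 2)


def largest_divisor_up_to(n: int, limit: int = MAX_DIM) -> int:
    """Largest divisor of n that does not exceed limit."""
    for d in range(min(n, limit), 0, -1):
        if n % d == 0:
            return d
    raise ValueError("n must be a positive integer")


def select_mbc(n: int, k: int):
    """Return the crossbar size (P, Q) and the array shape (N/P, K/Q) for an N x K matrix."""
    p = largest_divisor_up_to(n)  # equals n itself when n <= 64 (criterion 1)
    q = largest_divisor_up_to(k)
    return (p, q), (n // p, k // q)


if __name__ == "__main__":
    print(select_mbc(800, 36))   # assumed shape of LeNet fc1_u   -> ((50, 36), (16, 1))
    print(select_mbc(75, 12))    # assumed shape of ConvNet conv1_u -> ((25, 12), (3, 1))
    print(select_mbc(1024, 10))  # assumed shape of ConvNet fc_last -> ((64, 10), (16, 1))
```

Requiring exact divisibility avoids partially filled crossbars at the edges of the array, at the price of sometimes choosing a size well below 64, such as 25 for a 75-row matrix.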
In the experiments, the group connection deletion starts
with the rank-clipped LeNet or ConvNet without accuracy
loss as presented in Table 1. Based on the MBC selection
criteria, the sizes of MBC utilized in big layers are shown
in Table 3. Matrices with sizes constrained by 64 × 64 are
omitted in the table, and no group Lasso regularization is
enforced on those small matrices.
The experimental results of the remaining routing wires after
applying group connection deletion without allowing
accuracy loss are also presented in Table 3. The results for
LeNet are remarkable. We achieve the same accuracy
as the baseline with routing wires being only 47.5%, 24.8%,
6.7% and 18.0% of the original ones in the respective layers. This
reduces the layer-wise routing area to 8.1%, on average.
Table 3 also shows that, in ConvNet, our method on average
reduces the layer-wise routing wires to 70.03% and thus
reduces the layer-wise routing areas to 52.06%, while achieving
the same accuracy as the baseline. With an acceptable
accuracy loss, the routing congestion can also be
significantly alleviated. Figure 8 studies the
remaining routing wires and routing area under different
classification errors. With merely 1.5% accuracy loss, the routing
area in each layer is reduced to 56.25%, 7.64%, 21.44%
and 31.64%, respectively.
Figure 8: The (a) routing wire and (b) routing area w.r.t. the classification error in ConvNet.
[Plots omitted. x-axis: classification error, 0.175 to 0.2. y-axis: remained routing wires in (a) and routing area in (b), 0% to 100%. Series: conv1, conv2, conv3, fc1.]
Figure 9: Weight matrices (transposed) after group connection deletion. The deletion starts from the rank-clipped ConvNet in Table 1. Matrices are plotted in scale in the order of conv1_u, conv2_u, conv3_u and fc1. White regions have no connections. Connections in each blue/red block are implemented in a crossbar.
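The average routing-area figures quoted from Table 3 can be reproduced from the per-layer wire percentages under the quadratic model of Eq. (8). The sketch below assumes that "layer-wise average" means an unweighted mean over the four listed matrices of each network; the function name is hypothetical.

```python
# Remaining routing wires (%) per big layer, copied from Table 3.
REMAINING_WIRES = {
    "LeNet": [47.5, 24.8, 6.7, 18.0],
    "ConvNet": [83.3, 40.5, 74.4, 81.9],
}


def mean_routing_area_percent(wire_percents):
    """Unweighted mean of the per-layer area ratios, each being (wire ratio)^2 by Eq. (8)."""
    ratios = [(p / 100.0) ** 2 for p in wire_percents]
    return 100.0 * sum(ratios) / len(ratios)


if __name__ == "__main__":
    for net, wires in REMAINING_WIRES.items():
        print(net, round(mean_routing_area_percent(wires), 2))
    # LeNet 8.1
    # ConvNet 52.06
```

The same squaring explains why a layer that keeps 6.7% of its wires contributes well under 1% of its original routing area.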
Finally, Figure 9 shows the sparse weight matrices after
group connection deletion for ConvNet in Table 3 without
accuracy loss. Each blue/red block stands for a collection of
weights that is implemented by one crossbar in the NCS
design. White regions indicate that there are no connections.
After applying group connection deletion, the connections
in crossbars become sparse. More importantly, the
sparsity is structural rather than randomly distributed
as in traditional sparse neural networks. In the figure, a high
ratio of column groups in crossbars are regularized to
all-zeros, such that the interconnection wires routing from those
crossbar columns can be removed. Notably, as conv2_u
and fc1 in the figure show, some blocks have no connections
in the whole region, indicating that the entire crossbar can
be removed in the NCS implementation. This is significant
because it not only alleviates routing congestion but also
reduces crossbar area. We also note that a crossbar
with some zero columns/rows can be replaced by a smaller
but dense crossbar after removing those zero groups, which
can further reduce the crossbar area.
5. CONCLUSIONS
In this work, we propose a framework named group scis-
sor to alleviate the impact of hardware limitations on the
NCS implementation of big neural networks. Specifically,
rank clipping and group connection deletion methods are
proposed to reduce area consumption of synapse crossbars
and routing area between crossbars, respectively. Final ex-
periments show that our methods can reduce crossbar area
(routing area) to 13.62% (8.1%) with no accuracy loss for
LeNet. Moreover, no accuracy loss is observed for more chal-
lenging ConvNet when crossbar area is reduced to 51.81%
and routing area is reduced to 52.06%. The proposed frame-
work can significantly save hardware area and improve sys-
tem scalability.
6. REFERENCES
[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet
classification with deep convolutional neural networks,” in
NIPS, pp. 1097–1105, 2012.
[2] S. H. Jo, T. Chang, I. Ebong, B. B. Bhadviya,
P. Mazumder, and W. Lu, “Nanoscale memristor device as
synapse in neuromorphic systems,” Nano letters, vol. 10,
no. 4, pp. 1297–1301, 2010.
[3] S. K. Esser, A. Andreopoulos, R. Appuswamy, P. Datta,
D. Barch, A. Amir, J. Arthur, A. Cassidy, M. Flickner,
P. Merolla, et al., “Cognitive computing systems:
Algorithms and applications for networks of neurosynaptic
cores,” in IJCNN, pp. 1–10, 2013.
[4] C. Xu, X. Dong, N. P. Jouppi, and Y. Xie, “Design
implications of memristor-based rram cross-point
structures,” in DATE, pp. 1–6, 2011.
[5] M. Hu, H. Li, Y. Chen, Q. Wu, G. S. Rose, and R. W.
Linderman, “Memristor crossbar-based neuromorphic
computing system: A case study,” IEEE transactions on
neural networks and learning systems, vol. 25, no. 10,
pp. 1864–1878, 2014.
[6] B. Li, Y. Wang, Y. Wang, Y. Chen, and H. Yang, “Training
itself: Mixed-signal training acceleration for
memristor-based neural network,” in ASP-DAC,
pp. 361–366, 2014.
[7] W. Wen, C. Wu, Y. Wang, K. Nixon, Q. Wu, M. Barnell,
H. Li, and Y. Chen, “A new learning method for inference
accuracy, core occupation, and performance
co-optimization on truenorth chip,” in DAC, pp. 1–6, 2016.
[8] D. B. Strukov, G. S. Snider, D. R. Stewart, and R. S.
Williams, “The missing memristor found,” nature, vol. 453,
no. 7191, pp. 80–83, 2008.
[9] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual
learning for image recognition,” arXiv:1512.03385, 2015.
[10] J. Liang and H.-S. P. Wong, “Cross-point memory array
without cell selectors – device characteristics and data
storage pattern dependencies,” IEEE Transactions on
Electron Devices, vol. 57, no. 10, pp. 2531–2538, 2010.
[11] B. Liu, H. Li, Y. Chen, X. Li, T. Huang, Q. Wu, and
M. Barnell, “Reduction and ir-drop compensations
techniques for reliable neuromorphic computing systems,”
in ICCAD, pp. 63–70, 2014.
[12] W. Wen, C.-R. Wu, X. Hu, B. Liu, T.-Y. Ho, X. Li, and
Y. Chen, “An eda framework for large scale hybrid
neuromorphic computing systems,” in DAC, p. 12, 2015.
[13] F. Akopyan, J. Sawada, A. Cassidy, R. Alvarez-Icaza,
J. Arthur, P. Merolla, N. Imam, Y. Nakamura, P. Datta,
G.-J. Nam, et al., “Truenorth: Design and tool flow of a 65
mw 1 million neuron programmable neurosynaptic chip,”
IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems, vol. 34, no. 10,
pp. 1537–1557, 2015.
[14] L. Song, X. Qian, H. Li, and Y. Chen, “PipeLayer: A
pipelined ReRAM-based accelerator for deep learning,”
HPCA, 2017.
[15] I. Jolliffe, Principal component analysis. Wiley Online
Library, 2002.
[16] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner,
“Gradient-based learning applied to document recognition,”
Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324,
1998.
[17] M. Yuan and Y. Lin, “Model selection and estimation in
regression with grouped variables,” Journal of the Royal
Statistical Society., vol. 68, no. 1, pp. 49–67, 2006.
[18] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning
structured sparsity in deep neural networks,” in NIPS,
pp. 2074–2082, 2016.
