Pre-Defined Sparse Neural Networks with Hardware Acceleration by Dey, Sourya et al.
Sourya Dey, Kuan-Wen Huang, Peter A. Beerel, Senior Member, IEEE, and Keith M. Chugg, Fellow, IEEE
Abstract—Neural networks have proven to be extremely pow-
erful tools for modern artificial intelligence applications, but
computational and storage complexity remain limiting factors.
This paper presents two compatible contributions towards re-
ducing the time, energy, computational, and storage complexities
associated with multilayer perceptrons. Pre-defined sparsity is
proposed to reduce the complexity during both training and
inference, regardless of the implementation platform. Our results
show that storage and computational complexity can be reduced
by factors greater than 5X without significant performance
loss. The second contribution is an architecture for hardware
acceleration that is compatible with pre-defined sparsity. This
architecture supports both training and inference modes and
is flexible in the sense that it is not tied to a specific number
of neurons. For example, this flexibility implies that various
sized neural networks can be supported on various sized Field
Programmable Gate Arrays (FPGAs).
Index Terms—Machine learning, Neural network, Multilayer
perceptron, Sparsity, Hardware Acceleration
I. INTRODUCTION
NEURAL networks are critical drivers of new technologies such as computer vision, speech recognition, and
autonomous systems. As more data have become available,
the size and complexity of neural networks (NNs) have risen
sharply, with modern NNs containing millions or even billions
of trainable parameters [1], [2]. These massive NNs come
with the cost of large computational and storage demands.
The current state of the art is to train large NNs on Graphics
Processing Units (GPUs) in the cloud – a process that can take
days to weeks even on powerful GPUs [1]–[3] or similar pro-
grammable processors with multiply-accumulate accelerators
[4]. Once trained, the model can be used for inference which
is less computationally intensive and is typically performed on
more general purpose processors (i.e., Central Processing Units
(CPUs)). It is increasingly desirable to run inference, and even
some re-training, on embedded processors which have limited
resources for computation and storage. In this regard, model
reduction has been identified as a key to NN acceleration by
several prominent researchers [5]. This is generally performed
post-training to reduce the memory requirements to store
the model for inference – e.g., methods for quantization,
compression, and grouping parameters [6]–[9].
The authors are with the Ming Hsieh Department of Electrical Engineering,
University of Southern California, Los Angeles, CA, 90089 USA e-mail:
{souryade, kuanwenh, pabeerel, chugg}@usc.edu
Manuscript submitted December 3, 2018.
This work is partly supported by NSF, Software and Hardware Foundations,
Grant 1763747.
Decreasing the time, computation, storage, and energy costs
for training and inference is therefore a highly relevant goal. In
this paper we present two compatible methods towards this end
goal: (i) a method for introducing sparsity in the connection
patterns of NNs, and (ii) a flexible hardware architecture that
is compatible with training and inference-only operation and
supports the proposed sparse NNs. Our approach to sparsifying
a NN is extremely simple and results in a large reduction
in storage and computational complexity both in training
and inference modes. Moreover, this method is not tied to
the hardware acceleration and provides the same benefits for
training and inference in software under the current paradigm.
The hardware architecture is massively parallel, but not tightly
coupled to a specific NN architecture (i.e., not tied to the
number of nodes in a layer). Instead, the architecture allows for
maximum throughput for a given amount of circuit resources.
Our approach to making a NN sparse is to specify a
sparse set of neuron connections prior to training and to
hold this pattern fixed throughout training and inference. We
refer to this method of simply excluding some fixed set of
connections in the NN as pre-defined sparsity. There are
several methods in the literature related to sparse NNs, but
most do not reduce the computation and storage complexity
associated with training, which is a primary goal of this work.
One related concept is drop-out [10] where selected edges in
the NN are not processed during some steps of the training
process, but the final result is a Fully-connected (FC) NN for
inference. Another set of approaches target producing a sparse
NN for inference, but use FC NNs during training. Among
these are pruning and trimming methods that post-process
the trained NN to produce a sparse NN for inference mode
[11]–[13]. As mentioned above, other methods have been
proposed for reducing the complexity of performing inference
on a trained FC NN such as quantization, compression, and
grouping parameters [6]–[9]. Other research has suggested
a method of learning sparsity during training that begins
training a fully-connected NN and uses a cost regularizer that
promotes sparsity in the trained model [14]. Note that none of
these methods substantially reduces the complexity of
training; they instead target inference models that have lower
complexity. One method aimed at reducing both training and
inference complexity is using NNs with structured, but not
sparse, weight matrices [15], [16]. Finally, we note that several
authors have very recently proposed pre-defined sparse NNs
[17]–[19] independently of our published work [20]–[22].
arXiv:1812.01164v1 [cs.LG] 4 Dec 2018
Motivated by the fact that specialized hardware is typically
faster and more energy efficient than GPUs and CPUs, there
exists a large body of literature in NN hardware acceleration.
The vast majority of this addresses only inference given a
trained model [9], [23]–[26], with few addressing hardware
accelerated training [27]. The work of [27], for example,
targets a specific size NN – i.e., the logic and memory
architecture is tied to the number of neurons in a layer.
We propose an architecture that supports training, but can be
simplified for inference-only mode, and is flexible to the NN
size. This is particularly attractive for FPGA implementations.
Specifically, the proposed architecture produces the maximum
throughput on a given FPGA for a given NN and can therefore
support various sized NNs on various sized FPGAs. This is
accomplished by an edge-based processing architecture that
can process z edges in a given layer in parallel (i.e., we refer
to z as the degree of parallelism). A given FPGA can support
some largest value of z, and NNs with more edges will simply
take more clock cycles to process.1
Our edge-based architecture is inspired by architectures
proposed for iterative decoding of modern sparse-graph-based
error correction codes (i.e., Turbo and Low Density Parity
Check (LDPC) codes) (cf., [28], [29]). In particular, for a
given processing task, there are z logic units to perform the
task and z memories to store the quantities associated with
the task. A challenge with this architecture, shared between the
decoding and NN applications, is that, in order to achieve high-
throughput without memory duplication, the parallel memories
must be accessed in two manners: natural order and interleaved
order. In natural order, each computation unit is associated
with one memory and accesses the elements of that memory
sequentially. For interleaved order access, the z computational
units must access the memories such that no memory is
accessed more than once in a cycle. Such an addressing pattern
is called clash-free, and this property ensures that no memory
contention occurs so that no stalls or wait states are required.
For modern codes, the clash-free property of the memories is
ensured by defining clash-free interleavers (i.e., permutations)
[30], or clash-free parity check matrices [29]. In the context of
NNs, this clash-free property is tied to the connection patterns
between layers of neurons.
In addition to z degrees of parallelism in edge processing
in a given layer, our architecture is pipelined across layers.
Thus, there is a degree of parallelism associated with each
layer (i.e., zi for layer i) selected to set the number of cycles
required to process a layer to a constant – i.e., larger layers
have larger z so that the computation time of all layers is the
same. For an (L + 1)-layer NN there are L pipeline stages
so that a given NN input is processed in the time it takes
to complete the processing of the edges in a single layer.
Furthermore, the three operations associated with training
– Feedforward (FF), Backpropagation (BP), and Update of
Trainable Parameters (UP) – are performed in parallel. The
architecture may be simplified to perform only inference by
eliminating the logic and memory associated with BP and
UP. Furthermore, while the architecture supports the reduced
sparse complexity NNs, it is also compatible with traditional
FC networks. Interestingly, very recent work proposed pipelin-
ing across layers for an inference-only accelerator [31], as
well as a scalable edge-based architecture for training [32]
independently of our published work [20], [21]. Neither of
these other recent works, however, takes advantage of pre-
defined sparsity in the network.
1 We use the terms ‘connection’ and ‘edge’ interchangeably, as we
do with ‘node’ and ‘neuron’. Also, the term ‘cycle’ will mean ‘clock cycle’,
unless otherwise stated.
In Section II we provide motivation for and simple examples
of the effectiveness of pre-defined sparsity. In Section III the
hardware architecture is described in detail, including defining
a class of simple clash-free connection patterns with low
address generation complexity. Section IV contains a detailed
simulation study of pre-defined sparsity in NNs based on four
different classification datasets – MNIST handwritten digits
[33], Reuters news articles [34], TIMIT speech corpus [35],
and CIFAR-100 images [36]. We identify a set of trends or
design guidelines in this section as well. This section also
demonstrates that the simple, hardware-compatible clash-free
connection patterns provide performance on par with or better than
that of randomly connected sparse patterns. Finally, in Section
V we consider the issue of whether pre-defining the structured
sparse patterns causes a significant performance loss relative
to other sparse methods having a similar number of parameters.
We find that there is no significant performance degradation
and therefore our hardware architecture can provide training
and inference performance commensurate with state-of-the-art
sparsity methods.
II. STRUCTURED PRE-DEFINED SPARSITY
A. Definitions, Notation, and Background
An (L + 1)-layer Multilayer Perceptron (MLP) has Ni
nodes in the ith layer, described collectively by the neuronal
configuration Nnet = (N0, N1, · · · , NL), where layer 0 is the
input layer. We use the convention that layer i is to the ‘right’
of layer i − 1. There are L junctions between layers, with
junction i connecting the Ni−1 nodes of its left layer i − 1
with the Ni nodes of its right layer i.
We define pre-defined sparsity as simply not having all
$N_{i-1} N_i$ edges present in junction $i$. Furthermore, we define
structured pre-defined sparsity so that for a given junction $i$,
each node in its left layer has fixed out-degree – i.e., $d_i^{\mathrm{out}}$
connections to its right layer, and each node in its right layer
has fixed in-degree – i.e., $d_i^{\mathrm{in}}$ connections from its left layer.
FC NNs have $d_i^{\mathrm{out}} = N_i$ and $d_i^{\mathrm{in}} = N_{i-1}$ with $N_{i-1} N_i$ edges
in the $i$th junction, while a sparse NN has at least one junction
with fewer than this number of edges. The number of edges (or
weights) in junction $i$ is given by $|W_i| = N_{i-1} d_i^{\mathrm{out}} = N_i d_i^{\mathrm{in}}$.
The density of junction $i$ is measured relative to FC and
denoted as $\rho_i = |W_i| / (N_{i-1} N_i)$. The structured constraint
implies that the number of possible $\rho_i$ values is equal to the
Greatest Common Divisor (gcd) of $N_{i-1}$ and $N_i$, as shown in
Appendix A. The overall density is
$$\rho_{\mathrm{net}} = \frac{\sum_{i=1}^{L} |W_i|}{\sum_{i=1}^{L} N_{i-1} N_i} \qquad (1)$$
Thus, specifying $N_{\mathrm{net}}$ and the out-degree configuration
$d_{\mathrm{net}}^{\mathrm{out}} = (d_1^{\mathrm{out}}, \cdots, d_L^{\mathrm{out}})$ determines the density of each
junction and the overall density.
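As a minimal sketch of these definitions, the following Python snippet computes the edge count and density of each junction, the overall density, and the out-degrees that satisfy the structured constraint. The function names are illustrative and do not come from the paper; the enumeration assumes that the only requirement is an integer in-degree, i.e., that $N_{i-1} d_i^{\mathrm{out}}$ is divisible by $N_i$.

```python
from math import gcd


def feasible_out_degrees(n_left, n_right):
    """Out-degrees in [1, n_right] for which d_in = n_left * d_out / n_right is an integer."""
    step = n_right // gcd(n_left, n_right)
    return list(range(step, n_right + 1, step))


def densities(n_net, d_out_net):
    """Return (edges per junction, density per junction, overall density)."""
    edges = [n_net[i] * d for i, d in enumerate(d_out_net)]
    full = [n_net[i] * n_net[i + 1] for i in range(len(d_out_net))]
    rho = [e / f for e, f in zip(edges, full)]
    return edges, rho, sum(edges) / sum(full)


edges, rho, rho_net = densities((800, 100, 10), (20, 10))
print(edges, rho, round(rho_net, 3))   # [16000, 1000] [0.2, 1.0] 0.21
print(feasible_out_degrees(12, 8))     # [2, 4, 6, 8]
```

For a junction with 12 left and 8 right neurons, four out-degrees are feasible, which matches gcd(12, 8) = 4.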
We will also consider random pre-defined sparsity, where
connections are distributed randomly given preset ρi values
without constraints on in- and out-degrees. In Sec. IV-B we
show that random pre-defined sparsity is undesirable at low
densities because it may result in unconnected neurons.
The standard equations for FC NNs are well-known [37].
For a NN using structured pre-defined sparsity, only the
weights corresponding to connected edges are stored in mem-
ory and used in computation. This leads to the modified equa-
tions (2)–(4), where subscripts denote layer/junction numbers,
single superscripts denote neurons in a layer, and double superscripts
denote (right neuron, left neuron) in a junction. The FF
processing proceeds left-to-right and computes the activations
$a_i$ and associated derivatives $\dot{a}_i$ for each layer by applying an
activation function $\mathrm{act}(\cdot)$ to a linear combination of biases $b_i$,
junction weights $W_i$ and preceding layer activations $a_{i-1}$
$$h_i^{(j)} = \sum_{f=1}^{d_i^{\mathrm{in}}} W_i^{(j,k_f)} a_{i-1}^{(k_f)} + b_i^{(j)} \qquad (2a)$$
$$a_i^{(j)} = \mathrm{act}\left(h_i^{(j)}\right) \qquad (2b)$$
$$\dot{a}_i^{(j)} = \frac{d a_i^{(j)}}{d h_i^{(j)}} = \dot{\mathrm{act}}\left(h_i^{(j)}\right) \qquad (2c)$$
Note that (2c) is used in training, but is not required in
inference mode. The BP computation is done only in training
and computes a sequence of error values from right-to-left
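$$\delta_L^{(j)} = \frac{\partial\, l^{(j)}\left(a_L^{(j)}, y^{(j)}\right)}{\partial h_L^{(j)}} \qquad (3a)$$
$$\delta_i^{(j)} = \dot{a}_i^{(j)} \left( \sum_{f=1}^{d_i^{\mathrm{out}}} W_{i+1}^{(k_f,j)} \delta_{i+1}^{(k_f)} \right) \qquad (3b)$$
where $l^{(j)}\left(a_L^{(j)}, y^{(j)}\right)$ is the $j$th component of the loss function.
Finally, stochastic gradient UP is given by
$$b_i^{(j)} \leftarrow b_i^{(j)} - \eta\, \delta_i^{(j)} \qquad (4a)$$
$$W_i^{(j,k)} \leftarrow W_i^{(j,k)} - \eta\, a_{i-1}^{(k)} \delta_i^{(j)} \qquad (4b)$$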
where η is the learning rate. The parameters on the left-hand side
of (2)–(4) will be referred to as the network parameters, with
the weights and biases being the trainable parameters.
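Equations (2)–(4) can be sketched for a single junction in plain Python as follows. This is an illustrative sketch rather than the implementation used in this work: the data layout (`conn[j]` lists the left neurons wired to right neuron j, with `weights[j]` stored in the same order), the function names, and the choice of ReLU as the activation function are all assumptions made for the example.

```python
def relu(h):
    return h if h > 0 else 0.0


def relu_dot(h):
    return 1.0 if h > 0 else 0.0


def feedforward(conn, weights, bias, a_prev):
    """Eq. (2): activations and activation derivatives of the right layer."""
    h = [sum(w * a_prev[k] for w, k in zip(weights[j], conn[j])) + bias[j]
         for j in range(len(conn))]
    return [relu(x) for x in h], [relu_dot(x) for x in h]


def backprop(conn, weights, delta, a_dot_prev):
    """Eq. (3b): propagate right-layer errors back over the stored edges only."""
    acc = [0.0] * len(a_dot_prev)
    for j, lefts in enumerate(conn):
        for w, k in zip(weights[j], lefts):
            acc[k] += w * delta[j]
    return [d * s for d, s in zip(a_dot_prev, acc)]


def update(conn, weights, bias, a_prev, delta, eta):
    """Eq. (4): in-place stochastic gradient update of biases and stored weights."""
    for j, lefts in enumerate(conn):
        bias[j] -= eta * delta[j]
        for f, k in enumerate(lefts):
            weights[j][f] -= eta * a_prev[k] * delta[j]


# Toy junction: 4 left neurons, 2 right neurons, in-degree 2, out-degree 1.
conn = [[0, 1], [2, 3]]
weights = [[0.5, -1.0], [1.0, 0.25]]
bias = [0.0, 1.0]
a_prev = [1.0, 2.0, 3.0, 4.0]

a, a_dot = feedforward(conn, weights, bias, a_prev)
delta_prev = backprop(conn, weights, [2.0, -1.0], [1.0, 1.0, 0.0, 1.0])
update(conn, weights, bias, a_prev, [2.0, -1.0], eta=0.1)
```

Only the weights of connected edges are stored and touched, so the work in each function scales with the number of edges rather than with the size of a full weight matrix.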
B. Motivation and Preliminary Examples
Pre-defined sparsity can be motivated by inspecting the
histogram for trained weights in a FC NN. There have been
previous efforts to study such statistics [3], [38], but
not for individual junctions. Fig. 1 shows weight histograms
for each junction in both a 2-junction and 4-junction FC NN
trained on the MNIST dataset. Note that many of the weights
are zero or near-zero after training, especially in the earlier
junctions. This motivates the idea that some weights in these
layers could be set to zero (i.e., the edges excluded). Even
with this intuition, it is not obvious that one can pre-define a set of
weights to be zero and let the NN learn around this pre-defined
sparsity constraint. Fig. 1(c) and (h) show that this is in fact
possible: they plot classification accuracy as a function
of the overall density ρnet for structured pre-defined sparsity.
Since the computational and storage complexity is directly
proportional to the number of edges in the NN, operating at an
overall density of, for example, 50% results in a 2X reduction
in complexity both during training and inference. Detailed
numerical experiments in Section IV build on these simple
examples. However, before we proceed to those results, it is
important to consider a hardware architecture that can support
structured pre-defined sparsity and consider the additional
clash-free constraints placed on the connection patterns so that
these can be considered in the studies in Section IV.
Fig. 1. Histograms of weight values in different junctions for FC NNs trained
on MNIST for 50 epochs, with (a-b) Nnet = (800, 100, 10), and (d-g)
Nnet = (800, 100, 100, 100, 10). Test accuracy shown in (c,h) for different
NNs with same Nnet and varying ρnet. The overall density ρnet is set by
reducing ρ1 since junction 1 has more weights close to zero in the FC cases
(circled).
III. HARDWARE ARCHITECTURE
In this section we describe the proposed flexible hardware
architecture outlined in the Introduction. The overall architec-
tural view is captured by Fig. 2: sub-figure (a) shows parallel
edge processing within a junction with degree of parallelism
3, (b) shows clash-free memory access, and (c) junction
pipelining and parallel processing of the three operations – FF,
BP, UP. The toy example in Fig. 2(a)-(b) is for Ni−1 = 6,
Ni = 3, ρi = 6/18 = 1/3, and zi = 3. Fig. 2(a) shows that
the zi = 3 blue edges are processed in parallel in one cycle,
while the pink edges are processed in parallel during the next
cycle. Fig. 2(b) shows how the zi = 3 FF processing logic
units access the memories in natural and interleaved order.
As described in detail in Sec. III-B, the interleaved order
access may represent reading of the activations $\{a_{i-1}^{(j)}\}$ for
$j \in \{0, 1, 5\}$ and the natural order access may correspond
to writing the computed activations $\{a_i^{(j)}\}$ for $j \in \{0, 1, 2\}$.
On the next cycle, the remaining memory locations (i.e., the
white cells) will be accessed. Note that this illustrates a clash-
free connection pattern since each of the zi = 3 memories is
accessed no more than once in each cycle – i.e., one hit per
column on each access.
Fig. 2. (a) Processing zi = 3 edges in each cycle (blue in cycle 0, pink in cycle 1) for some junction i. (b) Accessing zi = 3 memories – M0, M1 and
M2 shown as columns – from two separate banks, one in natural order (same address from each memory), the other in interleaved order. Clash-freedom is
achieved by accessing only one element from each memory. The accessed values are fed to zi = 3 processors to perform FF simultaneously. (c) Operational
parallelism in each junction (vertical dotted lines denote processing for one junction), and junction pipelining of each operation across junctions (horizontal
dashed lines) in a multi-junction NN. Subfigure (c) is modified from our previous conference publication [20, Fig. 2(c)]
The junction-based operation in Fig. 2(b) is repeated for
each junction in a pipeline. In particular, there are L pipeline
stages. For example, for the FF pipeline, while the first stage
is processing input vector n + L on junction 1, the second
stage is processing input vector n + L − 1 on junction 2.
The degree of parallelism for each junction is selected so that
the processing time for any operation (FF/BP/UP) is the same
for each junction. Thus the throughput, i.e., the frequency of
processing input samples, is determined by the time taken to
perform a single operation in a single junction.
In summary, the architecture is (i) edge-based and not tied
to a specific number of nodes in a layer, (ii) flexible in that
the amount of logic is determined by the degree of parallelism
which trades size for speed, and (iii) fully pipelined for the
parallel operations associated with NN training. Also note that
the architecture can be specialized to perform only inference
by removing the logic and memory associated with the BP
and UP operations, and the a˙i computation in (2c).
A key concern when implementing NNs on hardware is
the large amount of storage required. Several characteristics
regarding memory requirements guided us in developing the
proposed architecture. Firstly, since weight memories are the
largest, their number should be minimized. Secondly, having
a few deep memories is more efficient in terms of power
and area than having many shallow memories [39]. Thirdly,
throughput should be maximized without duplicating memo-
ries, hence the need for clash-free connection patterns.
In Sec. III-A, we describe junction pipelining design which
attempts to minimize weight storage resources. The memory
organization within a junction is described in Sec. III-B, and
is designed to minimize the number of memories for a given
degree of parallelism. Finally, clash-free access conditions are
developed in Sec. III-B and III-C, and a simple method for
implementing such patterns is given in Sec. III-C.
A. Junction pipelining and Operational parallelism
Our edge-based architecture is motivated by the fact that all
three operations – FF, BP, UP – use the same weight values
for computation. Since zi edges are processed in parallel in
a single cycle, the time taken to complete an operation in
junction i is (Ci = |Wi| /zi) cycles. The degree of parallelism
configuration znet = (z1, · · · , zL) is chosen to achieve Ci =
C ∀ i ∈ {1, · · · , L}. This allows efficient junction pipelining
since each operation takes exactly C cycles to be completed
for each input in each junction, which we refer to as a junction
cycle.2 This determines throughput.
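The balancing condition Ci = C can be illustrated with a short sketch. The helper name is hypothetical, the edge counts and the value C = 50 are arbitrary choices for illustration, and the sketch assumes that C divides every |Wi| exactly while ignoring any pipeline flush cycles.

```python
def balanced_parallelism(edge_counts, junction_cycle):
    """Pick z_i = |W_i| / C so that every junction takes the same C cycles."""
    z_net = []
    for num_edges in edge_counts:
        if num_edges % junction_cycle != 0:
            raise ValueError("junction cycle must divide the edge count")
        z_net.append(num_edges // junction_cycle)
    return z_net


# Two junctions with 16000 and 1000 edges, balanced to C = 50 cycles each.
print(balanced_parallelism([16000, 1000], junction_cycle=50))  # [320, 20]
```

The junction with more edges receives a proportionally larger degree of parallelism, so both junctions finish in the same number of cycles.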
The following is an analysis of Fig. 2(c) in more detail
for an example NN with L = 2. While a new training input
numbered n+3 is getting loaded as a0, junction 1 is processing
the FF stage for the previous input n+ 2 and computing a1.
Simultaneously, junction 2 is processing FF and computing
cost δL via cost derivatives for input n + 1. It is also doing
BP on input n to compute δ1, as well as updating (UP)
its parameters from the finished δL computation of input n.
Simultaneously, junction 1 is performing UP using δ1 from the
finished BP results of input n−1.3 This results in operational
parallelism in each junction, as shown in Fig. 3. The combined
speedup is approximately a factor of 3L as compared to doing
one operation at a time for a single input.
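The L = 2 walkthrough above can be reproduced with a small scheduling sketch. The closed-form indices used here (FF in junction i works on input t − i, while BP and UP in junction i work on input t − 2L − 1 + i, where t is the input currently being loaded) are a generalization inferred from that walkthrough and are not formulas stated in the text; the function name is likewise hypothetical.

```python
def pipeline_schedule(num_junctions, t):
    """Map (operation, junction) to the index of the input processed while input t is loaded."""
    L = num_junctions
    schedule = {}
    for i in range(1, L + 1):
        schedule[("FF", i)] = t - i
        schedule[("UP", i)] = t - 2 * L - 1 + i
        if i > 1:  # no BP in junction 1, since no delta_0 values are computed
            schedule[("BP", i)] = t - 2 * L - 1 + i
    return schedule


n = 10
for key, value in sorted(pipeline_schedule(2, n + 3).items()):
    print(key, "-> input", value)
```

With n = 10 and input n + 3 = 13 being loaded, the sketch assigns inputs 12 and 11 to FF in junctions 1 and 2, input 10 to BP and UP in junction 2, and input 9 to UP in junction 1, in agreement with the example.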
Notice from Fig. 3 that there is only one weight memory
bank which is accessed for all three operations. However, UP
in junction 1 needs access to a0 for input n − 1, as per the
weight update equation (4b). This means that there need to be
2L + 1 = 5 left activation memory banks for storing a0 for
inputs n − 1 to n + 3, i.e., a queue-like structure. Similarly,
UP in junction 2 will need 2(L−1)+1 = 3 queued banks for
each of its left activation a1 and its derivative a˙1 memories
– for inputs from n (for which values will be read) to n + 2
(for which values are being computed and written). There also
need to be 2 banks for all δ memories – 1 for reading and the
other for writing. Thus junction pipelining requires multiple
memory banks, but only for layer parameters a, ȧ and δ, not
for weights.4 The number of layer parameters is insignificant
compared to the number of weights for practical networks.
This is why pre-defined sparsity leads to significant storage
savings, as quantified in Table I for the circled FC point vs
the ρnet = 21% point from Fig. 1(c). Specifically, memory
requirements are reduced by 3.9X in this case. Furthermore,
the computational complexity, which is proportional to the
number of weights for a MLP, is reduced by 4.8X. For
this example, these complexity reductions come at a cost of
degrading the classification accuracy from 98.0% to 97.2%.
2 During hardware implementation, a few extra cycles may be needed to
flush the pipeline so that Ci = |Wi| /zi + ci. These are also balanced,
i.e., ci = c ∀ i ∈ {1, · · · , L}, to achieve efficient pipelining. In our initial
implementation [40], for example, c = 2 and the junction cycle is C = 34.
3 Note that BP does not occur in the first junction because there are no δ0
values to be computed.
Fig. 3. Architecture for parallel operations for an intermediate junction i
(i ≠ 1, L) showing the three operations along with associated inputs and
outputs. Natural and interleaved order accesses are shown using solid and
dashed lines, respectively. The a and ȧ memory banks occur as queues, the
δ memory banks as pairs, while there is a single weight memory bank. Figure
modified from our previous conference publication [20, Fig. 3].
TABLE I
HARDWARE ARCHITECTURE TOTAL STORAGE COST COMPARISON FOR
Nnet = (800, 100, 10) FC VS SPARSE WITH doutnet = (20, 10), ρnet = 21%
Parameter   Expression                              Count (FC)   Count (sparse)
a           $\sum_{i=0}^{L-1} (2(L-i)+1) N_i$       4300         4300
ȧ           $\sum_{i=1}^{L-1} (2(L-i)+1) N_i$       300          300
δ           $2 \sum_{i=1}^{L} N_i$                  220          220
b           $\sum_{i=1}^{L} N_i$                    110          110
W           $\sum_{i=1}^{L} N_i d_i^{\mathrm{in}}$  81000        17000
TOTAL       Σ (all above)                           85930        21930
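The storage counts in Table I can be reproduced with the following sketch, which simply evaluates the expressions listed in the table. The function name and the dictionary layout are illustrative assumptions.

```python
def storage_counts(n_net, d_out_net):
    """Number of stored values per parameter type, following the expressions in Table I."""
    L = len(n_net) - 1
    a = sum((2 * (L - i) + 1) * n_net[i] for i in range(0, L))
    a_dot = sum((2 * (L - i) + 1) * n_net[i] for i in range(1, L))
    delta = 2 * sum(n_net[1:])
    b = sum(n_net[1:])
    # |W_i| = N_{i-1} * d_out_i, which equals N_i * d_in_i
    w = sum(n_net[i] * d for i, d in enumerate(d_out_net))
    counts = {"a": a, "a_dot": a_dot, "delta": delta, "b": b, "W": w}
    counts["total"] = sum(counts.values())
    return counts


n_net = (800, 100, 10)
fc = storage_counts(n_net, d_out_net=(100, 10))      # fully connected
sparse = storage_counts(n_net, d_out_net=(20, 10))   # overall density of about 21%
print(fc["total"], sparse["total"])                  # 85930 21930
print(round(fc["total"] / sparse["total"], 1))       # 3.9
print(round(fc["W"] / sparse["W"], 1))               # 4.8
```

The layer parameter counts are identical in both cases; only the weight count changes, which is what drives the 3.9X storage and 4.8X computation reductions.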
B. Memory organization
For the purposes of memory organization, edges are num-
bered sequentially from top to bottom on the right side of the
junction. Other network parameters such as a, a˙ and δ are
numbered according to the neuron numbers in their respective
layer. Consider Fig. 4 as an example, where junction i is
flanked by Ni−1 = 12 left neurons with douti = 2 and Ni = 8
right neurons, leading to |Wi| = 24 and dini = 3. The three
weights connecting to right neuron 0 are numbered 0, 1, 2;
the next three connecting to right neuron 1 are numbered 3,
4, 5, and so on. A particular right neuron connects to some
subset of left neurons of cardinality dini .
4 This is achieved by making the weight memory dual-port, while a and ȧ
are single-ported memories. The δ memories are also dual-ported due to the
exact manner in which we implemented this architecture on FPGA; refer to
[40] for full details.
Fig. 4. An example of processing inside junction i with zi = 4 memories
in the weight and left banks, and zi+1 = 2 memories in the right bank. The
banks are represented as numerical grids, each column is a memory, and the
number in each cell is the number of the edge / left neuron / right neuron
whose parameter value is stored in it. Edges are sequentially numbered on
the right (shown in curly braces). Four weights are read in each of the six
cycles with the first three colored blue, pink and green, respectively. These
represent sweep 0, while the next 3 (using dashed lines) colored brown, red
and purple, respectively, represent sweep 1. Clash-freedom leads to at most
one cell from each memory in each bank being accessed each cycle. Weight
and right memories are accessed in natural order, while left memories are
accessed in interleaved order.
Each type of network parameter is stored in a bank of
memories. The example in Fig. 4 uses zi = 4, i.e., 4 weights
are accessed per cycle. We designed the weight memory bank
to have the minimum number of memories to prevent clashes,
i.e., zi, and their depth equals Ci. Weight memories are read
in natural order – 1 row per cycle (shown in same color).
Right neurons are processed sequentially due to the weight
numbering. The number of right neuron parameters of a
particular type needing to be accessed in a cycle is upper
bounded by $\lceil z_i / d_i^{\mathrm{in}} \rceil$, which leads to $z_{i+1} \geq \lceil z_i / d_i^{\mathrm{in}} \rceil$ in order
to prevent clashes in the right memory bank.5 For FF in Fig. 4
for example, cycles 0 and 1 finish computation of $a_i^{(0)}$ and $a_i^{(1)}$
respectively, while cycle 2 finishes computing both $a_i^{(2)}$ and
$a_i^{(3)}$. For BP or UP, everything remains the same except for the
right memory accesses. Now $\delta_i^{(0)}$ and $\delta_i^{(1)}$ are used in cycle 0,
$\delta_i^{(1)}$ and $\delta_i^{(2)}$ in cycle 1, and $\delta_i^{(2)}$ and $\delta_i^{(3)}$ in cycle 2. Thus the
maximum number of right neuron parameters ever accessed
in a cycle is $\lceil z_i / d_i^{\mathrm{in}} \rceil = 2$.
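The right memory accesses for the Fig. 4 example can be checked with a short sketch. It assumes only the sequential edge numbering described above, so that edge e belongs to right neuron e // d_in and cycle c processes edges c·z to c·z + z − 1; the function names are illustrative.

```python
def right_neurons_per_cycle(num_edges, z, d_in):
    """For each cycle, the right neurons touched by the z edges processed in that cycle."""
    cycles = []
    for start in range(0, num_edges, z):
        touched = sorted({edge // d_in for edge in range(start, start + z)})
        cycles.append(touched)
    return cycles


def neurons_finished_per_cycle(num_edges, z, d_in):
    """For each cycle, the right neurons whose last incoming edge is processed in it (FF)."""
    cycles = []
    for start in range(0, num_edges, z):
        done = [edge // d_in for edge in range(start, start + z)
                if edge % d_in == d_in - 1]
        cycles.append(done)
    return cycles


# Fig. 4 junction: 24 edges, z_i = 4, in-degree 3.
print(right_neurons_per_cycle(24, 4, 3)[:3])     # [[0, 1], [1, 2], [2, 3]]
print(neurons_finished_per_cycle(24, 4, 3)[:3])  # [[0], [1], [2, 3]]
```

The first list corresponds to the δ values read during BP or UP, and the second to the activations completed during FF.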
Since edges are interleaved on the left, in general, the zi
edge processing logic units will need access to zi parameters
of a particular type from layer i − 1. So all the left memory
banks have zi memories, each of depth Di = Ni−1/zi, which
are accessed in interleaved order. For example, after Di cycles,
Ni−1 edges have been processed – i.e., (Di × zi) = Ni−1. We
require that each of these edges be connected to a different
left neuron to eliminate the possibility of duplicate edges.
This completes a sweep, i.e., one complete access of the left
memory bank. Since each left neuron connects to douti edges,
douti sweeps are required to process all the edges, i.e., each left
activation is read douti times in the whole junction cycle. The
reader can verify that Di cycles multiplied by douti sweeps
results in Ci total cycles, i.e., one junction cycle.
5 This does not limit most practical designs (see Appendix B).
C. Clash-free connection patterns
We define a clash as attempting to perform a particular
operation more than once on the same memory at the same
time, which would stall processing.6 The idea of clash-freedom
is to pre-define a pattern of connections and z values such that
no operation in any junction of the NN results in a clash. Sec.
III-B described how z values should be designed to prevent
clashes in the weight and right memory banks.
This subsection analyzes the left memory banks, which are
accessed in interleaved order. Their memory access pattern
should be designed so as to prevent clashes. Additionally,
the following properties are desired for practical clash-free
patterns. Firstly, it should be easy to find a pattern that gives
good performance. Secondly, the logic and storage required to
generate the left memory addresses should be low complexity.
We generate clash-free patterns by initially specifying the
left memory addresses to be accessed in cycle 0 using a
seed vector $\phi_i \in \{0, 1, \cdots, D_i - 1\}^{z_i}$. Subsequent addresses
are cyclically generated. Considering Fig. 4 as an example,
φi = (1, 0, 2, 2). Thus in cycle 0, we access addresses
(1, 0, 2, 2) from memories (M0,M1,M2,M3), i.e., left neu-
rons (4, 1, 10, 11). In cycle 1, the accessed addresses are
(2, 1, 0, 0), and so on. Since Di = 3, cycles 3–5 access the
same left neurons as cycles 0–2.
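The seed vector construction can be sketched as follows for the Fig. 4 junction. The function name is hypothetical, and the sketch assumes that left neuron n is stored in memory n mod zi at address n div zi, which is consistent with the addresses and left neurons quoted above.

```python
from collections import Counter


def clash_free_pattern(n_left, n_right, d_out, z, phi):
    """List (right neuron, left neuron) per edge, in edge-number order.

    Assumes left neuron n is stored in memory n % z at address n // z.
    """
    depth = n_left // z                 # D_i: depth of each left memory
    d_in = n_left * d_out // n_right    # fixed in-degree
    num_cycles = depth * d_out          # C_i: one junction cycle
    pattern = []
    for cycle in range(num_cycles):
        for mem in range(z):
            addr = (phi[mem] + cycle) % depth  # one address per memory, so no clash
            left = addr * z + mem
            edge = cycle * z + mem
            pattern.append((edge // d_in, left))
    return pattern


# Fig. 4 junction with seed vector (1, 0, 2, 2).
pattern = clash_free_pattern(n_left=12, n_right=8, d_out=2, z=4, phi=(1, 0, 2, 2))
print([left for _, left in pattern[:4]])    # [4, 1, 10, 11]
print([left for _, left in pattern[4:8]])   # [8, 5, 2, 3]

out_deg = Counter(left for _, left in pattern)
in_deg = Counter(right for right, _ in pattern)
print(set(out_deg.values()), set(in_deg.values()))  # {2} {3}
```

Every left neuron ends up with out-degree 2 and every right neuron with in-degree 3, so the generated pattern satisfies the structured constraint while each memory is read once per cycle.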
We found that this technique results in a large number
of possible connection patterns, as discussed in Appendix
C. Randomly sampling from this set results in performance
comparable with non-clash-free NNs, as shown in Sec. IV-B.
Finally, our approach only requires storing φi and using zi
incrementers to generate subsequent addresses. This approach
is similar to methods used in modern coding to allow parallel
processing and memory accesses, cf. [28]–[30]. Other tech-
niques to generate clash-free connection patterns are discussed
in Appendix C.
D. Batch size
It is common in training of NNs to use minibatches. For
a batch size of M , the UP operation in (4) is performed
only once for M inputs by using the average over the M
gradients. Our architecture performs an UP for every input
and therefore may be viewed as having batch size one.
However, the processing in our architecture differs from a
typical software implementation with M = 1 due to the
pipelined and parallel operations. Specifically, in our archi-
tecture, FF and BP for the same input use different weights,
as implied by Fig. 2(c). In results not presented here, we
found no performance degradation due to this variation from
the standard backpropagation algorithm. There is considerable
ambiguity in the literature regarding ideal batch sizes [41],
[42], and we found that our current network architecture
performed well in our initial hardware implementation [40].
However, if a more conventional batch size is desired, the
UP logic can be removed from the junction pipeline and the
UP operation performed once every M inputs. This would
eliminate some arithmetic units at the cost of increased storage
for accumulating intermediate values from (4).
6 For single-ported memories, attempting two reads or two writes or a read
and a write in the same cycle is a clash. For simple dual-ported memories
with one port exclusively for reading and the other exclusively for writing, a
read and a write can be performed in the same cycle. Attempting to perform
two reads or two writes in the same cycle is a clash.
Fig. 5. Processing the FC version of the junction from Fig. 4. For clarity, only
the first 12 and last 12 edges (dashed) are shown, corresponding respectively
to right neurons 0 and 7, sweeps 0 and 7, cycles 0–2 and 21–23.
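The minibatch alternative described in Sec. III-D can be sketched as follows. The class name and interface are hypothetical; the sketch only illustrates accumulating the per-input gradient terms of (4) for one trainable parameter and applying their average once every M inputs.

```python
class AccumulatedUpdate:
    """Accumulate per-input gradients of one parameter; update once every M inputs."""

    def __init__(self, value, batch_size, eta):
        self.value = value
        self.batch_size = batch_size
        self.eta = eta
        self._acc = 0.0    # extra storage needed for the running sum
        self._count = 0

    def add_gradient(self, grad):
        self._acc += grad
        self._count += 1
        if self._count == self.batch_size:
            self.value -= self.eta * self._acc / self.batch_size
            self._acc = 0.0
            self._count = 0


bias = AccumulatedUpdate(value=1.0, batch_size=4, eta=0.5)
for delta in (0.25, 0.5, -0.25, 0.5):   # per-input delta values, as in (4a)
    bias.add_gradient(delta)
print(bias.value)                        # 0.875
```

The running sum is the additional storage mentioned above: one accumulator per trainable parameter, in exchange for performing the update arithmetic only once per batch.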
E. Special Case: Processing a FC junction
Fig. 5 shows the FC version of the junction from Fig. 4,
which has 96 edges to be accessed and operated on. This can
be done keeping the same junction cycle Ci = 6 by increasing
zi to 16, i.e., using more hardware. On the other hand, if
hardware resources are limited, one can use the same zi = 4
and pay the price of a longer junction cycle Ci = 24, as shown
in Fig. 5. This demonstrates the flexibility of our architecture.
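The trade-off between degree of parallelism and junction cycle length described above can be checked with simple arithmetic. The following is a minimal sketch, not part of the architecture itself: the function names are hypothetical, and it assumes the junction of Fig. 4 has 12 left and 8 right neurons, as suggested by the 96 FC edges and Ni−1 = 12.

```python
def junction_cycle(num_edges: int, z: int) -> int:
    """Cycles needed to process one junction when z edges are handled per cycle."""
    if num_edges % z != 0:
        raise ValueError("z should divide the number of edges")
    return num_edges // z


def num_sweeps(n_left: int, num_edges: int) -> int:
    """Each sweep reads every left neuron once, i.e. covers n_left edges."""
    return num_edges // n_left


n_left, n_right = 12, 8        # assumed sizes of the junction in Fig. 4
fc_edges = n_left * n_right    # 96 edges when fully connected

for z in (16, 4):
    print(f"z = {z}: junction cycle = {junction_cycle(fc_edges, z)}")
print("sweeps for the FC junction:", num_sweeps(n_left, fc_edges))
```

Under these assumptions the sketch reproduces the junction cycles of 6 and 24 and the 8 sweeps quoted in this subsection.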
Note that FC junctions are clash-free in all practical cases
due to the following reasons. Firstly, the left memory accesses
are in natural order just like the weights, which ensures that
no more than one element is accessed from each memory per
cycle. Secondly, $\lceil z_i/d_i^{in} \rceil = 1$ for all practical cases since
$z_i \le N_{i-1}$, as discussed in Appendix B, and $d_i^{in} = N_{i-1}$
for FC junctions. This means that at most one right neuron
is processed in a cycle7, so clashes will never occur when
accessing the right memory bank.
7In Fig. 5 for example, one right neuron finishes processing every 3rd cycle.
Note that compared to Fig. 4, the weight memories in Fig. 5
are deeper since Ci has increased from 6 to 24. However, the
left layer memories remain the same size since Ni−1 = 12 and
zi = 4 are unchanged, but the left memory bank is accessed
more times since the number of sweeps has increased from 2
to 8. Also note that even if cycle 0 (blue) accesses some other
clash-free subset of left neurons, such as {4, 5, 6, 7} instead of
{0, 1, 2, 3}, the connection pattern would remain unchanged.
This implies that different memory access patterns do not
necessarily lead to different connection patterns, as discussed
further in Appendix C.
IV. OBSERVED TRENDS OF PRE-DEFINED SPARSITY
This section analyzes trends observed when experimenting
with several different datasets via software simulations. We
intend the following four trends to provide guidelines on
designing pre-defined sparse NNs.
1) Hardware-compatible, clash-free, pre-defined sparse pat-
terns perform at least as well as other pre-defined sparse
patterns (i.e., random and structured) (Sec. IV-B).
2) The performance of pre-defined sparsity is better on
datasets that have more inherent redundancy (Sec. IV-C).
3) Junction density should increase to the right: junctions
closer to the output should generally have more connec-
tions than junctions closer to the input (Sec. IV-D).
4) Larger and more sparse NNs are better than smaller
and denser NNs, given the same number of layers and
trainable parameters. Specifically, ‘larger’ refers to more
hidden neurons (Sec. IV-E).
The remainder of this section first describes the datasets we
experimented on, and then examines these trends in detail.
A. Datasets and Experimental Configuration
Unless otherwise noted, the parameters and configurations
listed below were used for all presented results.
a) MNIST handwritten digits: We rasterized each input
image into a single layer of 784 features8, i.e., the permutation-
invariant format. No data augmentation was applied.
b) Reuters RCV1 corpus of newswire articles: The clas-
sification categories are grouped in a tree structure. We used
preprocessing techniques similar to [43] to isolate articles
which fell under a single category at the second level of the
tree. We finally obtained 328,669 articles in 50 categories, split
into 50,000 for validation, 100,000 for test, and the remaining
for training. The original data has a list of token strings for
each story, for example, a story on finance would frequently
contain the token ‘financ’. We chose the most common 2000
tokens and computed counts for each of these in each article.
Each count x was transformed into log(1 + x) to form the
final 2000-dimensional feature vector for each input.
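The feature construction for Reuters described above can be sketched as follows. This is an illustrative sketch only: the function names and toy tokens are hypothetical, and it assumes that "most common" refers to total token count over the corpus, which the text does not specify.

```python
import math
from collections import Counter


def build_vocabulary(articles, vocab_size=2000):
    """Return the vocab_size most common tokens across all articles."""
    totals = Counter()
    for tokens in articles:
        totals.update(tokens)
    return [token for token, _ in totals.most_common(vocab_size)]


def featurize(tokens, vocabulary):
    """Map one article (a list of token strings) to log(1 + count) features."""
    counts = Counter(tokens)
    return [math.log1p(counts[token]) for token in vocabulary]


# Toy example with made-up articles; the real corpus is not reproduced here.
articles = [["financ", "bank", "financ"], ["bank", "sport"]]
vocabulary = build_vocabulary(articles, vocab_size=2)
features = featurize(articles[0], vocabulary)
print(vocabulary, features)
```

Tokens outside the vocabulary are simply ignored, and tokens absent from an article map to log(1 + 0) = 0.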
8On certain occasions we added 16 input features which are always trivially
0 so as to get 800 features for each input. This leads to easier selection of
different sparse network configurations.
c) TIMIT speech corpus: TIMIT is a speech dataset com-
prising approximately 5.4 hours of 16 kHz audio commonly
used in Automatic Speech Recognition (ASR). A modern
ASR system has three major components: (i) preprocessing
and feature extraction, (ii) acoustic model, and (iii) dictionary
and language model. A complete study of an ASR system
is beyond the scope of this work. Instead we focus on the
acoustic model which is typically implemented using a NN.
The input to the acoustic model is feature vectors and the
output is a probability distribution on phonemes (i.e., speech
sounds). For our experiments, we used 25ms speech frames
with 10ms shift, as in [43], and computed a feature vector of 39
Mel-frequency Cepstral Coefficient (MFCC)s for each frame.
We used the complete training set of 818,837 training samples
(462 speakers), 89,319 validation samples (50 speakers), and
212,093 test samples (118 speakers). We used a phoneme set
of size 39 as defined in [44].
d) CIFAR-100 images: Our setup for CIFAR-100 con-
sists of a Convolutional Neural Network (CNN) followed
by a MLP. The CNN has 3 blocks and each block has 2
convolutional layers with window size 3x3 followed by a
max pooling layer of pool size 2x2. The number of filters
for the six convolutional layers is (60,60, 125,125, 250,250).
This results in a total of approximately one million trainable
parameters in the convolutional portion of the network. Batch
normalization is applied before activations. The output from
the 3rd block, after flattening into a vector, has 4000 features.
Dropout is typically applied in the MLP portion, but we
omitted it there since pre-defined sparsity is an alternate form
of parameter reduction. Instead, we found that a dropout probability
of 0.5 applied to the convolutional blocks improved
performance. No data augmentation was applied.
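The parameter and feature counts quoted for the convolutional portion can be reproduced with a short calculation. The sketch below is an assumption-laden check rather than the training code: it assumes 32x32 RGB inputs, 'same' padding, and one bias per filter, and it ignores batch normalization parameters. The function names are hypothetical.

```python
def conv_params(in_channels, out_channels, kernel):
    """Weights plus biases of one 2-D convolutional layer."""
    return in_channels * out_channels * kernel * kernel + out_channels


def cnn_summary(filters, image_size=32, in_channels=3, kernel=3, pool=2):
    """Count conv parameters and flattened output features.

    Assumes two conv layers per block, 'same' padding, and one pooling
    layer at the end of every block.
    """
    params, channels, size = 0, in_channels, image_size
    for index, out_channels in enumerate(filters):
        params += conv_params(channels, out_channels, kernel)
        channels = out_channels
        if index % 2 == 1:      # second conv layer of a block: apply pooling
            size //= pool
    return params, size * size * channels


params, features = cnn_summary((60, 60, 125, 125, 250, 250))
print(params, features)
```

Under these assumptions the count comes to roughly 1.09 million convolutional parameters and 4000 flattened features, in line with the figures stated above.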
For each dataset, we performed classification using one-hot
labels and measured accuracy on the test set as a performance
metric.9 We also calculated the top-5 test set classification
accuracy for CIFAR-100.
We found the optimal training configuration for each FC
setup by doing a grid search using validation performance
as a metric. This resulted in choosing ReLU activations for
all layers except for the final softmax layer. The initialization
proposed by He et al. [45] worked best for the weights; while
for biases, we found that an initial value of 0.1 worked best in
all cases except for Reuters, for which zeroes worked better.
The Adam optimizer [46] was used with all parameters set
to default, except that we set the decay parameter to $10^{-5}$
for best results. We used a batch size of 1024 for TIMIT and
Reuters since the number of training samples is large, and 256
for MNIST and CIFAR.
All experiments were run for 50 epochs of training and
regularization was applied as an L2 penalty to the weights.
To maintain consistency, we kept most hyperparameters the
same when sparsifying the network, but reduced the L2 penalty
9The NN in a complete ASR system would be a ‘soft’ classifier and feed the
phoneme distribution outputs to a decoder to perform ‘hard’ final classification
decisions. Therefore for TIMIT, we computed another performance metric
called Test Prediction Comparison (TPC), measured as KL divergence be-
tween predicted test output probability distributions of sparse vs the respective
FC case. Performance results obtained using TPC were qualitatively very
similar to test accuracy and not shown here.
TABLE II
COMPARISON OF PRE-DEFINED SPARSE METHODS

                                                 Test Accuracy Performance
doutnet             ρnet%   znet                 Clash-free    Structured    Random

MNIST: Nnet = (800, 100, 100, 100, 10), FC test accuracy = 98 ± 0.1
(80, 80, 80, 10)    80.2    (200, 25, 25, 4)     97.9 ± 0.2    97.9 ± 0.2    97.8 ± 0.2
(60, 60, 60, 10)    60.4    (200, 25, 25, 4)     97.6 ± 0.1    97.8 ± 0.1    97.6 ± 0.2
(40, 40, 40, 10)    40.6    (200, 25, 25, 5)     97.5 ± 0.1    97.7          97.6 ± 0.1
(20, 20, 20, 10)    20.8    (200, 25, 25, 10)    97.2 ± 0.2    97.2 ± 0.1    97.1 ± 0.1
(10, 10, 10, 10)    10.9    (200, 25, 25, 25)    96.7 ± 0.1    96.8 ± 0.2    96.7 ± 0.2
(5, 10, 10, 10)     6.9     (100, 25, 25, 25)    96.3 ± 0.1    96.3 ± 0.1    96.2 ± 0.1
(2, 5, 5, 10)       3.6     (80, 25, 25, 50)     95 ± 0.2      95.1 ± 0.1    95 ± 0.3
(1, 2, 2, 10)       2.2     (80, 20, 20, 100)    93.3 ± 0.3    93.1 ± 0.5    92 ± 0.3

Reuters: Nnet = (2000, 50, 50), FC test accuracy = 89.6 ± 0.1
(25, 25)            50      (1000, 25)           89.4 ± 0.1    89.3          89.4
(10, 10)            20      (400, 10)            87 ± 0.1      86.7 ± 0.1    86.5 ± 0.1
(5, 5)              10      (200, 5)             78.5 ± 0.5    78.2 ± 0.7    77.5 ± 0.6
(2, 2)              4       (80, 2)              53.3 ± 1.8    51.2 ± 1.7    46.8 ± 2.9
(1, 1)              2       (40, 1)              28.4 ± 2.4    28.7 ± 2.3    28 ± 1.9

TIMIT: Nnet = (39, 390, 39), FC test accuracy = 43.2 ± 0.2
(270, 27)           69.2    (13, 13)             43 ± 0.1      43            43 ± 0.1
(180, 18)           46.2    (13, 13)             42.7 ± 0.1    42.8 ± 0.1    42.9 ± 0.1
(90, 9)             23.1    (13, 13)             42.1 ± 0.1    42.5 ± 0.1    42.4 ± 0.1
(60, 6)             15.4    (13, 13)             41.5 ± 0.1    41.8 ± 0.2    41.9 ± 0.1
(30, 3)             7.7     (13, 13)             40.5 ± 0.2    40.1 ± 0.2    39.4 ± 0.8

CIFAR-100 a: Nnet = (4000, 500, 100), FC top-5 test accuracy = 87.1 ± 0.6
(100, 100)          22      (2000, 250)          87.5 ± 0.2    87.7 ± 0.2    87.4 ± 0.3
(29, 29)            6.4     (2000, 250)          86.8 ± 0.3    87.2 ± 0.5    87.1 ± 0.2
(12, 12)            2.6     (400, 50)            86.3 ± 0.2    86.5 ± 0.4    86.6 ± 0.4
(5, 5)              1.1     (400, 50)            85.3 ± 0.5    85.5 ± 0.5    85.7 ± 0.3
(2, 2)              0.4     (80, 10)             84.1 ± 0.5    84.3 ± 0.3    83.8 ± 0.3
(1, 1)              0.2     (80, 10)             83 ± 0.5      83.3 ± 0.4    81.7 ± 0.7

a For CIFAR-100, given values of Nnet, doutnet, znet and ρnet are just for
the MLP portion, which follows a CNN as described in Sec. IV-A to form the
complete net. Reported values are top-5 test accuracies obtained from training
on the complete net.
coefficient with increasing sparsity. This was done because
sparse NNs have fewer trainable parameters and are less prone
to overfitting. We ran each experiment at least five times to
average out randomness and we show the 90% Confidence
Interval (CI)s for each metric as shaded regions (this also holds
for the results in Fig. 1(c,h)). In addition to the results shown,
we developed a data set of Morse code symbol sequences
and investigated pre-defined sparse NNs. While these results
are excluded for brevity, they are consistent with the trends
described in this Section, and can be found in [47].
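The text above does not state how the 90% CIs were computed. As one plausible approach, the sketch below computes a Student's t interval over repeated runs. The accuracy values are hypothetical, the function name is an assumption, and the critical values are standard two-sided 90% table entries.

```python
import math
import statistics

# Two-sided 90% critical values of Student's t, indexed by degrees of freedom.
T_CRIT_90 = {4: 2.132, 5: 2.015, 6: 1.943, 7: 1.895, 8: 1.860, 9: 1.833}


def confidence_interval_90(samples):
    """Return (mean, half_width) of a 90% t-based confidence interval."""
    n = len(samples)
    dof = n - 1
    if dof not in T_CRIT_90:
        raise ValueError("add the critical value for this sample size")
    mean = statistics.mean(samples)
    half_width = T_CRIT_90[dof] * statistics.stdev(samples) / math.sqrt(n)
    return mean, half_width


# Hypothetical accuracies from five runs of one configuration.
runs = [97.8, 97.9, 98.0, 97.7, 98.1]
mean, half_width = confidence_interval_90(runs)
print(f"{mean:.2f} +/- {half_width:.2f}")
```

With only five runs the t critical value is noticeably larger than the normal value of about 1.645, so a normal approximation would understate the interval width.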
B. Comparison of Pre-Defined Sparse Methods
Table II shows performance on different datasets for three
methods of pre-defined sparsity: a) the most restrictive and
hardware-friendly clash-freedom, b) structured, and c) random.
For the clash-free case, we experimented with different znet
settings to simulate different hardware environments:
• Reuters: One junction cycle is 50 cycles for all the differ-
ent densities. This is because we scale znet accordingly,
i.e., a more powerful hardware device is used for each
NN as ρnet increases.
• CIFAR-100 and MNIST: These simulate cases where
hardware choice is limited, such as a high-end, a mid-
range and a low-end device being available. Thus three
different znet values are used for CIFAR-100 depending
on ρnet.
• TIMIT: We keep znet constant for different densities.
Junction cycle length varies from 90 cycles for ρnet =
7.69% to 810 for ρnet = 69.23%. This shows that when
limited to a single low-end hardware device, denser NNs
can still be processed with the same znet at the cost of a longer junction cycle.
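The densities and junction cycle lengths quoted in the list above follow directly from the rows of Table II. The following minimal sketch recomputes them; the function names are hypothetical, and ρnet is taken to be the number of edges present divided by the number of edges in the corresponding FC network.

```python
def overall_density(n_net, d_out):
    """rho_net: edges present divided by edges in the fully connected MLP."""
    sparse = sum(n_left * d for n_left, d in zip(n_net[:-1], d_out))
    full = sum(n_left * n_right for n_left, n_right in zip(n_net[:-1], n_net[1:]))
    return sparse / full


def junction_cycles(n_net, d_out, z_net):
    """Cycles per junction: number of edges divided by degree of parallelism."""
    return [n_left * d // z for n_left, d, z in zip(n_net[:-1], d_out, z_net)]


# Rows taken from Table II.
reuters_density = overall_density((2000, 50, 50), (25, 25))
reuters_cycles = junction_cycles((2000, 50, 50), (25, 25), (1000, 25))
timit_sparse = junction_cycles((39, 390, 39), (30, 3), (13, 13))
timit_dense = junction_cycles((39, 390, 39), (270, 27), (13, 13))
print(reuters_density, reuters_cycles, timit_sparse, timit_dense)
```

For Reuters, scaling znet with density keeps both junction cycles at 50, while for TIMIT the fixed znet = (13, 13) stretches the junction cycle from 90 to 810 as density grows.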
Table II confirms that hardware-friendly clash-free pre-
defined sparse architectures do not lead to any statistically
significant performance degradation. We also observed that
random pre-defined sparsity performs poorly for very low
density networks, as shown by the blue values. This is pos-
sibly because there is non-negligible probability of neurons
getting completely disconnected, leading to irrecoverable loss
of information.
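The disconnection argument can be made quantitative under a simple model. The sketch below assumes that random pre-defined sparsity draws a fixed number of edges uniformly without replacement from all possible edges of a junction. This sampling model and the function name are assumptions for illustration, since the exact random construction is not specified here.

```python
def prob_left_neuron_disconnected(n_left, n_right, num_edges):
    """Probability that one given left neuron receives none of num_edges edges
    drawn uniformly without replacement from the n_left * n_right candidates."""
    total = n_left * n_right
    if num_edges > total - n_right:
        return 0.0              # too many edges to avoid the neuron entirely
    prob = 1.0
    for k in range(num_edges):
        # The (k+1)-th edge must avoid the n_right edges of this neuron.
        prob *= (total - n_right - k) / (total - k)
    return prob


# Junction 1 of the sparsest Reuters network in Table II: d_out = 1, so the
# junction has 2000 edges out of 2000 * 50 candidates.
p = prob_left_neuron_disconnected(2000, 50, 2000)
print(f"P(input neuron disconnected) = {p:.3f}")
```

Under this model, more than a third of the Reuters input features would be expected to lose all connections at the lowest density, whereas structured and clash-free patterns give every left neuron exactly dout edges.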
C. Dataset Redundancy
Many machine learning datasets have considerable redun-
dancy in their input features. For example, one may not
need information from all of the ∼800 input features of MNIST
to infer the correct image class. We hypothesize that pre-
defined sparsity takes advantage of this redundancy, and will
be less effective when the redundancy is reduced. To test this,
we changed the feature vector for each dataset as follows.
For MNIST, Principal Component Analysis (PCA) was used
to reduce the feature count to the least redundant 200. For
Reuters, the number of most frequent tokens considered as
features was reduced from 2000 to 400. For TIMIT, we both
reduced and increased the number of MFCCs by 3X to 13 and
117, respectively. Note that the latter increases redundancy.
For CIFAR-100, a source of redundancy is the depth of
the CNN, which extracts features and discriminates between
classes before the MLP performs final classification. In other
words, the CNN eases the burden of the MLP. So a way to
reduce redundancy and increase the classification burden of the
MLP is to lessen the effectiveness of the CNN by reducing
its depth. Accordingly, we used a single convolutional layer
with 250 filters of window size 5×5 followed by a 8×8 max
pooling layer. This results in the same number of features,
4000, at the input of the MLP as the original network, but has
reduced redundancy for the MLP.
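That the shallow network yields the same 4000 MLP input features can be verified with a short calculation. This sketch assumes 32x32 RGB inputs and 'same' padding, neither of which is stated explicitly; the function name is hypothetical.

```python
def pooled_features(image_size, filters, pool):
    """Flattened feature count after a 'same'-padded conv layer and max pooling."""
    side = image_size // pool
    return side * side * filters


# Reduced-redundancy net: one 5x5 conv layer with 250 filters, then 8x8 pooling.
reduced = pooled_features(32, 250, 8)

# Original net: the third block sees 8x8 maps (after two 2x2 poolings).
original = pooled_features(8, 250, 2)

# Conv parameters of the reduced net, assuming 3 input channels and biases.
reduced_params = 3 * 250 * 5 * 5 + 250
print(reduced, original, reduced_params)
```

Under these assumptions the shallow network has about 19 thousand convolutional parameters, compared with roughly one million in the original, which is consistent with it doing much less feature extraction before the MLP.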
Classification performance results are shown in Fig. 6 as a
function of ρnet. For MNIST and CIFAR-100, the performance
degrades more sharply with reducing ρnet for the nets using
the reduced redundancy datasets. To explore this further,
we recreated the histograms from Fig. 1 for the reduced
redundancy datasets, i.e., a FC NN with Nnet = (200, 100, 10)
training on MNIST after PCA. We observed a wider spread
of weight values, implying less opportunity for sparsification
(i.e., fewer weights were close to zero). Similar trends are
less discernible for Reuters and TIMIT; however, reducing
redundancy led to worse performance overall.
The results in Fig. 6 further demonstrate the effectiveness
of pre-defined sparsity in greatly reducing network complexity
with negligible performance degradation. For example, even
the reduced redundancy problems perform well when oper-
ating with half the number of connections. For CIFAR in
particular, FC performs worse than an overall MLP density of
around 20%. Thus, in addition to reducing complexity, struc-
tured pre-defined sparsity may be viewed as an alternative to
Fig. 6. Comparison of classification accuracy as a function of ρnet for
different versions of datasets – original, reduced in redundancy by reducing
feature space (MNIST, Reuters, TIMIT) or performing less processing prior
to the MLP (CIFAR-100), and increasing redundancy by enlarging feature
space (TIMIT).
Fig. 7. Comparison of classification accuracy as a function of ρnet for
different ρL, where L = 2. Black-circled points show the effects of ρ2 when
ρnet is the same. Nnet values are (800, 100, 10) for MNIST, (2000, 50, 50)
for Reuters, and (4000, 500, 100) for the MLP in CIFAR-100.
dropout in the MLP for the purpose of improving classification
performance.
D. Individual junction densities
The weight histograms in Fig. 1 indicate that later junctions,
particularly junction L closest to the output, have a wide
spread of weight values. This suggests that a good strategy
for reducing ρnet would be to use lower densities in earlier
junctions – i.e., ρ1 < ρL. This is demonstrated in Fig. 7 for the
cases of MNIST, CIFAR-100 and Reuters, each with L = 2
junctions in their MLPs. Each curve in each subfigure is for
a fixed ρ2, i.e., reducing ρnet across a curve is done solely
Fig. 8. Comparison of classification accuracy as a function of ρnet for:
(a) TIMIT with 39 MFCCs for the two cases where one junction is always
sparser than the other and vice-versa. Black-circled points show how reducing
ρ1 degrades performance to a greater extent. (b) TIMIT with 13 MFCCs for
different ρ1. (c,d) TIMIT with 117 MFCCs, and Reuters reduced to 400
tokens, for different ρ2. Nnet values are (a) (39, 390, 39), (b) (13, 390, 39),
(c) (117, 390, 39), (d) (400, 50, 50).
by reducing ρ1. For a fixed ρnet, the performance improves as
ρ2 increases. For example, the circled points in Reuters both
have ρnet = 4%, but the starred point with ρ2 = 100% has
approximately 40% better test accuracy than the pentagonal
point with ρ2 = 2%. The trend clearly holds for MNIST and
is also discernible for CIFAR-100.
We further observed that this trend (i.e., ρi+1 > ρi should
hold) is related to the redundancy inherent in the dataset and
may not hold for datasets with very low levels of redundancy.
To explore this, results analogous to those in Fig. 7 are
presented in Fig. 8 for TIMIT, but with varying sized MFCC
feature vectors – i.e., datasets corresponding to larger feature
vectors will contain more redundancy. The results in Fig. 8(c)
are for 117 dimensional MFCCs and are consistent with the
trend in Fig. 7. However, for a MFCC dimension of 13, this
trend actually reverses – i.e., junction 1 should have higher
density. This is shown in Fig. 8(b), where each curve is for a
fixed ρ1. This reversed trend is also observed for the case
of 39 dimensional feature vectors, considered in Fig. 8(a),
where Nnet = (39, 390, 39). Due to this symmetric neuronal
configuration, for each value of ρnet on the x-axis in Fig.
8(a), the two curves have complementary values of ρ1 and
ρ2 (ρ1 ≠ ρ2) – e.g., the two curves at ρnet = 7.69% have
(ρ1, ρ2) pairs of (2.56%, 12.82%) and (12.82%, 2.56%). We
observe that the curve for ρ1 < ρ2 is generally worse than
the curve for ρ2 < ρ1, which indicates that junction 1 should
have higher density in this case.
Fig. 8(d) depicts the results for Reuters with the feature
vector size reduced to 400 tokens. While junction 2 is still
Fig. 9. Comparing ‘large and sparse’ to ‘small and dense’ networks for
MNIST with 784 features, with (a) Nnet = (784, x, 10) (on the left), and
(b) Nnet = (784, x, x, x, 10) (on the right). Solid curves (with the shaded
CIs around them) are for constant x, black dashed curves with same marker
are for same number of trainable parameters. The final junction is always
FC. Intermediate junctions for the L = 4 case have dout values similar to
junction 1.
more important (as in Fig. 7(c) for the original Reuters
dataset), notice the circled star-point at the very left of the
ρ2 = 100% curve. This point has very low ρ1. Unlike Fig.
7(c), it crosses below the other curves, indicating that it is more
important to have higher density in the first junction with this
less redundant set of features. We observed a similar, but less
prominent, trend in MNIST PCA when the feature dimension
was reduced to 200.
In summary, if an individual junction density falls below
a certain value, referred to as the critical junction density, it
will adversely affect performance regardless of the density of
other junctions. This explains why some of the curves cross
in Fig. 8. The critical junction density is much smaller for
earlier junctions than for later junctions in most datasets with
sufficient redundancy. However, the critical density for earlier
junctions increases for datasets with low redundancy.
E. ‘Large and sparse’ vs ‘small and dense’ networks
We observed that when keeping the total number of trainable
parameters the same, sparser NNs with larger hidden layers
(i.e., more neurons) generally performed better than denser
networks with smaller hidden layers. This is true as long as
the larger NN is not so sparse that individual junction densities
fall below the critical density, as explained in Sec. IV-D. While
the critical density is problem-dependent, it is usually low
enough to obtain significant complexity savings above it. Thus,
‘large and sparse’ is better than ‘small and dense’ for many
practical cases, including NNs with more than one hidden
layer (i.e., L > 2).
Fig. 9 shows this for networks having one and three hid-
den layers trained on MNIST. For the three layer network,
all hidden layers have the same number of neurons. Each
solid curve shows classification performance vs ρnet for a
particular Nnet, while the black dashed curves with identical
markers are configurations that have approximately the same
number of trainable parameters. As an example, the points
with circular markers (with a big blue ellipse around them)
in Fig. 9(b) all have the same number of trainable parameters
and indicate that the larger, more sparse NNs perform better.
Fig. 10. Comparing ‘large and sparse’ to ‘small and dense’ networks for
Reuters with 2000 tokens, with Nnet = (2000, x, 50). The x-axis is split
into higher values on the left (a), and lower values on the right in log scale
(b). Solid curves (with the shaded CIs around them) are for constant x, black
dashed curves with same marker are for same number of trainable parameters.
Junction 1 is sparsified first until its number of total weights is approximately
equal to that of junction 2, then both are sparsified equally.
Fig. 11. Comparing ‘large and sparse’ to ‘small and dense’ networks for (a)
TIMIT with 39 MFCCs and Nnet = (39, x, x, x, x, 39) (on the left), and (b)
CIFAR-100 with the deep 6-layer CNN and MLP Nnet = (4000, x, 100)
with log scale for the x-axis (on the right). Solid curves (with the shaded CIs
around them) are for constant x, black dashed curves with same marker are for
same number of trainable parameters (in the MLP portion only for CIFAR).
Since TIMIT has symmetric junctions, we tried to keep input and output
junction densities as close as possible and adjusted intermediate junction
densities to get the desired ρnet. CIFAR-100 is sparsified in a way similar to
Reuters in Fig. 10.
Specifically, the network with Nnet = (784, 112, 112, 112, 10)
and doutnet = (10, 10, 10, 10) corresponding to ρnet = 9.82%
performs significantly better than the FC network with Nnet =
(784, 14, 14, 14, 10), and other smaller and denser networks,
despite each having 11500 trainable parameters. Increasing the
network size further to Nnet = (784, 224, 224, 224, 10), and
reducing ρnet to 4% to fix the number of trainable parameters
at 11500, leads to performance degradation. This is because
this ρnet was achieved by setting ρ2 = ρ3 = 2.68%, which
appears to be below the critical density.
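The parameter budgets compared above can be recomputed as follows. The sketch assumes that trainable parameters comprise one weight per edge plus one bias per non-input neuron, which is an assumption about the counting convention; the function name is hypothetical.

```python
def trainable_parameters(n_net, d_out):
    """Weights (one per edge) plus one bias per non-input neuron."""
    weights = sum(n_left * d for n_left, d in zip(n_net[:-1], d_out))
    biases = sum(n_net[1:])
    return weights + biases


# 'Large and sparse' configuration quoted in the text.
large_sparse = trainable_parameters((784, 112, 112, 112, 10), (10, 10, 10, 10))

# 'Small and dense' FC configuration: d_out equals the right layer size.
small_dense = trainable_parameters((784, 14, 14, 14, 10), (14, 14, 14, 10))
print(large_sparse, small_dense)
```

Both totals land within about half a percent of the 11500 figure quoted above, so the two networks are compared at an essentially equal parameter budget.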
Fig. 10 summarizes the analogous experiment on Reuters
with similar conclusions. Both subfigures are for the same
results with the x-axis split into higher and lower density range
(on log scale), to show more detail. Observe that the trend of
‘large and sparse’ being better than ‘small and dense’ holds
for subfigure (a), but reverses for (b) since densities are very
low (the black dashed curves have positive slope instead of
negative). This is due to the critical density effect.
Fig. 11(a) shows the result for the same experiment on
Fig. 12. Comparison of classification accuracy as a function of ρnet for differ-
ent sparse methods on (a) MNIST with Nnet = (800, 100, 10), (b) Reuters
with Nnet = (2000, 50, 50), and (c) TIMIT with Nnet = (39, 390, 39).
We set the overall density ρnet and all individual junction densities ρi to be
approximately the same across different sparse methods.
TIMIT with four hidden layers10. The trend is less clearly
discernible, but it exists. Notice how the black dashed curves
have negative slopes at appreciable levels of ρnet, indicating
‘large and sparse’ being better than ‘small and dense’, but high
positive slopes at low ρnet, indicating the rapid degradation in
performance as density is reduced beyond the critical density.
This is exacerbated by the fact that TIMIT with 39 MFCCs is
a dataset with low redundancy, so the effects of very low ρnet
are better observed.
Fig. 11(b) for the MLP portion of CIFAR-100 shows results
similar to those for TIMIT, but on a log x-scale for more clarity. As
noted in Sec. IV-C, the best performance for a given Nnet
occurs at an overall density less than 100%. It appears that for
any Nnet for CIFAR-100, peak performance occurs at around
10–20% overall MLP density. In experiments not shown here,
we obtained similar results for the reduced redundancy net
with a single convolutional layer.
V. COMPARISON TO OTHER SPARSE NN METHODS
Numerical results in Sec. IV showed that hardware-
compatible clash-free connection patterns performed as well as
structured and random pre-defined sparse connections. In this
section, we compare clash-free patterns against two sparsity
approaches that are less constrained than the structured pre-
defined sparsity considered in Sec. IV. In particular, both
approaches remove the constraint of regular degree – i.e., these
approaches yield sparse NNs that have varying $d_i^{out}$ and $d_i^{in}$
selected to optimize classification performance.
10We also performed experiments on TIMIT with one hidden layer (L = 2)
and Reuters with 2 hidden layers (L = 3). Results were similar to those
shown, so are not shown for brevity’s sake.
A. Attention-based Preprocessed Sparsity
Previous works [48], [49] have applied the concept of attention
to object recognition and image captioning to achieve
better performance with fewer parameters and less computation.
We simplify this idea by using the variance of each input
feature as its attention and setting the out-degree of the neurons
of the input layer based on this value. Specifically, the feature
variances are quantized into three levels, and input neurons
with higher attention are assigned more connections than those
with lower attention. For the neurons in later layers, we use
uniform out-degree and in-degree.
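The variance-based assignment of out-degrees can be sketched as below. The quantization rule (equal-sized groups by variance rank), the three out-degree values, and the function name are all illustrative assumptions; the text only states that variances are quantized into three levels.

```python
import statistics


def assign_out_degrees(feature_columns, degrees=(2, 5, 10)):
    """Give each input feature an out-degree based on its variance.

    feature_columns[j] holds the values of feature j over the training set.
    Features are ranked by variance and split into three equal-sized levels;
    both this rule and the default degrees are illustrative assumptions.
    """
    variances = [statistics.pvariance(col) for col in feature_columns]
    order = sorted(range(len(variances)), key=lambda j: variances[j])
    out_degrees = [0] * len(variances)
    for rank, j in enumerate(order):
        level = min(3 * rank // len(order), 2)
        out_degrees[j] = degrees[level]
    return out_degrees


# Toy data: three features with increasing variance.
columns = [[0, 0, 0, 0], [0, 1, 0, 1], [0, 5, 0, 5]]
print(assign_out_degrees(columns))
```

In practice the three degree values would need to be chosen so that the resulting junction density matches the target, a step this sketch does not attempt.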
B. Learning Structured Sparsity during Training
While the method in Sec. V-A obtains a non-uniform neuron
out-degree for the first layer, it only considers the properties
of the dataset and not the learning process. We also compared
against the method of Learning Structured Sparsity (LSS)
which learns a good sparse connection pattern during training.
This method was proposed in [14] and prunes the connections
during training by using a sparse-promoting penalty function
as part of the objective function. Example penalty functions
include L1 and L1/L2 used in Lasso [50] and group-Lasso
[51], respectively. During training, the optimizer minimizes
a balancing objective comprising the loss function l(·)11, the
regularizer r(·), and a sparse-promoting penalty function p(·),
$$\min_{\{W_i, b_i\}_{i=1}^{L}} \; l\left(\{W_i, b_i\}_{i=1}^{L}\right) + \lambda\, r\left(\{W_i\}_{i=1}^{L}\right) + \sum_{i=1}^{L} \gamma_i\, p(W_i) \qquad (5)$$
where the penalty coefficients $\{\gamma_i\}_{i=1}^{L}$ control the density of
each junction. Increasing γi decreases ρi; however, obtaining
a specific value of ρi requires experimental tuning of γi.
In the results presented in this section, we used L1 as the
element-wise sparse-promoting penalty function and L2 as the
regularizer. Note that, in contrast to the attention-based method
and the structured pre-defined sparsity approach, LSS is not
a pre-defined sparsity method. Instead training in LSS begins
with a FC network, which means that training complexity is
similar to that of a FC NN. At the end of the LSS training
process, weights with absolute value below a threshold are set
to zero to achieve the target density.
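The element-wise L1 penalty and the final thresholding step can be illustrated with plain Python lists. This is a minimal sketch under stated assumptions: the function names are hypothetical, the weights are made up, and the threshold is chosen here as the magnitude of the k-th largest weight, which is one simple way of reaching a target density.

```python
def l1_penalty(weights):
    """Element-wise L1 penalty p(W) for one junction, given a flat weight list."""
    return sum(abs(w) for w in weights)


def prune_to_density(weights, target_density):
    """Zero all but the largest-magnitude weights so that roughly
    target_density of them remain (ties at the threshold are kept)."""
    keep = round(target_density * len(weights))
    if keep == 0:
        return [0.0] * len(weights)
    threshold = sorted((abs(w) for w in weights), reverse=True)[keep - 1]
    return [w if abs(w) >= threshold else 0.0 for w in weights]


# Made-up weights for one small junction after training.
weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02, 0.0, -0.2]
pruned = prune_to_density(weights, 0.5)
print(l1_penalty(weights), pruned)
```

Because the pattern of surviving weights is only known after training on the FC network, this procedure offers no training-time savings, in contrast to pre-defined sparsity.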
C. Performance comparison
Fig. 12 compares performance versus ρnet of different
sparse NNs on MNIST, Reuters, and TIMIT. The individual
density of each junction with the attention-based preprocessed
sparse method is set to be identical to the density of each
junction using clash-free pre-defined sparse method. However,
the density of the nets using the LSS method can be tuned only
with the penalty coefficients. We tuned these to approximately
match the density of the other methods.12
11Here the loss function is written as a function of all of the trainable
parameters in the network, as opposed to the output layer activations and
ground truth labels as done in Sec. II-A. This is to emphasize that the loss
function can promote sparsity by driving some edge weights to zero.
12This is why ρnet values of the green curves do not perfectly align with
the pre-defined sparsity curves.
The LSS method performs best among all sparse methods,
which is to be expected as it is the least constrained and also
discovers a good sparse connection pattern during training.
However, the performance with clash-free pre-defined spar-
sity is near that of the attention-based and LSS methods –
i.e., within 2% in terms of test accuracy at ρnet = 20%. We
conclude that even though the clash-free patterns are highly
structured and pre-defined, there is no significant performance
degradation when compared to advanced methods for pro-
ducing sparse models by exploiting specific properties of the
dataset or learning sparse patterns during training.
VI. CONCLUSIONS AND FUTURE WORK
In this work we proposed a new technique for complexity
reduction of neural networks – pre-defined sparsity – in
which a fixed sparse connection pattern is enforced prior to
training and held fixed during both training and inference.
We presented a hardware architecture suited to leverage the
benefits of structured pre-defined sparsity, capable of parallel
and pipelined processing. The architecture can be used for
both training and inference modes, and supports networks of
arbitrary density, including conventional fully-connected ones.
Flexibility is afforded by the degree of parallelism znet, which
trades hardware complexity for speed. Simple methods for
clash-free memory access were presented, and these methods
were shown to achieve performance on par with the best known
methods for obtaining sparse MLPs.
Using extensive numerical experiments, we identified trends
which help in designing pre-defined sparse networks. Firstly, it
is better to allocate connections in a structured manner rather
than randomly. Secondly, for most datasets with high redun-
dancy, earlier junctions can be made more sparse. Thirdly, it
is better to have more neurons in the hidden layers, and then
sparsify aggressively to keep the number of edges low and
reduce complexity.
As motivated in the Introduction, the rapidly growing com-
plexity associated with modern NNs is a major challenge.
Pre-defined sparsity is a simple method to help address this
challenge, as is acceleration with custom hardware. Interesting
areas for future research include analytical approaches to
justify the trends observed in this work and improving our
initial hardware implementation in [40]. It is also interesting to
consider extending the methods introduced herein to convolu-
tional layers and recurrent architectures. Finally, truly speeding
the training process by orders of magnitude would allow more
extensive search over NN architectures and therefore a better
understanding of the largely empirical process of NN design.
APPENDIX A
STRUCTURED PRE-DEFINED SPARSITY CONSTRAINTS
In our structured pre-defined sparse network, ρi, the density
of junction i, cannot be arbitrary, since $\rho_i = d_i^{out}/N_i = d_i^{in}/N_{i-1}$,
where $d_i^{out}$ and $d_i^{in}$ are natural numbers satisfying
the equation $N_{i-1} d_i^{out} = N_i d_i^{in}$. Therefore, the number of possible
ρi values is the same as the number of $(d_i^{out}, d_i^{in})$ values
satisfying the structured pre-defined sparsity constraints:
$$d_i^{out} = \frac{N_i d_i^{in}}{N_{i-1}}, \quad d_i^{in} \le N_{i-1}, \quad d_i^{out}, d_i^{in} \in \mathbb{N} \qquad (6)$$
where $\mathbb{N}$ denotes the set of natural numbers.
The smallest value of $d_i^{in}$ which satisfies $d_i^{out} \in \mathbb{N}$ is
$N_{i-1}/\gcd(N_{i-1}, N_i)$, and other values are its integer multiples.
Since $d_i^{in}$ is upper bounded by $N_{i-1}$, the total number of
possible $(d_i^{out}, d_i^{in})$ is $\gcd(N_{i-1}, N_i)$. Thus, the set of possible
ρi is
$$\left\{ \rho_i \in (0, 1] \;\middle|\; \rho_i = \frac{k}{\gcd(N_{i-1}, N_i)},\ k \in \mathbb{N} \right\}. \qquad (7)$$
As a concrete example, consider a NN with Nnet =
(117, 390, 13). The number of possible densities of each
junction is determined by gcd(117, 390) = 39 and
gcd(390, 13) = 13, respectively. Therefore, the sets of junction densities
are
$$\rho_1 \in \left\{ \frac{1}{39}, \frac{2}{39}, \cdots, \frac{39}{39} \right\}, \quad \rho_2 \in \left\{ \frac{1}{13}, \frac{2}{13}, \cdots, \frac{13}{13} \right\}. \qquad (8)$$
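The set of admissible densities in (7) is easy to enumerate programmatically. The following sketch uses exact rational arithmetic to avoid rounding issues; the function names are hypothetical.

```python
from fractions import Fraction
from math import gcd


def allowed_densities(n_left, n_right):
    """All junction densities k / gcd(n_left, n_right), k = 1..gcd, as in (7)."""
    g = gcd(n_left, n_right)
    return [Fraction(k, g) for k in range(1, g + 1)]


def degrees_for_density(n_left, n_right, rho):
    """Return (d_out, d_in) for an allowed density rho given as a Fraction."""
    d_out, d_in = rho * n_right, rho * n_left
    if d_out.denominator != 1 or d_in.denominator != 1:
        raise ValueError("density is not achievable with integer degrees")
    return int(d_out), int(d_in)


rho_1 = allowed_densities(117, 390)   # junction 1 of Nnet = (117, 390, 13)
rho_2 = allowed_densities(390, 13)    # junction 2
print(len(rho_1), len(rho_2), degrees_for_density(117, 390, rho_1[0]))
```

For the lowest density of junction 1, ρ1 = 1/39, this gives an out-degree of 10 and an in-degree of 3, which satisfies the edge-count equation since 117 × 10 = 390 × 3.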
APPENDIX B
HARDWARE ARCHITECTURE CONSTRAINTS
The depth of left memories in our hardware architecture is
Di = Ni−1/zi. Thus Ni−1 should preferably be an integral
multiple of zi. This is not a burdening constraint since the
choice of zi is independent of network parameters and depends
on the capacity of the device. In the unusual case that this
constraint cannot be met, the extra cells in memories can be
filled with dummy values such as 0.
There are also 2 conditions placed on the z values to
eliminate stalls in processing: for all layers i ∈ {1, · · · , L}, (i)
|Wi| /zi = C, and (ii) zi+1 ≥
⌈
zi/d
in
i
⌉
. Using the definitions
from Sec. II-A, (i) is equivalent to zi+1 = zidouti+1/d
in
i . Then,
(ii) can be equivalently written as
douti+1 ≥
dini
zi
⌈
zi
dini
⌉
(9)
which needs to be satisfied $\forall\, i \in \{1, \cdots, L-1\}$. In practice, it is desirable to design $z_i/d_i^{in}$ to be an integer so that an integral number of right neurons finish processing every cycle. This simplifies hardware implementation by eliminating the need for additional storage, for example, of the intermediate activation values during FF. In this case, (9) reduces to $d_{i+1}^{out} \ge 1$, which is always true.
For non-integral $z_i/d_i^{in}$, there are two cases. If $z_i > d_i^{in}$, the right-hand side of (9) lies strictly between 1 and 2, so (9) reduces to $d_{i+1}^{out} \ge 2$. On the other hand, if $z_i < d_i^{in}$, the right-hand side of (9) equals $d_i^{in}/z_i$, which has no upper bound. In general, note that (9) becomes a burdensome constraint only if $d_i^{in}$ is large, and $d_{i+1}^{out}$ and $z_i$ are both desired to be small. This corresponds to earlier junctions being denser than later ones, which is typically not desirable according to the observations in Sec. IV-D, or to very limited hardware resources. We thus conclude that (9) is not a limiting constraint in most practical cases.
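The three cases of (9) discussed above can be checked numerically. The following is a minimal sketch, assuming only the relations stated in this appendix; the helper names `min_next_out_degree` and `next_parallelism` are hypothetical and introduced purely for illustration.

```python
from fractions import Fraction
from math import ceil


def min_next_out_degree(z_i, d_in_i):
    """Smallest natural d_out_{i+1} that satisfies constraint (9)."""
    ratio = Fraction(z_i, d_in_i)
    bound = ceil(ratio) / ratio  # right-hand side of (9), kept as an exact rational
    return max(1, ceil(bound))


def next_parallelism(z_i, d_in_i, d_out_next):
    """z_{i+1} implied by condition (i): z_{i+1} = z_i * d_out_{i+1} / d_in_i."""
    return Fraction(z_i * d_out_next, d_in_i)


# Integral z_i / d_in_i: any out-degree >= 1 works
case_integral = min_next_out_degree(4, 2)
# Non-integral with z_i > d_in_i: out-degree must be at least 2
case_z_larger = min_next_out_degree(5, 2)
# z_i < d_in_i: the bound equals d_in_i / z_i and can grow large
case_z_smaller = min_next_out_degree(2, 8)
```

With these inputs the three cases evaluate to 1, 2 and 4 respectively, matching the discussion: only the last case, with a large in-degree and small $z_i$, imposes a noticeable lower bound on the next junction's out-degree.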
APPENDIX C
CLASH-FREE PATTERNS
Specifying $N_{i-1}$, $N_i$, $d_i^{in}$ and $z_i$ for junction $i$ in a clash-free structured pre-defined sparse NN does not uniquely define a connection pattern (unless it is FC). This section discusses the number of possible left memory access patterns $S_{M_i}$ for such a junction $i$. Note that the total number of possible memory access patterns for the complete NN is $S_M = \prod_{i=1}^{L} S_{M_i}$.

Fig. 13. (a-c) Various types of clash-freedom, and (d) memory dithering for type 3, using the same left neuronal structure from Fig. 4 as an example. The grids represent different access patterns for the same memory bank. The number in each cell represents the left neuron number whose parameter is stored in that cell. Cells sharing the same color are read in the same cycle.
When $z_i \ge d_i^{in}$, which is expected to be true for practical cases of implementing sparse NNs on powerful hardware devices, $S_{M_i}$ is also equal to the number of possible connection patterns $S_{C_i}$, which is the key quantity of interest. This is because if $z_i \ge d_i^{in}$, at least one right neuron is completely processed in some cycle. Thus, changing the left memory access pattern will change the left neurons to which that right neuron connects, thereby changing the connection pattern. This one-to-one correspondence results in $S_{M_i} = S_{C_i}$.
For the case of $z_i < d_i^{in}$, an FC junction provides an example where $S_{M_i} \ne S_{C_i}$. Specifically, in this case $S_{C_i} = 1$ as there is only one way to fully connect all neurons, but there are many clash-free memory access patterns, as shown in the following equations (10)-(12).
We now discuss various types of clash-freedom, and $S_{M_i}$ arising from each:
• Type 1: This is as described in Sec. III-C, and recapitulated in Fig. 13(a). $S_{M_i}$ is the number of ways of designing $\phi_i$, i.e.,
$$S_{M_i} = D_i^{z_i} \tag{10}$$
• Type 2 (implemented in our earlier work [40]): In this technique, a new $\phi_i$ is defined for every sweep. Considering the example in Fig. 13(b), $\phi_i = (1, 0, 2, 2)$ for sweep 0, but $(2, 0, 0, 0)$ for sweep 1. There will be $d_i^{out}$ different $\phi_i$ vectors for each junction, resulting in:
$$S_{M_i} = D_i^{z_i d_i^{out}} \tag{11}$$
• Type 3: In this technique, the constraint of cyclically accessing the left memories is also eliminated. Instead, any cycle can access any cell from each of the memories. This means that storing $\phi_i$ is not enough; the entire sequence of memory accesses needs to be stored as a matrix $\Phi_i \in \{0, 1, \cdots, D_i - 1\}^{D_i \times z_i}$. In Fig. 13(c) for example, $\Phi_i = ((1, 0, 2, 2), (0, 2, 1, 0), (2, 1, 0, 1))$ for sweep 0. Every sweep would also have a different $\Phi_i$, resulting in:
$$S_{M_i} = (D_i!)^{z_i d_i^{out}} \tag{12}$$

TABLE III
COMPARISON OF CLASH-FREE METHODS FOR A SINGLE JUNCTION $i$ WITH $(N_{i-1}, N_i, d_i^{out}, d_i^{in}, z_i) = (12, 12, 2, 2, 4)$

| Type | Memory Dithering | $S_{M_i}$ | Storage Cost to Compute Memory Addresses |
|------|------------------|-----------|------------------------------------------|
| 1    | No               | 81        | $z_i = 4$                                |
| 1    | Yes              | 486       | $2 z_i = 8$                              |
| 2    | No               | 6561      | $z_i d_i^{out} = 8$                      |
| 2    | Yes              | 236k      | $2 z_i d_i^{out} = 16$                   |
| 3    | No               | 1.68M     | $N_{i-1} d_i^{out} = 24$                 |
| 3    | Yes              | 60M       | $(N_{i-1} + z_i) d_i^{out} = 32$         |
A technique that can be applied to all the types of clash-freedom is memory dithering, which is a permutation of the $z_i$ memories (i.e., the columns) in a bank. This permutation can change every sweep, as shown in Fig. 13(d). Memory dithering incurs an additional address computation storage cost because of the $z_i$ permutation, but increases $S_{M_i}$ by a factor $K_i$. If $d_i^{in}/z_i$ is an integer, an integral number of cycles are required to process each right neuron. Since a cycle accesses all memories, dithering has no effect and $K_i = 1$. On the other hand, if $z_i/d_i^{in}$ is an integer greater than 1, the effects of dithering on connectivity patterns are only observed when switching from one right neuron to the next within a cycle. This results in
$$K_i = \left( \frac{z_i!}{\left(d_i^{in}!\right)^{z_i/d_i^{in}}} \right)^{d_i^{out}} \tag{13}$$
for types 2 and 3, and the $d_i^{out}$ exponent is omitted for type 1 since the access pattern does not change across sweeps. When neither $z_i$ nor $d_i^{in}$ perfectly divides the other, an exact value of $K_i$ is hard to arrive at since some proper or improper fraction of right neurons are processed every cycle. In such cases, $K_i$ is upper-bounded by $(z_i!)^{d_i^{out}}$.
Table III compares the count of possible left memory access patterns and the associated storage cost for computing memory addresses for types 1-3, with and without memory dithering. The junction used is the same as in Fig. 4, except that $N_i$ is raised to 12 so that $d_i^{in}$ becomes 2, which better shows the effects of memory dithering.
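The $S_{M_i}$ column of Table III follows directly from (10)-(13). The sketch below recomputes it for the junction $(N_{i-1}, N_i, d_i^{out}, d_i^{in}, z_i) = (12, 12, 2, 2, 4)$. The function names `dithering_factor` and `access_pattern_count` are hypothetical, and the code assumes $N_{i-1}$ is an integral multiple of $z_i$ and that one of $z_i$, $d_i^{in}$ divides the other, since no closed form for $K_i$ is given otherwise.

```python
from math import factorial


def dithering_factor(clash_type, d_out, d_in, z):
    """K_i from (13); equals 1 when d_in / z is an integer."""
    if d_in % z == 0:
        return 1
    if z % d_in != 0:
        raise ValueError("no closed form for K_i when neither divides the other")
    per_sweep = factorial(z) // factorial(d_in) ** (z // d_in)
    # Type 1 reuses one access pattern for all sweeps, so the d_out exponent is dropped
    return per_sweep if clash_type == 1 else per_sweep ** d_out


def access_pattern_count(clash_type, n_left, d_out, d_in, z, dithering=False):
    """S_{M_i} for clash-freedom types 1-3, following (10)-(12)."""
    if n_left % z != 0:
        raise ValueError("this sketch assumes z divides the number of left neurons")
    depth = n_left // z  # D_i = N_{i-1} / z_i
    if clash_type == 1:
        count = depth ** z
    elif clash_type == 2:
        count = depth ** (z * d_out)
    elif clash_type == 3:
        count = factorial(depth) ** (z * d_out)
    else:
        raise ValueError("clash_type must be 1, 2 or 3")
    if dithering:
        count *= dithering_factor(clash_type, d_out, d_in, z)
    return count


# Junction used in Table III
table = {
    (t, dith): access_pattern_count(t, n_left=12, d_out=2, d_in=2, z=4, dithering=dith)
    for t in (1, 2, 3)
    for dith in (False, True)
}
```

For this junction $D_i = 3$ and the per-sweep dithering factor is $4!/(2!)^2 = 6$, so the exact counts are 81, 486, 6561, 236196, 1679616 and 60466176, which the table rounds to 236k, 1.68M and 60M.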
REFERENCES
[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification
with deep convolutional neural networks,” in Proc. Advances in Neural
Information Processing Systems 25 (NIPS), 2012, pp. 1097–1105.
[2] A. Coates, B. Huval, T. Wang, D. J. Wu, A. Y. Ng, and B. Catanzaro,
“Deep learning with COTS HPC systems,” in Proc. 30th Int. Conf.
Machine Learning (ICML), vol. 28, 2013, pp. III–1337–III–1345.
[3] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and
connections for efficient neural networks,” in Proc. Advances in Neural
Information Processing Systems 28 (NIPS), 2015, pp. 1135–1143.
[4] N. P. Jouppi, C. Young, N. Patil et al., “In-datacenter performance
analysis of a tensor processing unit,” in 2017 ACM/IEEE 44th Annu.
Int. Symp. Computer Architecture (ISCA), June 2017.
[5] C. Szegedy, W. Liu, Y. Jia et al., “Going deeper with convolutions,” in
IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2015,
pp. 1–9.
[6] Y. Gong, L. Liu, M. Yang, and L. D. Bourdev, “Compressing deep
convolutional networks using vector quantization,” arXiv preprint
arXiv:1412.6115, 2014.
[7] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen,
“Compressing neural networks with the hashing trick,” in Proc. 32nd
Int. Conf. Machine Learning (ICML), 2015.
[8] S. Han, H. Mao, and W. Dally, “Deep compression: Compressing deep
neural networks with pruning, trained quantization and huffman coding,”
in Proc. Int. Conf. Learning Representations (ICLR), 2016.
[9] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and
W. J. Dally, “EIE: Efficient inference engine on compressed deep neural
network,” in Proc. 43rd Int. Symp. Computer Architecture (ISCA), 2016.
[10] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhut-
dinov, “Dropout: A simple way to prevent neural networks from overfit-
ting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958,
2014.
[11] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and
A. Moshovos, “Cnvlutin: Ineffectual-neuron-free deep neural network
computing,” in Proc. 2016 ACM/IEEE 43rd Annu. Int. Symp. Computer
Architecture (ISCA), 2016, pp. 1–13.
[12] B. Reagen, P. Whatmough, R. Adolf et al., “Minerva: Enabling low-
power, highly-accurate deep neural network accelerators,” in Proc. 2016
ACM/IEEE 43rd Annu. Int. Symp. Computer Architecture (ISCA), 2016,
pp. 267–278.
[13] A. Aghasi, A. Abdi, N. Nguyen, and J. Romberg, “Net-trim: Convex
pruning of deep neural networks with performance guarantee,” in Proc.
Advances in Neural Information Processing Systems 30 (NIPS), 2017.
[14] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured spar-
sity in deep neural networks,” in Proc. Advances in Neural Information
Processing Systems 29 (NIPS), 2016, pp. 2074–2082.
[15] V. Sindhwani, T. Sainath, and S. Kumar, “Structured transforms for
small-footprint deep learning,” in Proc. Advances in Neural Information
Processing Systems 28 (NIPS), 2015, pp. 3088–3096.
[16] S. Wang, Z. Li, C. Ding, B. Yuan, Y. Wang, Q. Qiu, and Y. Liang,
“C-LSTM: Enabling efficient LSTM using structured compression
techniques on FPGAs,” in Proc. ACM/SIGDA Int. Symp. Field-
Programmable Gate Arrays, 2018.
[17] A. Bourely, J. P. Boueri, and K. Choromonski, “Sparse neural network
topologies,” arXiv preprint arXiv:1706.05683, 2017.
[18] A. Prabhu, G. Varma, and A. M. Namboodiri, “Deep expander net-
works: Efficient deep networks from graph theory,” arXiv preprint
arXiv:1711.08757, 2017.
[19] D. C. Mocanu, E. Mocanu, P. Stone, P. H. Nguyen, M. Gibescu, and
A. Liotta, “Scalable training of artificial neural networks with adaptive
sparse connectivity inspired by network science,” Nature Communica-
tions, vol. 9, 2018.
[20] S. Dey, Y. Shao, K. M. Chugg, and P. A. Beerel, “Accelerating training
of deep neural networks via sparse edge processing,” in Proc. 26th Int.
Conf. Artificial Neural Networks (ICANN). Springer, Sep 2017, pp.
273–280.
[21] S. Dey, P. A. Beerel, and K. M. Chugg, “Interleaver design for deep
neural networks,” in Proc. 51st Asilomar Conf. Signals, Systems, and
Computers, Oct 2017, pp. 1979–1983.
[22] S. Dey, K. Huang, P. A. Beerel, and K. M. Chugg, “Characterizing sparse
connectivity patterns in neural networks,” in Proc. 2018 Information
Theory and Applications Workshop (ITA), Feb 2018, pp. 1–9.
[23] Y. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-efficient
reconfigurable accelerator for deep convolutional neural networks,”
IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, 2017.
[24] Y. Ma, N. Suda, Y. Cao, S. Vrudhula, and J. Seo, “ALAMO: FPGA
acceleration of deep learning algorithms with a modularized RTL
compiler,” Integration, the VLSI Journal, 2018.
[25] S. Zhang, Z. Du, L. Zhang et al., “Cambricon-X: An accelerator for
sparse neural networks,” in Proc. 2016 49th Annu. IEEE/ACM Int. Symp.
Microarchitecture (MICRO), 2016, pp. 1–12.
[26] N. Suda, V. Chandra, G. Dasika, A. Mohanty, Y. Ma, S. Vrudhula,
J.-s. Seo, and Y. Cao, “Throughput-optimized OpenCL-based FPGA
accelerator for large-scale convolutional neural networks,” in Proc. 2016
ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays. ACM, 2016,
pp. 16–25.
[27] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam,
“Diannao: A small-footprint high-throughput accelerator for ubiquitous
machine-learning,” in Proc. 19th Int. Conf. Architectural Support for
Programming Languages and Operating Systems (ASPLOS). ACM,
2014, pp. 269–284.
[28] G. Masera, G. Piccinini, M. R. Roch, and M. Zamboni, “VLSI ar-
chitectures for turbo codes,” IEEE Transactions on Very Large Scale
Integration (VLSI) Systems, vol. 7, no. 3, pp. 369–379, 1999.
[29] T. Brack, M. Alles, T. Lehnigk-Emden et al., “Low complexity LDPC
code decoders for next generation standards,” in Design, Automation &
Test in Europe Conf. & Exhibition (DATE). IEEE, 2007, pp. 1–6.
[30] S. Crozier and P. Guinand, “High-performance low-memory interleaver
banks for turbo-codes,” in Vehicular Technology Conf., 2001. VTC 2001
Fall. IEEE VTS 54th, vol. 4. IEEE, 2001, pp. 2394–2398.
[31] F. Sun, C. Wang, L. Gong, C. Xu, Y. Zhang, Y. Lu, X. Li, and X. Zhou,
“A high-performance accelerator for large-scale convolutional neural
networks,” in 2017 IEEE Int. Symp. Parallel and Distributed Processing
with Applications and 2017 IEEE Int. Conf. Ubiquitous Computing and
Communications (ISPA/IUCC), Dec 2017, pp. 622–629.
[32] C. Wang, L. Gong, Q. Yu, X. Li, Y. Xie, and X. Zhou, “DLAU: A
scalable deep learning accelerator unit on FPGA,” IEEE Transactions
on Computer-Aided Design of Integrated Circuits and Systems, vol. 36,
no. 3, pp. 513–517, March 2017.
[33] Y. LeCun, C. Cortes, and C. J. Burges, “The MNIST database of
handwritten digits,” http://yann.lecun.com/exdb/mnist/.
[34] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li, “RCV1: A new benchmark
collection for text categorization research,” Journal of machine learning
research, vol. 5, pp. 361–397, Apr 2004.
[35] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, N. L.
Dahlgren, and V. Zue, “TIMIT acoustic-phonetic continuous speech
corpus,” https://catalog.ldc.upenn.edu/LDC93S1.
[36] A. Krizhevsky, “Learning multiple layers of features from tiny images,”
Master’s thesis, University of Toronto, 2009.
[37] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press,
2016, http://www.deeplearningbook.org.
[38] J. Yosinski and H. Lipson, “Visually debugging restricted boltzmann
machine training with a 3D example,” in Proc. 29th Int. Conf. Machine
Learning (ICML), 2012.
[39] N. H. E. Weste and D. M. Harris, CMOS VLSI Design: A Circuits and
Systems Perspective, 4th ed. Pearson, 2010.
[40] S. Dey, D. Chen, Z. Li, S. Kundu, K. Huang, K. M. Chugg, and
P. A. Beerel, “A highly parallel FPGA implementation of sparse neural
network training,” in Proc. 2018 Int. Conf. Reconfigurable Computing
and FPGAs (ReConFig), Dec 2018, expanded pre-print version available
at https://arxiv.org/abs/1806.01087.
[41] P. Goyal, P. Dollár, R. B. Girshick et al., “Accurate, large minibatch
SGD: training ImageNet in 1 hour,” arXiv preprint arXiv:1706.02677,
2017.
[42] D. Masters and C. Luschi, “Revisiting Small Batch Training for Deep
Neural Networks,” arXiv preprint arXiv:1804.07612, 2018.
[43] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R.
Salakhutdinov, “Improving neural networks by preventing co-adaptation
of feature detectors,” arXiv preprint arXiv:1207.0580, 2012.
[44] K.-F. Lee and H.-W. Hon, “Speaker-independent phone recognition
using hidden markov models,” IEEE Transactions on Acoustics, Speech,
and Signal Processing, vol. 37, no. 11, pp. 1641–1648, Nov 1989.
[45] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers:
Surpassing human-level performance on ImageNet classification,” in
Proc. IEEE Int. Conf. Computer Vision (ICCV), 2015, pp. 1026–1034.
[46] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,”
in Proc. Int. Conf. Learning Representations (ICLR), 2014.
[47] S. Dey, K. M. Chugg, and P. A. Beerel, “Morse code datasets for
machine learning,” in Proc. 9th Int. Conf. Computing, Communication
and Networking Technologies (ICCCNT), Jul 2018, pp. 1–7.
[48] J. Ba, V. Mnih, and K. Kavukcuoglu, “Multiple object recognition with
visual attention,” arXiv preprint arXiv:1412.7755, 2014.
[49] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel,
and Y. Bengio, “Show, attend and tell: Neural image caption generation
with visual attention,” in Proc. 32nd Int. Conf. Machine Learning
(ICML), 2015, pp. 2048–2057.
[50] M. R. Osborne, B. Presnell, and B. Turlach, “A new approach to variable
selection in least squares problems,” IMA Journal of Numerical
Analysis, 2000.
[51] R. Jenatton, J. Audibert, and F. R. Bach, “Structured variable selection
with sparsity-inducing norms,” Journal of Machine Learning Research,
vol. 12, pp. 2777–2824, 2011.
