We propose a class of interleavers for a novel deep neural network (DNN) architecture that uses algorithmically predetermined, structured sparsity to significantly lower memory and computational requirements, and speed up training. The interleavers guarantee clash-free memory accesses to eliminate idle operational cycles, optimize spread and dispersion to improve network performance, and are designed to ease the complexity of memory address computations in hardware. We present a design algorithm with mathematical proofs for these properties. We also explore interleaver variations and analyze the behavior of neural networks as a function of interleaver metrics.
I. INTRODUCTION
DNNs in machine learning systems are critical drivers of new technologies such as speech processing and autonomous vehicles. Modern DNNs typically have millions of parameters [1], which make them difficult to implement in hardware and slow to train [2]. A suggested solution to these problems is a sparse network, where some form of compression or deletion is employed to reduce the number of parameters [3], [4]. However, an issue with sparse networks is that some neurons may get completely disconnected from neighboring layers and have no effect on the output [5]. A second issue arises when all the neurons in a certain layer which connect to a certain neuron in the next layer are 'close together', such as coming from nearby pixels in an image. This issue is similar to that of convolutional layers, which are known to be inadequate for classification without the presence of fully connected (FC) classification layers [1], [2]. The term 'layer' will henceforth refer to a classification layer, which is what this work deals with.
We are investigating a class of hardware-optimized DNN architectures which use pre-defined sparsity, wherein a connection pattern is algorithmically defined using an interleaver, or permutation, for every junction between 2 layers prior to training. This has the potential to achieve higher training speed and lower storage complexity compared to approaches which start training the full network and then remove parameters [6] , [7] . A related paper [8] has demonstrated that our approach can reduce the memory footprint of FC layers in CNNs by 457x without performance degradation.
This paper is a follow-up to our previous work [9] and focuses on the design and analysis of interleavers suited to the requirements of our hardware architecture, which is reviewed in Section II. The key contributions of this paper are:
1) Mathematical formalizations of desirable properties of a class of interleavers usable in DNNs (Section III). The interleavers should implement pseudo-random connection patterns between layers so as to achieve:
a) Flexible degrees of sparsity in the junctions, while preventing neurons from getting disconnected.
b) Maximum operational efficiency by avoiding pipeline stalls.
c) Ease of address computation for on-chip memories.
2) An algorithm to design such interleavers (Section IV-A) and mathematical proofs to show that it satisfies the requirements (Section IV-B).
3) Possible variations in interleaver design (Section IV-C).
4) Relations between network performance and interleaver metrics such as spread and dispersion (Section IV-D), explored through training on different datasets.
II. HARDWARE ARCHITECTURE
A DNN is made up of layers of neurons, and junctions connecting adjacent layers via weights, or edges. We will use p and n to represent the number of neurons in the preceding (left) and succeeding (right) layers, respectively, of any junction. Every left neuron has a fixed number of edges going from it to the right, and every right neuron has a fixed number of edges coming into it from the left. These numbers are defined as fan-out (fo) and fan-in (fi), respectively. For a conventional FC junction, fo = n and fi = p. Every neuron has associated activation and delta values which are used in the 3 operations: feedforward (FF), backpropagation (BP), and update (UP).
A. Our Architecture
For our sparse architecture, fo < n and fi < p, such that p × fo = n × fi = W, the total number of edges in the junction. The edges are sequentially indexed on the right side; for example, the 1st right neuron has weights w_0 to w_{fi−1}. Motivated by the fact that the weights feature in all 3 network operations, we designed an edge-processing architecture where every junction has a degree of parallelism (DoP), denoted as z, which is the number of weights processed in parallel (z is chosen such that p/z is an integer). All the weights in each junction are stored in a bank of z memories, each having W/z cells, as shown in Fig. 1. This means that 2 weights with indices i and j (i.e. weights w_i and w_j) are in the same memory if i%z = j%z, where % is the modulo operator. If instead ⌊i/z⌋ = ⌊j/z⌋, where ⌊·⌋ is the floor function, then the weights are in the same row of different memories.
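As a minimal sketch of this storage scheme (the helper name `weight_location` is hypothetical, and z = 8 is chosen only for illustration), the memory and row holding a weight follow directly from its index:

```python
def weight_location(i, z):
    """Return (memory, row) of weight w_i in a bank of z weight memories."""
    return i % z, i // z


z = 8  # illustrative degree of parallelism, not a value fixed by the text here

# w_5 and w_13 share memory 5 but sit in different rows.
same_memory = weight_location(5, z)[0] == weight_location(13, z)[0]

# w_5 and w_6 sit in the same row of different memories,
# so they are read together in one cycle.
same_row = weight_location(5, z)[1] == weight_location(6, z)[1]
```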
Similar to the weights, all the activation and delta values of each layer are numbered and stored in separate banks of z memories each. For example, the left layer activations are numbered from a 0 for the first neuron to a p−1 for the last, and each activation memory would have p/z elements. The edges coming into a junction from the left pass through a weight interleaver (π W ) before getting connected to the right. For example, say 4 edges come out of the 1st neuron of a certain layer of a network which has a 100-neuron layer following it. These edges might connect to the 9th, 30th, 67th and 84th neurons of the following layer.
A single cycle of processing (say the kth) comprises accessing the kth cell in each of the z weight memories. This implies reading all z values from the kth row of the bank, which we refer to as natural order access, as shown in Fig.  1 . The interleaver determines which neurons in the left layer are connected to those z edges. In general, these could be any z neurons in the left layer. So the activation memories are accessed in permuted order. Fig. 2 shows this through an example where z is 6 and f i is 3. Note that all the entries in the left activation memory bank are read f o times, since that many weights belong to the same neuron and share the same activation value. Each stage of processing where all the activations are read once is referred to as a sweep, which consists of p/z cycles. One complete operation such as FF consists of f o sweeps, i.e. p × f o/z = W/z cycles, which are collectively referred to as 1 block cycle.
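The cycle counts above reduce to simple arithmetic. The sketch below uses illustrative values (p = 32, fo = 2, z = 8) and a hypothetical helper name; it only restates the relations sweep = p/z cycles and block cycle = W/z cycles.

```python
def cycle_counts(p, f_o, z):
    """Return (cycles per sweep, cycles per block cycle) for one junction."""
    assert p % z == 0, "z is chosen such that p/z is an integer"
    sweep = p // z        # every left activation is read exactly once
    block = f_o * sweep   # f_o sweeps, i.e. W/z cycles in total
    return sweep, block


p, f_o, z = 32, 2, 8      # illustrative values
W = p * f_o               # total number of edges in the junction
sweep, block = cycle_counts(p, f_o, z)
```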
B. Merits of our Architecture
Since there is significant data reuse between FF, BP and UP, we use operational parallelization to make all of them occur concurrently. Since every operation in a junction uses data generated by an adjacent junction or layer, we designed a junction pipelining architecture where all the junctions execute all 3 operations concurrently on different inputs from the training set. This enables our architecture to achieve a 3(L−1) times speedup for L layers. See [9] for a complete description.
Note that z can be set to any value as per the overall area-speed tradeoff desired. The number of clock cycles to process each junction can be made equal by adjusting z for each individually. This ensures an always full pipeline and no stalls. Thus, the size and complexity of the network is decoupled from the hardware resources available. Our architecture can be reconfigured to varying amounts of fan-out and sparsity, which makes it adaptable to a large class of DNNs. This speedup and flexibility gives us the potential to achieve online training, as compared to inference-only works such as [4].
III. INTERLEAVER REQUIREMENTS
An interleaver π operates on elements i from a list x with cardinality N and produces rearranged list elements π(i). We will follow the convention that x = {0, 1, ..., N − 1}.
As an example, let x = {0, 1, 2, 3}, i.e. N = 4. Then π(x) = {π(0), π(1), π(2), π(3)}, such as {1, 3, 2, 0}. Interleaver patterns can be visualized by plotting π(x) vs. x.
A. Clash Freedom
As mentioned before, the activation memories are accessed in permuted order. For any weight index i, the corresponding left activation index is ⌊π_W(i)/fo⌋. The z activations read in a cycle should come from z different left neurons in order to achieve optimum spatial spread. Moreover, these z values should be stored in z different activation memories. Violating this condition leads to the same memory needing to be accessed more than once in the same cycle, i.e. a clash, which stalls processing. Notice that Fig. 2 is free from clashes since all the columns in permuted order accesses have exactly 1 shaded cell. Clash-freedom is mathematically expressed as follows. If

$$\lfloor i/z \rfloor = \lfloor j/z \rfloor \tag{1a}$$

then we need

$$\lfloor \pi_W(i)/f_o \rfloor \,\%\, z \neq \lfloor \pi_W(j)/f_o \rfloor \,\%\, z \tag{1b}$$

where i ≠ j is implicitly assumed here, and in the future. Equation (1) implies that for 2 weights w_i and w_j read in the same cycle, their left activations must be in different memories.
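The clash-freedom condition can be checked by brute force. The following is a sketch, not part of the hardware design: the function name is hypothetical, and the two toy permutations (p = 4, fo = 2, z = 2) are made up purely to show one failing and one passing case.

```python
def is_clash_free(pi_w, f_o, z):
    """True if, in every cycle, the z weights read together trace back
    to left activations stored in z different activation memories."""
    for start in range(0, len(pi_w), z):
        # Activation memory of each weight read in this cycle.
        memories = {(pi_w[i] // f_o) % z for i in range(start, start + z)}
        if len(memories) != z:
            return False
    return True


# Toy junction: p = 4 left neurons, f_o = 2, z = 2, so W = 8 weights.
identity = list(range(8))                 # weights 0 and 1 share neuron 0
separated = [0, 2, 4, 6, 1, 3, 5, 7]      # each cycle hits memories 0 and 1

identity_ok = is_clash_free(identity, f_o=2, z=2)
separated_ok = is_clash_free(separated, f_o=2, z=2)
```

The identity pattern fails because both weights of the first cycle belong to the same left neuron, which is exactly the situation the interleaver must avoid.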
B. Ease of Memory Address Computation
The interleaver should be designed so that the addresses of the activation memories (accessed in permuted order) can be easily computed in any cycle. This can be done by defining a starting cell index -to be used in the first cycle of every sweep -for each activation memory. Cell indices for the following cycles are obtained by adding 1 each time to the starting index, and cycling back to the first cell after reaching the last.
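As a concrete example, assume p = 32, fo = 2, and z = 8. This leads to the activation memory mapping shown in Fig. 3. Let us define the starting cell indices for the 8 activation memories as s = {2, 0, 3, 1, 2, 0, 3, 1}. Then the cells read in the next cycle will be (s + 1)%4 = {3, 1, 0, 2, 3, 1, 0, 2}, and so on until all 4 cycles in the sweep are completed. This can be mathematically expressed as follows. If

$$\lfloor i/z \rfloor \neq \lfloor j/z \rfloor \tag{2a}$$

and

$$\lfloor \pi_W(i)/f_o \rfloor \,\%\, z = \lfloor \pi_W(j)/f_o \rfloor \,\%\, z \tag{2b}$$

then we need

$$\left( \lfloor i/z \rfloor - \lfloor j/z \rfloor \right) \%\, (p/z) = \left( \left\lfloor \frac{\lfloor \pi_W(i)/f_o \rfloor}{z} \right\rfloor - \left\lfloor \frac{\lfloor \pi_W(j)/f_o \rfloor}{z} \right\rfloor \right) \%\, (p/z) \tag{2c}$$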
Equations (2a) and (2b) consider 2 weights with indices i and j such that they are in different cycles and the left neurons to which they connect are in the same activation memory. Then, (2c) states that the difference in cycle numbers should be equal to the difference in activation memory row numbers. This leads to ease of address computation.
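The address generation described above can be sketched in a few lines, using the starting indices s from the example. The function name `sweep_addresses` is hypothetical; in hardware this would be one small counter per activation memory.

```python
def sweep_addresses(s, rows):
    """Row read from each activation memory in every cycle of one sweep.

    s    : starting row index for each of the z activation memories
    rows : number of rows per memory (p/z)
    """
    return [[(start + k) % rows for start in s] for k in range(rows)]


s = [2, 0, 3, 1, 2, 0, 3, 1]           # starting cell indices from the example
addresses = sweep_addresses(s, rows=4)  # p/z = 32/8 = 4 cycles per sweep

# Over one sweep, every memory should visit each of its rows exactly once.
rows_per_memory = [sorted(cycle[m] for cycle in addresses) for m in range(len(s))]
```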
C. Optional Requirements -Spread and Dispersion
Spread is a standard interleaver metric which, when maximized, ensures that for 2 weights that are close together on the right (such as going to the same neuron), the neurons from which they come on the left are spaced well apart. Spread is classically defined [10] as:

$$S = \min_{i \neq j} \left( |\pi(i) - \pi(j)| + |i - j| \right) \tag{3}$$
Normalized dispersion, which we will simply refer to as dispersion, is another standard metric measuring the randomness in the connection pattern. For example, if the 1st left neuron connects to the 10th, 20th and 30th right neurons, and the 2nd left neuron connects to the 11th, 21st and 31st right neurons, the pattern is quite regular and not well dispersed. Dispersion is classically defined [11] as the cardinality of the set

$$D = \{ (j - i,\ \pi(j) - \pi(i)) \mid 0 \le i < j < N \} \tag{4}$$

normalized by N(N − 1)/2. The effects of spread and dispersion on network performance are discussed in Section IV-D.
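Both metrics are easy to compute by brute force for small interleavers. The sketch below rests on two assumptions that should be checked against [10] and [11]: spread is taken as the minimum of |π(i) − π(j)| + |i − j| over all pairs with non-circular distances, and dispersion is taken as the number of distinct displacement vectors (j − i, π(j) − π(i)) over pairs i < j, divided by N(N − 1)/2. The function names are illustrative.

```python
from itertools import combinations


def spread(pi):
    """Minimum of |pi(i) - pi(j)| + |i - j| over all index pairs (assumed form)."""
    return min(abs(pi[i] - pi[j]) + abs(i - j)
               for i, j in combinations(range(len(pi)), 2))


def dispersion(pi):
    """Fraction of index pairs i < j with a distinct displacement vector (assumed form)."""
    n = len(pi)
    vectors = {(j - i, pi[j] - pi[i]) for i, j in combinations(range(n), 2)}
    return len(vectors) / (n * (n - 1) / 2)


linear = [0, 1, 2, 3]      # perfectly regular pattern
example = [1, 3, 2, 0]     # the permutation used as an example in Section III
```

Under these assumptions the regular pattern has dispersion 0.5 and the example permutation has dispersion 1.0, since all 6 of its displacement vectors are distinct; both have spread 2.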
IV. INTERLEAVER DESIGN

A. Algorithm
Given the requirements of the DNN, we developed the following algorithm to design a suitable class of interleavers:
1) Let r be a random permutation of [0, p/z − 1].
2) Create list s with z elements according to:
a) If z ≥ p/z: Replicate r z/(p/z) times.
b) If z < p/z: Take the 1st z elements of r.
3) Create list t with p elements by concatenating s, (s + 1)%(p/z), ..., (s + p/z − 1)%(p/z). t acts as an activation interleaver (π_A), from which π_W can be obtained.
4) Let t[x] denote the xth element of t. Then:

$$\pi_W(i) = \left( t[i \,\%\, p] \times z + i \,\%\, z \right) \times f_o + \lfloor i/p \rfloor \tag{5}$$
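The steps above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the function name and the `seed` parameter are hypothetical, z is assumed to be a multiple of p/z whenever z ≥ p/z (true when all quantities are powers of 2), and π_W is computed as (t[i%p] × z + i%z) × fo + ⌊i/p⌋, the form consistent with the worked example that follows.

```python
import random


def design_interleaver(p, f_o, z, seed=None):
    """Return (s, t, pi_w) for a junction with p left neurons, fan-out f_o, DoP z."""
    assert p % z == 0, "z is chosen such that p/z is an integer"
    rows = p // z
    rng = random.Random(seed)

    # Step 1: random permutation of [0, p/z - 1].
    r = list(range(rows))
    rng.shuffle(r)

    # Step 2: starting row for each of the z activation memories.
    if z >= rows:
        assert z % rows == 0, "assumed: z is a multiple of p/z"
        s = r * (z // rows)
    else:
        s = r[:z]

    # Step 3: concatenate s, (s + 1) % rows, ..., (s + rows - 1) % rows.
    t = [(start + k) % rows for k in range(rows) for start in s]

    # Step 4: weight interleaver over all W = p * f_o weights.
    pi_w = [(t[i % p] * z + i % z) * f_o + i // p for i in range(p * f_o)]
    return s, t, pi_w


s, t, pi_w = design_interleaver(p=32, f_o=2, z=8, seed=0)
s2, t2, pi_w2 = design_interleaver(p=64, f_o=2, z=4, seed=0)   # case z < p/z
```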
Consider the prior example from Section III-B. Say r = {2, 0, 3, 1}. Since z ≥ p/z, s = {2, 0, 3, 1, 2, 0, 3, 1}. Since p/z = 4, t = {2,0,3,1,2,0,3,1,3,1,0,2,3,1,0,2,0,2,1,3,0,2,1,3,1,3,2,0,1,3,2,0}. There are 64 weights. Say we are in cycle 5, where one of the weights read is w 45 . Using (5), t[45%32] = t[13] = 1. This gives the row number in the left activation memory bank. The term i%z is the bank column, which is 45%8 = 5. Now the key purpose of the interleaver equation, which is to compute the addresses of the activation memory bank used in a cycle, is served. Since our architecture uses powers of 2 for all the key variables, operations such as multiplication, modulo and flooring reduce to simple bit shifts and bit selects.
The remainder of (5) serves the purely mathematical purpose of completely characterizing π_W as a permutation of 64 weights. Multiplying the bank row by z = 8 and adding the bank column gives the left neuron number from where the weight comes into the junction, i.e. 1×8+5 = 13. Multiplying this by fo = 2 takes us from the activation space to the weight space, while the final addition of ⌊45/32⌋ = 1 adds an offset to indicate that it's the 2nd weight from neuron 13. The final index of the weight on the left side is 27. Thus, π_W(45) = 27.
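The worked example can be replayed step by step. The list t is copied from the example above; the function and variable names are illustrative only.

```python
# Activation interleaver t from the example (p = 32, f_o = 2, z = 8).
t = [2, 0, 3, 1, 2, 0, 3, 1,
     3, 1, 0, 2, 3, 1, 0, 2,
     0, 2, 1, 3, 0, 2, 1, 3,
     1, 3, 2, 0, 1, 3, 2, 0]
p, f_o, z = 32, 2, 8


def trace_weight(i):
    """Return (bank row, bank column, left neuron, pi_W(i)) for weight w_i."""
    row = t[i % p]              # row in the left activation memory bank
    col = i % z                 # column, i.e. which activation memory
    neuron = row * z + col      # left neuron the edge comes from
    left_index = neuron * f_o + i // p   # offset picks the edge within that neuron
    return row, col, neuron, left_index


row, col, neuron, left_index = trace_weight(45)
pi_w = [trace_weight(i)[3] for i in range(p * f_o)]
```

Collecting `left_index` for all 64 weights yields every value from 0 to 63 exactly once, which is what characterizes π_W as a permutation.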
B. Meeting Requirements
Now we will prove that given the interleaver design equation (5), the requirements in (1) and (2) are satisfied.
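1) Clash Freedom: Proof: Since W = p × fo, the ⌊i/p⌋ term in (5) is in the range [0, fo − 1]. Then we get:

$$\lfloor \pi_W(i)/f_o \rfloor = t[i \,\%\, p] \times z + i \,\%\, z \tag{6}$$

It is given from (1a) that ⌊i/z⌋ = ⌊j/z⌋, but i ≠ j as usual. So it must be that i%z ≠ j%z. This implies that:

$$\lfloor \pi_W(i)/f_o \rfloor \,\%\, z = i \,\%\, z \neq j \,\%\, z = \lfloor \pi_W(j)/f_o \rfloor \,\%\, z \tag{7}$$

which satisfies (1b).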
2) Ease of Memory Address Computation:
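Proof: Firstly, note that using (6) and (7), (2b) can be written as i%z = j%z. Secondly, using (6):

$$\left\lfloor \frac{\lfloor \pi_W(i)/f_o \rfloor}{z} \right\rfloor = t[i \,\%\, p] \tag{8}$$

So the right hand side of (2c) can be written as (t[i%p] − t[j%p])%(p/z). t is constructed by concatenating s repeatedly with some changing offset added to it every time. Using this, and the fact that s has z elements, we get:

$$t[i \,\%\, p] = \left( s[i \,\%\, z] + \lfloor (i \,\%\, p)/z \rfloor \right) \%\, (p/z) \tag{9}$$

$$t[j \,\%\, p] = \left( s[j \,\%\, z] + \lfloor (j \,\%\, p)/z \rfloor \right) \%\, (p/z) \tag{10}$$

So the modified right hand side of (2c) now becomes:

$$\Big( \left( s[i \,\%\, z] + \lfloor (i \,\%\, p)/z \rfloor \right) \%\, (p/z) - \left( s[j \,\%\, z] + \lfloor (j \,\%\, p)/z \rfloor \right) \%\, (p/z) \Big) \,\%\, (p/z) \tag{11}$$

We will use 2 mathematical theorems in this proof. Given any 3 positive integers a, b and c, firstly:

$$(a \,\%\, c - b \,\%\, c) \,\%\, c = (a - b) \,\%\, c \tag{12}$$

Secondly, if b is an integral multiple of c:

$$\lfloor a/c \rfloor \,\%\, (b/c) = \lfloor (a \,\%\, b)/c \rfloor \tag{13}$$

Using (12) and the fact that i%z = j%z, so that s[i%z] = s[j%z], (11) becomes:

$$\left( \lfloor (i \,\%\, p)/z \rfloor - \lfloor (j \,\%\, p)/z \rfloor \right) \%\, (p/z) \tag{14}$$

Using (12), (13) and the fact that p is an integral multiple of z, the left hand side of (2c) becomes:

$$\left( \lfloor i/z \rfloor - \lfloor j/z \rfloor \right) \%\, (p/z) = \left( \lfloor (i \,\%\, p)/z \rfloor - \lfloor (j \,\%\, p)/z \rfloor \right) \%\, (p/z) \tag{15}$$

which equals the right hand side of (2c), as obtained in (14). Thus, the requirement in (2c) is satisfied.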
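As a complement to the proofs, both requirements can be checked numerically on generated interleavers. The sketch below is a brute-force sanity check under assumptions: `build` is a hypothetical helper that follows the design steps of Section IV-A with π_W taken as (t[i%p] × z + i%z) × fo + ⌊i/p⌋, and the three configurations are arbitrary powers of 2.

```python
import random


def build(p, f_o, z, rng):
    """Weight interleaver pi_W following the design steps (assumed form)."""
    rows = p // z
    r = rng.sample(range(rows), rows)                 # random permutation
    s = r * (z // rows) if z >= rows else r[:z]
    t = [(a + k) % rows for k in range(rows) for a in s]
    return [(t[i % p] * z + i % z) * f_o + i // p for i in range(p * f_o)]


def satisfies_requirements(pi_w, p, f_o, z):
    """Check clash freedom and the address-increment property over all pairs."""
    rows = p // z
    mem = [(x // f_o) % z for x in pi_w]    # activation memory of each weight
    row = [(x // f_o) // z for x in pi_w]   # row within that memory
    for i in range(len(pi_w)):
        for j in range(len(pi_w)):
            if i == j or mem[i] != mem[j]:
                continue
            if i // z == j // z:
                return False                # same cycle, same memory: a clash
            # Different cycles, same memory: row gap must match cycle gap.
            if (i // z - j // z) % rows != (row[i] - row[j]) % rows:
                return False
    return True


rng = random.Random(0)
configs = [(32, 2, 8), (64, 4, 4), (16, 2, 16)]      # (p, f_o, z), illustrative
results = [satisfies_requirements(build(p, f, z, rng), p, f, z)
           for p, f, z in configs]
identity_passes = satisfies_requirements(list(range(64)), 32, 2, 8)
```

A passing check on a few configurations does not replace the proof; it only guards against transcription errors in the equations.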
C. Variations
The basic π_W described so far has excellent spread, but poor dispersion. We experimented with the following variations:
3) Memory Dither (MD): Equation (5) reveals that for any cycle, the weight read from the ith weight memory (i ∈ [0, z − 1]) will always trace back to a left activation value stored in the ith activation memory. This trait can be removed and dispersion increased by replacing the 'activation memory number generating' term i%z in (5) with v[i%z], where v is a random permutation of [0, z − 1]. The revised equation is:

$$\pi_W(i) = \left( t[i \,\%\, p] \times z + v[i \,\%\, z] \right) \times f_o + \lfloor i/p \rfloor$$
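Memory dither can be illustrated on the earlier example (p = 32, fo = 2, z = 8). In this sketch the permutation v is an arbitrary made-up choice, the names are hypothetical, and the formula is the basic interleaver equation with i%z replaced by v[i%z], as described above.

```python
t = [2, 0, 3, 1, 2, 0, 3, 1,
     3, 1, 0, 2, 3, 1, 0, 2,
     0, 2, 1, 3, 0, 2, 1, 3,
     1, 3, 2, 0, 1, 3, 2, 0]
p, f_o, z = 32, 2, 8
v = [3, 7, 0, 5, 1, 6, 2, 4]   # hypothetical permutation of [0, z - 1]


def pi_w_md(i):
    """Weight interleaver with memory dither: column i % z is remapped through v."""
    return (t[i % p] * z + v[i % z]) * f_o + i // p


md = [pi_w_md(i) for i in range(p * f_o)]

# Weight memory m now traces back to activation memory v[m] instead of m.
traced_memory = [(md[i] // f_o) % z for i in range(z)]
```

Because v is itself a permutation, the z weights read in a cycle still land in z different activation memories, so the dither changes dispersion without introducing clashes.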
D. Analysis and Results
Table I lists average spread and dispersion (disp.) over 100 iterations of all possible variations of π W and corresponding π A . Some of the patterns are shown in Figs. 4 and 5. Note that the basic π W is the most linear, which leads to maximum spread and minimum dispersion. SS offers lesser spread and more dispersion for π W , but no effect is observed on π A . This is because SS affects different sweeps which have different weights, but same activations. SV offers slight increase in dispersion, but severe reduction in spread. This is because the SV pattern has lines with slope identical to basic, but each line is permuted, leading to left neurons getting bunched up. Introducing MD leads to big increases in dispersion, which are further increased for π W when combined with SS. This is observed in the figures, where the MD patterns are irregular. Fig. 6 shows results of all the possible interleaver variations implemented on networks trained for 10 epochs, with classification accuracy on validation data used as the performance metric. We used 3 datasets of different dimensionalities: , leading to an overall density of 37.5%. We observed that interleaver variations have negligible effect on classification accuracy of MNIST and CIFAR10 datasets. Note that for these datasets, the distinction between output classes is well pronounced. In MNIST for example, an image of a handwritten 7 is very different from a handwritten 0. Moreover, since the inputs in CIFAR10 are pre-processed by convolutional and pooling layers, the relative importance of the final classification layers is reduced.
For the Morse dataset, however, a clear trend of high dispersion hurting performance is observed. The 4 variations with MD have dispersion ≥ 0.5 and barely reach 80% accuracy, while the ones without MD have dispersions ≤ 0.2 and achieve ≥ 90% accuracy. This dichotomy is further highlighted in Fig. 7. We hypothesize that this is due to the Morse dataset having lower redundancy compared to the other 2, since it has fewer input neurons and more output classes with little distinction between them. We are currently working on theories to better explain the link between dataset redundancy and high dispersion of junction connection patterns degrading performance.
V. CONCLUSION
This work presents a new way to design DNNs in hardware by interleaving edges between neurons and processing a programmable number of edges in parallel. The interleaver needs to be designed so as to achieve optimum network runtime efficiency on hardware. At the same time, performance needs to be maximized by selecting an interleaver with desirable metrics. We present an algorithm to satisfy interleaver requirements and investigate possible variations to it and their effects.

Fig. 7. Classification accuracy vs. epochs obtained using different interleavers by training a 37.5% dense network on the Morse dataset.
One limitation of these interleavers is that they characterize a single junction. To completely characterize a sparse network, it is desirable to have formulations which describe connection patterns in the whole network, such as which outputs connect to which inputs. We are currently working on the theory of adjacency matrices, which have elements corresponding to connections between any 2 neurons in any 2 layers, and exploring metrics which act as better proxies for performance.
