Abstract-We demonstrate an FPGA implementation of a parallel and reconfigurable architecture for sparse neural networks, capable of on-chip training and inference. The network connectivity uses pre-determined, structured sparsity to significantly reduce complexity by lowering memory and computational requirements. The architecture uses a notion of edge-processing, leading to efficient pipelining and parallelization. Moreover, the device can be reconfigured to trade off resource utilization with training time to fit networks and datasets of varying sizes. The combined effects of complexity reduction and easy reconfigurability enable significantly greater exploration of network hyperparameters and structures on-chip. As proof of concept, we show implementation results on an Artix-7 FPGA.
I. INTRODUCTION
Neural networks (NNs) in machine learning systems are critical drivers of new technologies such as image processing and speech recognition. Modern NNs are built as graphs with millions of trainable parameters [1] - [3] , which are tuned until the network converges. This parameter explosion demands large amounts of memory for storage and logic blocks for operation, which make the process of training difficult to perform on-chip. As a result, most hardware architectures for NNs perform training off-chip on power-hungry CPUs/GPUs or the cloud, and only support inference capabilities on the final FPGA or ASIC device [4] - [10] . Unfortunately, offchip training results in a non-reconfigurable network being implemented on-chip which cannot support training time optimizations over model architecture and hyperparameters. This severely hinders the development of independent NN devices which a) dynamically adapt themselves to new models and data, and b) do not outsource their training to costly cloud computation resources or data centers which exacerbate problems of large energy consumption [11] .
Training a network with too many parameters makes it likely to overfit [12] , and memorize undesirable noise patterns [13] . Recent works [14] - [17] have shown that the number of parameters in NNs can be significantly reduced without An abridged version of this work was accepted as a short paper at ReConFig: 2018 International Conference on Reconfigurable Computing and FPGAs. This is the full version of this work. degradation in performance. This motivates our present work, which is to train NNs with reduced complexity and easy reconfigurability on FPGAs. This is achieved by using predefined sparsity [14] , [15] , [18] . Compared to other methods of parameter reduction such as [5] , [10] , [19] - [21] , pre-defined sparsity does not require additional computations or processing to decide which parameters to remove. Instead, most of the weights are always absent, i.e. sparsity is enforced prior to training. This results in a sparse network of lesser complexity as compared to a conventional fully connected (FC) network. Therefore the memory and computational burdens posed on hardware resources are reduced, which enables us to accomplish training on-chip. Section II describes pre-defined sparsity in more detail, along with a hardware architecture introduced in [14] which exploits it.
A key factor in NN hardware implementation is finite bit width effect. A previous FPGA implementation [22] used fixed point adders, but more resource-intensive floating point multipliers and floating-to-fixed-point converters. Another previous implementation [23] used probabilistic fixed point rounding techniques, which incurred additional DSP resources. Keeping hardware simplicity in mind, our implementation uses only fixed point arithmetic with clipping of large values.
The major contributions of the present work are summarized here and described in detail in Section III:
• The first implementation of NNs which can perform both training and inference on FPGAs by exploiting parallel edge processing. The design is parametrized and can be easily reconfigured to fit on FPGAs of varying capacity. • A low complexity design which uses pre-defined sparsity while maintaining good network performance. To the best of our knowledge, this is the first NN implementation on FPGA exploiting pre-defined sparsity.
• Theoretical analysis and simulation results which show that sparsity leads to reduced dynamic range and is more tolerant to finite bit width effects in hardware.
II. SPARSE HARDWARE ARCHITECTURE
A. Pre-defined Sparsity
Our notation treats the input of a NN as layer 0 and the output as layer L. The number of neurons in the layers are
The NN has L junctions in between the layers, with N i−1 and N i respectively being the number of neurons in the earlier (left) and later (right) layers of junction i. Every left neuron has a fixed number of edges (or weights) going from it to the right, and every right neuron has a fixed number of edges coming into it from the left. These numbers are defined as out-degree (d ensures that all neurons in a junction contribute equally and none of them get disconnected, since that would lead to a loss of information. The connection density in junction i is given as W i /(N i−1 N i ) and the overall connection density of the network is defined as
Previous works [14] , [15] have shown that overall density levels of < 10% incur negligible performance degradationwhich motivates us to implement such low density networks on hardware in the present work.
B. Hardware Architecture
This subsection describes the mathematical algorithm and the subsequent hardware architecture for a NN using predefined sparsity. The input layer, i.e. the leftmost, is fed activations (a 0 ) from the input data. For an image classification problem, these are image pixel values. Then the feedforward (FF) operation proceeds as described in eq. (1):
Both eqs. (1a) and (1b) are ∀j ∈ {1, · · · , N i }, ∀i ∈ {1, · · · , L}. Here, a is activation,ȧ is its derivative (a-dot), b is bias, w is weight, and σ and σ are respectively the activation function and its derivative (with respect to its input), which are described further in Section III. For a,ȧ and b, subscript denotes layer number and superscript denotes a particular neuron in a layer. For the weights, w
denotes the weight in junction i which connects neuron k f in layer i−1 to neuron j in layer i. The summation for a particular right neuron j is carried out over all d in i weights and left neuron activations which connect to it, i.e. k f ∈ {1, · · · , N i−1 }. These left indexes are arbitrary because the weights in a junction are interleaved, or permuted. This is done to ensure good scatter, which has been shown to enhance performance [15] .
The output layer activations a L are compared with the ground truth labels y which are typically one-hot encoded, i.e. y (j) , ∀j ∈ {1, · · · , N L }, is 1 if the class represented by output neuron j is the true class of the input sample, otherwise 0. We use the cross-entropy cost function for optimization, the derivative of which with respect to the activations is a L − y. We also experimented with quadratic cost, but its performance was inferior compared to cross-entropy. The backpropagation (BP) operation proceeds as described in eq. (2):
The summation for a particular left neuron j is carried out over all d
out i weights and right neuron deltas which connect to it, i.e. k f ∈ {1, · · · , N i+1 }. The right indexes are arbitrary due to interleaving.
Based on the δ values, the trainable weights and biases have their values updated and the network learns. We used the gradient descent algorithm, so the update (UP) operation proceeds as described in eq. (3):
where η is the learning rate hyperparameter. Both eqs. (3a) and (3b) are ∀i ∈ {1, · · · , L}. While eq. (3a) is ∀j ∈ {1, · · · , N i }, eq. (3b) is only for those j ∈ {1, · · · , N i } and k ∈ {1, · · · , N i−1 } which are connected by a weight w
The architecture uses a) operational parallelization to make FF, BP and UP occur simultaneously in each junction, and b) junction pipelining wherein all the junctions execute all 3 operations simultaneously on different inputs. Thus, there is a factor of 3L speedup as compared to doing 1 operation at a time, albeit at the cost of increased hardware resources. Fig.  1 shows the architecture in action. As an example, consider L = 2, i.e. the network has an input layer, a single hidden layer, and an output layer. When the second junction is doing FF and computing cost on input n + 1, it is also doing BP on the previous input n which just finished FF, as well as updating (UP) its parameters from the finished cost computation results of input n. Simultaneously, the first junction is doing FF on the latest input n + L = n + 2, and UP using the finished BP results of input n − (L − 1) = n − 1. BP does not occur in the first junction because there are no δ 0 values to be computed.
The architecture uses edge processing by making every junction have a degree of parallelism z i , which is the number Fig. 2 . Example of clash-freedom in some junction with z = 6. In each cycle, z weights are read corresponding to 2 right neurons (shown in same color). When traced back through the interleaver π W , this requires accessing z left activations in permuted order. There are z activation memories M 0 − M 5, only 1 element from each is read in a cycle in order to preserve clash-freedom. This is shown by the checkerboards, where only 1 cell in each column is shaded. Picture taken from [18] with permission. of weights processed in parallel in 1 clock cycle (or simply cycle) by all 3 operations. So the total number of cycles to process a junction is W i /z i plus some additional cycles for memory accesses. This comprises a block cycle, the reciprocal of which is ideal throughput (inputs processed per second).
All parameters and computed values in a junction are stored in banks of z i memories. The z i weights in the kth cells of all z i weight memories are read out in the kth cycle. Additionally, up to z i activations, a-dots, deltas and biases are accessed in a cycle. The order of accessing them can be natural (row-byrow like the weights), or permuted (due to interleaving). All accesses need to be clash-free, i.e. the different values to be accessed in a cycle must all be stored in different memories so as to avoid memory stalls, as shown in Fig. 2 . Optimum clashfree interleaver designs are discussed in [18] . Fig. 3 shows simultaneous FF, BP and UP, along with memory accesses, in more detail inside a single junction.
This architecture is ideal for implementation on reconfigurable hardware due to a) its parallel and pipelined nature, b) its low memory footprint due to sparsity, and particularly c) the degree of parallelism z i parameters, which can be tuned to efficiently utilize available hardware resources, as described in Sections III-D and III-E. We implemented the architecture described in Section II-B on an Artix-7 FPGA. This is a relatively small FPGA and therefore allowed us to explore efficient design styles and optimize our RTL to make it more robust and scalable. We experimented on the MNIST dataset where each input is an image consisting of 784 pixels in 8-bit grayscale each. Each ground truth output is one-hot encoded between 0-9. Our implementation uses powers of 2 for network parameters to simplify the hardware realization. Accordingly we padded each input with 0s to make it have 1024 pixels. The outputs were padded with 0s to get 32-bit one-hot encoding. Prior to hardware implementation, software experiments showed that having extra always-0 I/O did not detract from network performance.
B. Network Configuration and Training Setup
The network has 1 hidden layer of 64 neurons, i.e. 2 junctions overall. Other parameters were chosen on the basis of hardware constraints and experimental results, which are described in Sections III-C and III-D. The final network configuration is given in Table I .
We selected 12544 MNIST inputs to comprise 1 epoch of training. Learning rate (η) is initially 2 −3 , halved after the first 2 epochs, then after every 4 epochs until its value became 2 −7 . Dynamic adjustment of η leads to better convergence, while keeping it to a power of 2 leads to the η multiplications in eq. (3) getting reduced to bit shifts. Pre-defined sparsity leads to a total number of trainable parameters = (w 1 = 4096) + (w 2 = 1024) + (b 1 = N 1 = 64) + (b 2 = N 2 = 32) = 5216, which is much less than 12544, so we theorized that overfitting was not an issue. We verified this using software simulations, and hence did not apply weight regularization. range of ±0.51 for junction 1 and ±0.61 for junction 2 in our network configuration described in Table I .
The biases in our architecture are stored along with the weights as an augmentation to the weight memory banks. So we initialized biases in the same manner as weights. Software simulations showed that this led to no degradation in performance from the conventional method of initializing biases with 0s. This makes sense since the maximum absolute value from initialization is much closer to 0 than their final values when the network converges, as shown in Fig. 4 .
To simplify the RTL, we used the same set of W i /z i unique values to initialize all weights and biases in junction i. Again, software simulations showed that this led to no degradation in performance as compared to initializing all of them randomly. This is not surprising since an appropriately high value of initial learning rate will drive each weight and bias towards its own optimum value, regardless of similar values at the start.
2) Fixed Point Configuration: We recreated the aforementioned initial conditions in software and trained our configuration to study the range of values for network variables until convergence. The results for w, b and δ are in Fig. 4 . The a values are generated using the sigmoid activation function, which has range = [0, 1].
To keep the hardware optimal, we decided on the same fixed point bit configuration for all computed values and trainable parameters -a,ȧ, δ, w and b. Our configuration is characterized by the bit triplet (b w , b n , b f ), which are respectively the total number of bits, integer bits, and fractional bits, with the constraint b w = b n + b f + 1, where the 1 is for the sign bit. This gives a numerical range of Fig. 4 shows that the maximum absolute values of various network parameters during training stays within 8. Accordingly we set b n = 3. We then experimented with different values for the bit triplet and obtained the results shown in Table II . Accuracy is measured on the last 1000 training samples. Noting the diminishing returns and impractical utilization of hardware resources for high bit widths, we chose the bit triplet (12, 3, 8) as being the optimal case.
3) Dynamic Range Reduction due to Sparsity: We found that sparsity leads to reduction in the dynamic range of network variables, since the summations in eqs. (1) and (2) = 32). Notice that the sparse case only has 17% of its values clipped due to being outside the dynamic range afforded by b n = 3, while the FC case has 57%. The sparse case also has a smaller variance. This implies that the hardware errors introduced due to finite bit-width effects are less pronounced for our pre-defined sparse configuration as compared to FC.
4) Experiments with ReLU:
As demonstrated in literature [1] - [3] , the native (ideal) ReLU activation function is more widely used than sigmoid due to the former's better performance, no vanishing gradient problem, and tendency towards generating sparse outputs. However, ideal ReLU is not practical for hardware due to its unbounded range. We experimented with a modified form of the ReLU activation function where the outputs were clipped to a) 8, which is the maximum supported by b n = 3, and b) 1, to preserve bit width consistency in the multipliers and adders and ensure compatibility with sigmoid activations. Fig. 6 shows software simulations comparing sigmoid with these cases. Note that ReLU clipped at 8 converges similar to sigmoid, but sigmoid has better initial performance. Moreover, there is no need to promote extra sparsity by using ReLU because our configuration is already sparse, and sigmoid does not suffer from vanishing gradient problems because of the small range of our inputs. We therefore concluded that sigmoid activation for all layers is the best choice.
D. Implementation Details 1) Sigmoid Activation:
The sigmoid function uses exponentials, which are computationally infeasible to obtain in hardware. So we pre-computed the values of σ(·) and σ (·) and stored them in look-up tables (LUTs). Interpolation was not used, instead we computed sigmoid for all 4096 possible 12-bit arguments up to the full 8 fractional bits of accuracy. On the other hand, its derivative values were computed to 6 fractional bits of accuracy since they have a range of [0, 2 −2 ]. Note that clipped ReLU activation uses only comparators and needs no LUTs. However, the number of sigmoid LUTs required is
, which incurs negligible hardware cost. This reinforces our decision to use sigmoid instead of ReLU.
2) Interleaver: We used clash-free interleavers of the SV+SS variation, as described in [18] . Starting vectors for all sweeps were pre-calculated and hard-coded into FPGA logic.
3) Arithmetic Units: We numbered the weights sequentially on the right side of every junction, which leads to permuted numbering on the left side due to interleaving. We chose z i ≥ d BP does not occur in the first junction since the input layer has no δ values. The BP summation in eq. (2b) will need several cycles to complete for a single left neuron since weight numbering is permuted. This necessitates storing L i=2 z i partial sums, however, tree adders are no longer required.
Eq. (2b) for BP has 2 multiplications, so the total number of multipliers required is 2 L i=2 z i . The UP operation in each junction i requires z i adders for the weights and z i /d in i adders for the biases, since that many right neurons are processed every cycle. Only the weight update requires multipliers, so their total number is L i=1 z i . Our FPGA device has 240 DSP blocks. Accordingly, we implemented the 224 FF and BP multipliers using 1 DSP for each, while the other 160 UP multipliers and all adders were implemented using logic.
4) Memories and Data: All memories were implemented using block RAM (BRAM). The memories for a andȧ never need to be read from and written into in the same cycle, so they are single-port. δ memories are true dual-port, i.e. both ports support reads and writes. This is required due to the readmodify-write nature of the δ memories since they accumulate partial sums. The 'weight+bias' memories are simple dualport, with 1 port used exclusively for reading the kth cell in cycle k, and the other for simultaneously writing the (k − 1)th cell. These were initialized using Glorot normal values while all other memories were initialized with 0s.
The ground truth one-hot encoding for all 12544 inputs were stored in a single-port BRAM, and initialized with word size = 10 to represent the 10 MNIST outputs. After reading, the word was padded with 0s to make it 32-bit long. On the other hand, the input data was too big to store on-chip. Since the native MNIST images are 28×28 = 784 pixels, the total input data size is 12544×784×8 = 78.68 Mb, while the total device BRAM capacity is only 4.86 Mb. So the input data was fed from PC using UART interface.
5) Network Configuration: Here we explain the choice of network configuration in Table I . We initially picked N 2 = 16, which is the minimum power of 2 above 10. Since later junctions need to be denser than earlier ones to optimize performance [15] , we experimented with junction 2 density and show its effects on network performance in Fig. 7 . We concluded that 50% density is optimum for junction 2. Note that individual z i values should be adjusted to have the same block cycle length for all junctions. This ensures an always full pipeline and no stalls, which can achieve the ideal throughput of 1 input per block cycle. This, along with the constraint z i ≥ d in i , ∀i ∈ {1, · · · L}, led to z 1 = 256, which was beyond the capacity of our FPGA. So we increased N 2 to 32 and set z 2 to the minimum value of 32, leading to z 1 = 128. We experimented with d = 4. 6) Timing and Results: A block cycle in our design is (W i /z i + 2) clock cycles since each set of z i weights in a junction need a total of 3 clock cycles for each operation. The first and third are used to compute memory addresses, while the second performs arithmetic computations and determines our clock frequency, which is 15MHz.
We stored the results of several training inputs and fed them out to 10 LEDs on the board, each representing an output from 0-9. The FPGA implementation performed according to RTL simulations and within 1.5 percentage points of the ideal 
E. Effects of z
A key highlight of our architecture is the total degree of parallelism L i=1 z i , which can be reconfigured to trade off training time and hardware resources while keeping the network architecture the same. This is shown in Fig. 8 . The present work uses total z = 160, which leads to a block cycle time of 2.27µs, but economically uses arithmetic resources and has a small number of deep memories, making it ideal for a fully BRAM implementation. Given more powerful FPGAs, the same architecture can be reconfigured to achieve higher GOPS count and process inputs in 0.4µs, albeit at the cost of more FPGA resources and a greater number of shallower memories. Moreover, this reconfigurability also allows a complete change in network structure and hyperparameters to process a new dataset on the same device if desired.
IV. CONCLUSION
This paper demonstrates an FPGA implementation of both training and inference of a neural network pre-defined to be sparse. The architecture is optimized for FPGA implementation and uses parallel and pipelined processing to increase throughput. The major highlights are the degrees of parallelism z i , which can be quickly reconfigured to re-allocate FPGA resources, thereby adapting any problem to any device. While the present work uses a modest FPGA board as proof-ofconcept, this reconfigurability is allowing us to explore various types of networks on bigger boards as future work. Our RTL is fully parametrized and the code available on request.
