Most neural network designs for FPGAs are inflexible. In this paper, we propose a flexible VHDL structure that allows any neural network to be implemented across multiple FPGAs. Moreover, the VHDL structure allows for both training and testing multiple neural networks. The VHDL design consists of multiple processor groups of two types: Mini Vector Machine Processor Groups and Activation Processor Groups. Each processor group consists of individual Mini Vector Machines or Activation Processors. The Mini Vector Machines apply vector operations to the data, while the Activation Processors apply activation functions to the data. A ring buffer connects the various processor groups.
This paper proposes an FPGA solution to the problems above. The solution consists of an assembler and a VHDL design. The assembler takes in neural network assembly codes and produces microcodes, which are flashed onto a cluster of FPGAs. The cluster of FPGAs executes multiple neural networks in parallel, which accelerates the training and testing phases. Furthermore, the cluster of FPGAs provides a greater memory bandwidth, overcoming the memory bandwidth limitations of individual FPGAs.
Multi-Layer Perceptrons
Let $X_i$ = data input vector of layer $i$, $W_i$ = weight matrix of layer $i$, $B_i$ = bias vector of layer $i$, $A(V)$ = activation function, and $O_i$ = output vector of layer $i$:

$O_i = A(W_i^T X_i + B_i)$    (1)

$\mathrm{ReLU}(x) = \max(0, x)$    (2)
Multi-layer perceptrons (MLPs) [6] are a type of neural network. MLPs have an input layer, multiple hidden layers, and an output layer. The input data enters layer $i$ through the input vector $X_i$. The input vector $X_i$ is multiplied by the transposed weight matrix $W_i^T$. After the matrix multiplication, the biases $B_i$ are added to $W_i^T X_i$. The result then passes through the activation function $A(V)$ and produces the layer's output $O_i$, as shown in Eqn. 1. There are many types of activation functions used in neural networks; for example, Eqn. 2 shows the ReLU activation function, which sets all negative numbers to zero. Overall, the input data passes through many layers until the final result is produced at the output layer.
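As a minimal illustration of Eqn. 1 and Eqn. 2, the following Python sketch computes the forward pass of a small MLP. The layer sizes and parameters are arbitrary placeholders, and floating point is used here even though the hardware design operates on 16-bit integers.

    import numpy as np

    def relu(x):
        # Eqn. 2: set all negative numbers to zero.
        return np.maximum(0, x)

    def mlp_forward(x, weights, biases):
        # Eqn. 1 applied layer by layer: O_i = A(W_i^T X_i + B_i).
        for W, B in zip(weights, biases):
            x = relu(W.T @ x + B)
        return x

    # Toy 2-layer MLP with arbitrary placeholder parameters.
    rng = np.random.default_rng(0)
    weights = [rng.standard_normal((4, 8)), rng.standard_normal((8, 2))]
    biases = [rng.standard_normal(8), rng.standard_normal(2)]
    print(mlp_forward(rng.standard_normal(4), weights, biases))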
Design Overview and Requirements
The goal of the project is to accelerate multiple neural networks using multiple FPGAs. The targeted FPGA boards must use Xilinx's 7 Series FPGAs, and all the FPGA boards must be identical. Moreover, the FPGA boards must have onboard flash, RAM, and system buses. Fig. 1 shows the neural network processor and assembler. The Matrix Assembler is a high-level optimizing assembler that parses as many neural network assembly codes as the user provides. After parsing the assembly codes, the Matrix Assembler optimizes the assembly codes and the neural network processors, then generates the VHDL codes and the microcodes. The VHDL codes describe the structure of the Matrix Machine. The Matrix Machine consists of multiple Mini Vector Machines, each of which computes a vector operation using a single DSP. The DSPs are set to process 16-bit signed integers; 16-bit precision is sufficient for most neural network applications. Put together, the Mini Vector Machines perform matrix operations and allow the Matrix Machine to adapt to different matrix sizes. Subsequently, the VHDL codes are synthesized into bit-streams using Xilinx's Vivado Design Suite. After generation, the bit-streams are flashed to the onboard flash, which loads the bit-stream onto the FPGA. The system buses transfer the neural network data and microcodes from the control server to the onboard RAM, which acts as a buffer for the FPGA. The microcodes schedule the execution of the Matrix Machine by coordinating the individual Mini Vector Machines.
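Since the paper fixes only the total width at 16 signed bits, the following minimal Python sketch illustrates 16-bit fixed-point quantization under an assumed Q8.8 split (8 integer bits, 8 fractional bits); the split is an illustrative assumption, not the paper's stated format.

    # 16-bit signed fixed-point quantization (assumed Q8.8 split).
    FRAC_BITS = 8
    INT16_MIN, INT16_MAX = -(1 << 15), (1 << 15) - 1

    def to_fixed(x):
        # Scale, round, and saturate to the 16-bit signed range.
        v = int(round(x * (1 << FRAC_BITS)))
        return max(INT16_MIN, min(INT16_MAX, v))

    def to_float(v):
        return v / (1 << FRAC_BITS)

    print(to_fixed(1.5), to_float(to_fixed(1.5)))  # 384 1.5
    print(to_fixed(200.0))                         # saturates at 32767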
For the functional requirements, the Matrix Machine must train and test MLPs. The Matrix Machine must calculate the forward passes of the MLPs. After the forward passes, the gradients of the loss functions must be calculated using the back-propagation algorithm, and the gradients are then used to update the weights of the MLPs. In order to be flexible, the VHDL design must be generalized to run any type of MLP. Firstly, the Matrix Assembler must handle any number of MLPs regardless of the number of FPGAs. Secondly, the Matrix Machine must handle matrices of any size and shape; the input matrices, the weight matrices, and the bias matrices can be as big as the user wants. Thirdly, the Matrix Machine must be able to dynamically load different MLPs at runtime; in other words, the Matrix Machine must switch between different MLPs without regenerating the bit-stream. Fourthly, the Matrix Machine must scale to any number of LUTs, BRAMs, and DSPs. If the Matrix Assembler detects that the FPGA has a high number of DSPs, it generates more Mini Vector Machines to take advantage of the DSPs; if it detects a low number of DSPs, it reduces the number of Mini Vector Machines. Lastly, the Matrix Machine must scale to any number of FPGAs. If the number of MLPs is greater than the capacity of a single FPGA, the MLPs must be distributed across multiple FPGAs.
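As a minimal Python sketch of this training step, assuming a single linear layer and a mean-squared-error loss (both assumptions; the paper does not fix a loss function), the gradients from back-propagation update the weights as follows.

    import numpy as np

    def train_step(W, B, x, target, lr=0.01):
        # Forward pass of one linear layer: y = W^T x + B.
        y = W.T @ x + B
        # Back-propagation for an assumed MSE loss.
        grad_y = 2 * (y - target)      # dLoss/dy
        grad_W = np.outer(x, grad_y)   # dLoss/dW
        grad_B = grad_y                # dLoss/dB
        # Gradient-descent weight update.
        return W - lr * grad_W, B - lr * grad_B

    rng = np.random.default_rng(0)
    W, B = rng.standard_normal((4, 2)), np.zeros(2)
    W, B = train_step(W, B, rng.standard_normal(4), np.array([1.0, -1.0]))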
Matrix Assembler: High Level Optimizing Assembler
The Matrix Assembler takes in neural network assembly codes and produces instructions and VHDL codes. At runtime, the instructions are decoded into microcodes; the decoding is done to reduce the size of the instruction cache. Moreover, the Matrix Assembler controls the number of processor groups and the types of processors through the VHDL codes. As a result, the Matrix Assembler is able to optimize the VHDL codes for a specific FPGA. The Matrix Assembler translates the assembly codes into the instructions listed in Table 2. Matrix multiplication is achieved by using multiple vector dot products, and matrix addition is achieved by using multiple vector additions. Fig. 2 shows the bit arrangement of the instruction architecture. The operation code controls the type of operation, while the number of iterations controls the number of loops. The operation code is applied to the processors designated by the processor select start and the processor select end. For the 32-bit version, the instructions control a maximum of 128 processor groups; for the 48-bit version, the instructions control a maximum of 1024 processor groups.
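As a hedged Python sketch of the 32-bit instruction format, the field widths below are assumptions chosen so that the 7-bit start/end selects address the stated maximum of 128 processor groups; the exact layout is given in Fig. 2.

    # Hypothetical field widths for the 32-bit instruction:
    # 4-bit opcode + 14-bit iteration count + 7-bit select start
    # + 7-bit select end = 32 bits. Only the 7-bit selects are
    # implied by the 128-processor-group maximum; the rest are
    # illustrative assumptions.
    def pack_instruction(opcode, iterations, sel_start, sel_end):
        assert 0 <= sel_start < 128 and 0 <= sel_end < 128
        word = opcode & 0xF
        word |= (iterations & 0x3FFF) << 4
        word |= (sel_start & 0x7F) << 18
        word |= (sel_end & 0x7F) << 25
        return word

    # Example: a vector operation repeated 1024 times on groups 0-63.
    print(hex(pack_instruction(opcode=2, iterations=1024, sel_start=0, sel_end=63)))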
Assembly Codes

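The assembly listing for this section is not reproduced in the source. As a purely hypothetical illustration, a single MLP layer might be written with the vector operations named in Table 2 along these lines (all mnemonics and operands are invented):

    ; Hypothetical fragment: one MLP layer on processor groups 0-7.
    VDOT  pg0, pg7, W0   ; dot products of the weight rows with X_i
    VADD  pg0, pg7, B0   ; add the bias vector B_i
    ACT   pg8, pg8, RELU ; apply ReLU in an Activation Processor group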
Microcode
The Matrix Assembler also translates the instructions to microcodes. Fig. 3 shows the 32-bit microcode. Each microcode controls 4 MVMs. The MVMs are arranged in groups of 4 because the 4:1 multiplexer is the most efficient multiplexer: it uses the least amount of LUTs and has the lowest latency. microcode(9..0) controls the number of cycles in a microcode; the cycle count allows the Matrix Assembler to execute a given microcode for any length of time. microcode(10) controls the selection of the input columns: if input column 0 is selected, the input data is written to column 0, and if input column 1 is selected, the input data is written to column 1. microcode(11) controls the activation of the input counter; if the input counter is enabled, it increments at every cycle, and its value is fed into the input addresses of the individual MVMs. Fig. 3 also defines the remaining bits, beginning with microcode(12).
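The documented fields map directly to a bit-packing routine; the following minimal Python sketch covers only microcode(11..0) as described above and leaves the remaining bits unset.

    # Pack the documented fields of the 32-bit microcode.
    # microcode(9..0) = cycle count, microcode(10) = input column
    # select, microcode(11) = input counter enable; bits 31..12
    # are not documented in this section and are left unset.
    def pack_microcode(cycles, input_column, counter_enable):
        assert 0 <= cycles < 1024
        word = cycles
        word |= (input_column & 1) << 10
        word |= (counter_enable & 1) << 11
        return word

    print(bin(pack_microcode(cycles=512, input_column=1, counter_enable=1)))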
Let $N_{ACTPRO\_PG}$ = optimal number of Activation Processor groups.

The Matrix Assembler determines the optimal number of processor groups in order to fully utilize the FPGA's resources. Eqn. 3 shows the equation for the optimal number of Mini Vector Machine processor groups $N_{MVM\_PG}$. The number of Mini Vector Machine processor groups $N_{MVM\_PG}$ is limited only by the number of DDR RAM channels $N_{DDR}$. Furthermore, Table 3 shows the resource usage of each processor group. The optimal number of Activation Processor groups $N_{ACTPRO\_PG}$ is calculated using Table 3 and Eqn. 4. Table 4 shows the ports of the MVM processor group. The group control port starts and stops the execution of the processor group. Furthermore, the microcode input port writes to the microcode cache, which is used to minimize load penalties. After the microcodes are written, they are decoded and sent to the individual processors. The microcode controls the counters, the number of cycles, and the type of operation. Each processor group has two input data ports and two output data ports. Each input port receives a 16-bit integer, and the output ports transmit 16-bit integers.
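Eqn. 3 and Eqn. 4 are not reproduced in this section. As a hedged Python sketch of the general resource-budget calculation, the per-group figures below come from the MVM processor group described in the next section, while the FPGA totals are hypothetical.

    # Resource-limited count of MVM processor groups. Per-group
    # usage is from this paper; the FPGA totals are hypothetical,
    # and the real Eqn. 3 also caps the count by the number of
    # DDR RAM channels N_DDR.
    MVM_GROUP = {"lut": 495, "ff": 1642, "dsp": 4, "bram18": 8}
    FPGA = {"lut": 48000, "ff": 96000, "dsp": 140, "bram18": 360}

    n_mvm_pg = min(FPGA[r] // MVM_GROUP[r] for r in MVM_GROUP)
    print(n_mvm_pg)  # DSPs bind here: 140 // 4 = 35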
Matrix Machine: Neural Network Processors
The structure of the MVM processor group is presented in Fig. 5. The MVM processor group consists of 4 processors joined together by 1 x 4:1 multiplexer, 1 x microcode cache, and 1 x local controller. Each MVM processor group uses 495 LUTs, 1642 FFs, 4 x DSP48E1, and 8 x RAMB18K in total. The microcode cache stores 16 microcodes. The 8-bit input counter selects the input addresses of the MVMs, allowing the MVMs to load vectors column-wise. Column-wise vector loading enables the MVMs to cache the column vectors in order to minimize load penalties. The 8-bit output counter stores vectors column-wise; the output counters are designed to mirror the input counters. The output multiplexer selects the outputs of the MVMs.
For vector addition with $N_I = 1024$ iterations, the total number of run cycles, the total number of cycles, the efficiency, and the processing rate were calculated. The processor groups are efficient: the efficiency approaches 50% for vector operations. Moreover, each processor group processes elements at a rate greater than 5000 Mb/s, which is 1/5 of the bandwidth of a 32-bit DDR2 RAM.

Mini Vector Machines

The Mini Vector Machine's purpose is to execute vector operations. Table 5 shows the Mini Vector Machine's ports. The Mini Vector Machine uses clocks of 100 MHz, 100 MHz, 300 MHz, and 500 MHz for Spartan-7, Artix-7, Kintex-7, and Virtex-7 respectively. The processor control signal is shown in Table 6. The processor control signal allows the Mini Vector Machine to run vector dot product, vector summation, vector addition, and vector subtraction. Moreover, the processor control signal manages the BRAMs' reads and writes. The Mini Vector Machine has 2 input ports and 1 output port. The input ports have input data lines and input address lines, and allow vectors to be written to the left BRAM. The output port has an output data line and an output address line, and allows vectors to be read from the right BRAM. Fig. 6 shows the structure of the Mini Vector Machine. The Mini Vector Machine consists of 1 x DSP48E1, 2 x BRAM, 2 x counters, and control logic. The control logic requires 50 LUTs and 210 FFs. Each BRAM (RAMB18E1) [7] [8] stores 1024 x 16-bit signed values. Furthermore, each BRAM has two read/write ports. The left BRAM's dual outputs are fed to the dual inputs of the DSP48E1. The DSP48E1 [9] then performs arithmetic on its inputs and outputs a 48-bit signed result, which is truncated to a 16-bit signed integer. The DSP48E1's single output is connected to the right BRAM's port 0. The right BRAM's port 0 is always set to write the DSP48E1's output, while port 1 is always set to read the right BRAM's data. Once the left BRAM is full, the Mini Vector Machine executes the vector operations. Fig. 8 shows the Mini Vector Machine's vector addition. The 1st cycle is used for the setup of the DSP48E1, the BRAMs, the read counter, and the write counter. In the 2nd cycle, the left BRAM is read using the read counter, and the read counter is incremented. In the 3rd cycle, the DSP48E1's A and B ports are fed with the left BRAM's data. The DSP48E1 is configured as a 6-stage pipeline. At the 8th cycle, the DSP48E1's P port outputs the result, and the write counter increments. In the 9th cycle, the right BRAM writes the result using the write counter.
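A behavioral Python sketch of this datapath: two 16-bit operands are added as the DSP48E1 would, and the wide result is truncated back to a 16-bit signed integer before the write to the right BRAM. The truncation semantics (keep the low 16 bits) are an assumption consistent with the description above.

    # Behavioral model of the MVM vector-addition datapath.
    def truncate16(v):
        # Keep the low 16 bits of the 48-bit DSP result and
        # reinterpret them as a signed integer.
        v &= 0xFFFF
        return v - 0x10000 if v & 0x8000 else v

    def vector_add(left_a, left_b):
        # Element-wise addition of two vectors held in the left BRAM.
        return [truncate16(a + b) for a, b in zip(left_a, left_b)]

    print(vector_add([1000, 30000], [500, 30000]))  # [1500, -5536]: overflow wraps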
Activation Processors
The Activation Processor performs bit shifts and executes the activation function. The Activation Processor's ports are similar to the Mini Vector Machine's ports shown in Table 5; the only difference is the size of the processor control signal. Table 7 shows the list of controls for the Activation Processor. In the 2nd cycle, the control logic reads the left BRAM using the read counter, and the read counter is incremented. In the 3rd cycle, the Activation Processor shifts the 2 x 16-bit integers. In the 5th cycle, the result of the activation function is retrieved. In the 6th cycle, the write counter is incremented. In the 7th cycle, the result is written to the right BRAM using the write counter.
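As a minimal Python sketch of the per-element work, assuming the shift implements fixed-point rescaling (the shift amount below is an illustrative assumption) and ReLU as the activation function:

    # Shift-then-activate, on 16-bit signed values.
    def activation_element(v, shift=8):
        v >>= shift        # arithmetic right shift rescales the value
        return max(0, v)   # ReLU: negative values become zero

    print([activation_element(v) for v in (-4096, 0, 4096)])  # [0, 0, 16]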
Performance/Cost Evaluation
The main limiting factor in the FPGAs' performance is the DDR throughput R. [12] shows the performance/cost evaluation of the FPGAs. Only the Spartan-7 and Artix-7 families were considered because they have the highest performance/cost ratios. Firstly, the FPGAs' DDR throughputs R were calculated using Eqn. 10. Secondly, the performance/cost ratios F were calculated from the costs of the FPGAs using Eqn. 11. Finally, the Spartan-7 XC7S75-2 was selected as the best FPGA because the XC7S75-2 has the highest performance/cost ratio. Moreover, a cluster of FPGAs could be built using the XC7S75-2; the cluster would outperform a standalone FPGA because the cluster has a higher number of DDR channels.
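Eqn. 10 and Eqn. 11 are not reproduced in this section, so the Python sketch below (R from the DDR interface parameters, F = R / cost) and all device figures are illustrative assumptions rather than the paper's actual numbers.

    # Hypothetical performance/cost calculation.
    def ddr_throughput_mbs(channels, width_bits, transfer_rate_mts):
        # Assumed form of Eqn. 10: aggregate DDR throughput in Mb/s.
        return channels * width_bits * transfer_rate_mts

    def perf_per_cost(throughput_mbs, cost_usd):
        # Assumed form of Eqn. 11: performance/cost ratio F.
        return throughput_mbs / cost_usd

    r = ddr_throughput_mbs(channels=1, width_bits=16, transfer_rate_mts=800)
    print(perf_per_cost(r, cost_usd=50.0))  # hypothetical XC7S75-2 figures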
Conclusion
Neural networks have proven to be extremely useful. However, neural networks require a large amount of computational power and a large memory bandwidth to load the data. FPGAs were selected to solve these problems because FPGAs have a high memory bandwidth/cost ratio. The Spartan-7 XC7S75-2 was selected because the XC7S75-2 has the best bandwidth/cost ratio of Xilinx's 7 Series FPGAs. Moreover, the Matrix Assembler was implemented to optimize the design of the Matrix Machine. The Matrix Assembler takes in neural network assembly codes and produces microcodes and VHDL codes. The VHDL codes form the structure of the Matrix Machine. The Matrix Machine has multiple Mini Vector Machines that execute vector operations, which accelerate the neural networks. Furthermore, the microcodes schedule the executions of the Mini Vector Machines, allowing the FPGAs to switch neural networks without reloading the bit-stream.
