ABSTRACT New chips for machine learning applications keep appearing, but they are tuned for a specific topology and achieve efficiency through highly parallel designs at the cost of high power consumption or large, complex devices. However, the computational demands of deep neural networks require flexible and efficient hardware architectures able to fit different applications, neural network types, and numbers of inputs, outputs, layers, and units in each layer, making the migration from software to hardware easy. This paper describes novel hardware implementing any feedforward neural network (FFNN): multilayer perceptron, autoencoder, and logistic regression. The architecture admits an arbitrary number of inputs, outputs, layers, and units per layer. The hardware combines matrix algebra concepts with serial-parallel computation. It is based on a systolic ring of neural processing elements (NPE), only requiring as many NPEs as neuron units in the largest layer, no matter the number of layers. The use of resources grows linearly with the number of NPEs. This versatile architecture serves as an accelerator in real-time applications and its size does not affect the system clock frequency. Unlike most approaches, it requires a single activation function block (AFB) for the whole FFNN. Performance, resource usage, and accuracy are evaluated for several network topologies and activation functions. The architecture reaches a 550 MHz clock speed in a Virtex-7 FPGA. The proposed implementation uses 18-bit fixed-point arithmetic, achieving classification performance similar to a floating-point approach. A reduced weight bit size does not affect the accuracy, allowing more weights in the same memory. Different FFNNs for the Iris and MNIST datasets were evaluated and, for a real-time application of abnormal cardiac detection, a ×256 acceleration was achieved. The proposed architecture can perform up to 1980 Giga operations per second (GOPS), implementing multilayer FFNNs of up to 3600 neurons per layer in a single chip. The architecture can be extended to larger-capacity devices or to multiple chips by simply extending the NPE ring.
I. INTRODUCTION
Feed-Forward Neural Networks (FFNN), in their different variations, are among the most used machine learning algorithms, with numerous applications typically running on PC-based software systems. However, when real-time applications require fast processing, prediction, decision or classification, a PC-based system might not provide enough throughput. This situation is becoming very common since FFNN sizes are growing due to the complexity of the problems to be solved and big data applications, with an increasing number of inputs, neuron units, and layers. Moreover, power consumption and computational speed are important issues; CPUs and GPUs can process data at high speed, but their use of power and resources is higher than that of FPGAs and other custom embedded hardware platforms [1].
The associate editor coordinating the review of this manuscript and approving it for publication was Rajeeb Dey.
The computational resources and internal architecture possibilities of FPGA devices differ from classic Von Neumann PCs or even SIMD processing units such as GPUs, CPUs or DSPs. FPGAs are best suited to massively parallel arrays of relatively simple processing units, rather than large universal computational blocks. This is the case of FFNNs, which are composed of parallel inputs, parallel outputs and multiple neuron units arranged in layers. Thus, the FPGA is a good candidate for use as an independent device, receiving inputs directly from the process, computing them and sending the output back to the real process. FPGA devices are one of the best options for the hardware implementation of FFNNs in particular and artificial intelligence algorithms in general, since the required computations are based on sums of products, which fit very well into the FPGA internal slices (logic blocks, arithmetic units and RAM). Thus, FPGA devices allow the parallelization of neural networks through the concurrent computation of multiple units, which can be massively interconnected and reconfigured with different weights and topologies depending on the target application. In addition, data representation can be tuned according to precision and accuracy requirements, as in [2]. Different applications can be found, as in the case of [3], where a fault tolerant Hopfield Neural Network was implemented in FPGA for space applications, or in [4], where a weightless neural network with Multivalued Probabilistic Logic Nodes (M-PLN) was implemented and evaluated. Other similar practical FPGA implementations of neural networks can be found, as in MPPT controllers for solar charging applications [5], or in Software Defined Radio (SDR) modulation [6].
Alternatively, an FPGA accelerator can be connected to a PC in a Hardware In the Loop System (HILS), where input and output data are exchanged with the PC and the dedicated hardware guarantees a fixed processing time [7], independent of the load on the host PC.
Specific hardware implementation of artificial neural networks can be beneficial to speed up both training and online processes, as in [8], where the backpropagation learning algorithm was implemented, in [9], where a neural fuzzy chip with on-chip incremental learning ability was described, or in [10], where a fully pipelined acceleration architecture is designed aiming to alleviate the high computational demands of Restricted Boltzmann Machines (RBM). Further, the inclusion of artificial intelligence algorithms in embedded systems, targeting real-time applications, is common, as in [11], where an optimized streaming method for the hardware acceleration of deep convolutional neural networks is shown, or in [12], where the acceleration of Support Vector Machines (SVM) through a hybrid processing hardware architecture (optimized for object detection) was proposed. In addition, some other applications include modeling, such as the digital implementation of a modified astrocyte model [13].
The process of generating specific hardware from a versatile architecture can be tedious. To assist inexperienced users in the hardware implementation of neural networks, some works propose neural network software design tools with user-friendly graphical interfaces, where the hardware configuration files are automatically generated according to the user options. This is the case in [14], where a complete design environment for migrating neural networks from software to FPGA hardware, including network training, was described, or in [15], which describes an end-user design environment where any FFNN can be modeled, simulated, and later programmed on an FPGA.
Concerning hardware topologies, a straightforward neural unit architecture might use separate hardware entities to perform each input-by-weight multiplication, and a parallel adder to add the products. Such a design would be a fine-grained architecture. However, it is impractical due to its very high hardware requirements in occupation and interconnection lines, which lead to high power and resource usage along with low-speed architectures [16]. A different approach might use neuron units with serial processing, which is more practical because every unit requires just one Multiply-And-Accumulate (MAC) block, time-multiplexing data into the same units [17]. However, whether computation is serial or parallel, all fine-grained architectures that directly implement the neural units suffer from the connectionist problem: the number of interconnection synapses grows exponentially with the number of units in the FFNN, consuming a significant part of the resources and reducing the operating frequency due to long line delays. Thus, with every new unit in a fully-connected or feedforward network, the topology of connections becomes more complex and the synthesis software tries to create connections using logic cells instead of connection lines, using logic resources inefficiently and adding delay and power consumption to the device. Consequently, a fine-grained architecture can only be used for small networks, which limits its range of applications.
When implementing large networks, a more promising approach is offered by a coarse-grained architecture where a small number of processing elements perform time-multiplexed serial computation of the network units. In this case, performance is a trade-off between processing node complexity and working speed: the simpler the processing node, the faster it runs, but the more clock cycles it requires. In this approach, the hardware implementation benefits from short point-to-point data lines and pipelined uniform operations, obtaining higher clock speed and lower resource usage at the expense of higher latency. As an example, in [18], the biggest FFNN layer was implemented and reused.
In this work, an unusual approach was followed to design the proposed architecture. Typically, custom computing architectures are defined according to the required algorithm calculations: after defining the control and data flow, the hardware architect makes use of the required hardware blocks, whose availability, resource occupation and performance might depend on the device used. However, in this work, the systolic architecture design took the available FPGA resources as a premise: the existing FPGA on-chip resources were analyzed first and the computational process was then proposed. By doing this, an optimal use of resources is obtained, using short lines, reducing the use of logic resources, reducing local interconnections between blocks (avoiding delays due to long internal connections), increasing the clock rate and, as a side effect, reducing power consumption too. Thus, considering typical FPGA resources, the hardware architecture was defined: it is a versatile and universal Systolic Massive Parallel Architecture (SYMPA) for feedforward neural networks, based on computationally independent Neural Processing Elements (NPE) having local weight memory, global data input, and command lines. The resulting hardware structure is a combination of fine-grained and coarse-grained with parallel input processing (all neuron units of an FFNN layer process the same input at the same time) and time-multiplexed input (a new input every clock cycle). The SYMPA architecture allows the implementation of FFNNs of arbitrary size and type. In addition to describing the computational procedure and its hardware architecture, an analysis of the proposed FPGA-based implementation is conducted using different topologies, to assess the level of optimization achieved and the weak points of the proposed implementation. The main contributions of this paper are as follows:
• Proposal of an architecture able to adopt any type of FFNN. Thus, it can be used to solve different applications using FFNN such as the Multilayer Perceptron (MLP), autoencoder (AE), or logistic regression (LR).
• The proposed architecture can scale up to arbitrary size (inputs, units and layers), only limited by the available resources. Scaling up has no penalty on the operation clock frequency. It provides linear resource growth with the number of neuron units in the largest layer of the FFNN.
• A single Activation Function Block (AFB) is required for the whole FFNN (not one per neuron unit, as usual), making it easy to modify this block since different applications may require different AFBs. Moreover, each FFNN layer may use a different AFB if more than one is defined.
• The proposed architecture achieves an operating frequency of up to 550 MHz, with the AFB being the limiting block, which thus requires a careful design or selection of the activation function.
• The architecture provides great versatility: the output of intermediate layers can be available externally and the weights of the neural network can be modified during execution without device reprogramming.
• In a Virtex-7 FPGA implementation, the SYMPA architecture achieves an acceleration of up to ×256 with respect to the PC implementation and can perform up to 1980 GOPS when using 3600 neuron units per layer.
Although some works demonstrate the feasibility of on-chip learning [19], [20], embedded learning is not considered in this work since weights are generally calculated using off-line procedures (backpropagation, ELM, etc.). Once calculated, the weight values are loaded into the FPGA internal memory.
Section II describes the types of FFNNs implemented in this work, followed by section III where the algorithm used for implementation is described. Section IV details the hardware implementation; section V describes the experimental results; section VI uses a real case application detecting anomalous heart rhythms to conduct a platform comparison, and, finally, sections VII and VIII deal with discussion and conclusions of the work.
II. MULTILAYER PERCEPTRON (MLP), AUTOENCODER (AE) AND LOGISTIC REGRESSION (LR) AS FFNN TYPES
The main characteristics of any FFNN are the basic computation neuron units and the topology in which they are arranged. Structured in layers, each unit of a layer is connected to all units in the next layer, never in a cyclic or recurrent form, imposing an ever forward flow of information. Each unit $j$ in a layer $i$ performs a sum of products over all the outputs of the previous layer, $Y_{i-1}$, or the input layer, $X$, with a weight value $W^{i}_{jk}$ for each connection (synaptic weight), and adds a bias value $b_{ij}$. The result serves as the input to an activation function $F$, which generates the final output of the unit. Eq. 1 shows the required computation for a single unit $j$ in a layer $i$ with $N$ inputs; the same operation must be repeated for all units in an FFNN. The activation function is non-linear, typically a sigmoid, although different approximations simplifying calculations exist [21].
$$Y_{ij} = F\left(\sum_{k=1}^{N} W^{i}_{jk}\, Y_{(i-1)k} + b_{ij}\right) \tag{1}$$
Fig. 1 shows the most common structure for an FFNN, the Multilayer Perceptron (MLP). In an MLP, units are arranged in layers and forward interconnections exist between inputs, layers and outputs.
Another type of FFNN is the autoencoder (AE), which aims to learn a compressed, distributed representation (encoding) of a dataset. An autoencoder network is a type of neural network whose main focus is to extract features that help to reconstruct the original input signal efficiently from those features. Autoencoders can be stacked (stacked autoencoders) [22], [23] to form deep networks and were first introduced in the 1980s by Hinton [24]. The simplest form of an autoencoder is very similar to an MLP [25], with an input layer, an output layer and one or more hidden layers. An autoencoder can be seen as a type of MLP where the output layer has the same number of nodes as the input layer and, instead of being trained to predict the target value $Y$ of given inputs $X$, autoencoders are trained to reconstruct their own inputs $X$ (Fig. 2). Therefore, autoencoders are unsupervised learning models.
Autoencoder networks have shown excellent properties for feature extraction, data compression or visualization [26], [27]. Autoencoders are popular [28]-[31] as they are used as a pre-training mechanism for deep supervised networks. Training deep neural networks is difficult for several reasons: the magnitudes of the gradients in the lower and higher layers differ, stochastic gradient descent struggles to find a good local optimum, and, since deep networks contain many parameters, they can memorize the training data and generalize poorly. With AEs used for pre-training, training a deep network is divided into several steps: training a sequence of autoencoders one layer at a time using unsupervised data, training the last layer using supervised data, and using backpropagation to fine-tune the entire network using supervised data.
Logistic Regression (LR) can also be considered a special architecture of neural networks [32] . The functional forms for logistic regression and artificial neural network models are quite different. However, an FFNN with only an output layer is identical to a logistic regression model if the logistic activation function is used (Fig. 3) . Logistic regression is widely applied in medical applications [32] , [33] , topography [34] or social networks [35] .
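The equivalence between logistic regression and an FFNN consisting only of an output layer can be checked with a minimal sketch. The function names and the numeric values below are illustrative assumptions, not taken from the paper.

```python
import math

def logistic(s):
    """Logistic (sigmoid) activation function."""
    return 1.0 / (1.0 + math.exp(-s))

def ffnn_output_unit(x, weights, bias):
    """Single FFNN output unit: bias plus weighted sum of inputs, then activation."""
    s = bias + sum(w * xi for w, xi in zip(weights, x))
    return logistic(s)

def logistic_regression(x, coefficients, intercept):
    """Textbook logistic regression: p = 1 / (1 + exp(-(b0 + b . x)))."""
    z = intercept + sum(c * xi for c, xi in zip(coefficients, x))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical input and parameter values, chosen only for illustration.
x = [0.5, -1.0, 2.0]
w = [0.8, 0.1, -0.4]
b = 0.25

p_ffnn = ffnn_output_unit(x, w, b)
p_lr = logistic_regression(x, w, b)
```

With the same weights and bias, both functions return the same probability, which is why a hardware architecture for FFNNs can also serve logistic regression models.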
Additionally, other different variations of FFNN propose the use of a different activation function for each layer, or obtaining the output layer values without using any activation function, or using a SOFTMAX activation function for multiclass classification [36] . Note that, to obtain a more versatile architecture, it is important to be able to configure the activation function that calculates the output value in each layer.
III. ALGORITHM PROPOSED FOR THE GENERAL FFNN COMPUTATION
Apart from the number of inputs, outputs, hidden units and layers, which may vary according to the application, the topologies of the aforementioned types of FFNN differ only in some interconnection schemes. This fact makes it possible to propose a versatile hardware architecture in which the computation procedure is defined independently of the number of inputs, layers and neuron units in each layer.
All FFNNs share the following properties:
1) No connections exist among the units in the same layer.
2) The output of a layer is a function of the previous layer inputs and a bias.
Eq. 3 shows the computation for the $N_1$ units in the first layer, which results in $Y_1$.
To adapt the data computation flow to a regular form, the bias values of layer $i$ are renamed as $w^{i}_{0} = b_{i}$, and every output vector $Y_{i-1}$ from a layer is expanded by one element equal to 1 before entering the next layer (the expanded vector is still named $Y_{i-1}$ and has $N_{i-1} + 1$ elements). Additionally, the bias column vector is concatenated with the weight matrix to create a layer matrix $W_i$, so that the product $W_i \times Y_{i-1} = S_i$ provides all the sums of products corresponding to each unit in layer $i$. Repeating the process for all layers ($i = 1, \ldots, N_L$), it can be written as in Eq. 4. Finally, to obtain the output value of each FFNN unit, the activation function $F$ must be applied to each element $S_{ij}$ (Eq. 5).
VOLUME 7, 2019
Thus, Eqs. 4 and 5 show that computing the output of an FFNN requires two main operations: vector by matrix multiplication and activation function computation.
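The two operations can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the bias is placed in column 0 of each layer matrix and the layer input is expanded with a leading 1, following the description in the text; the function names and the 2×2×1 example network are hypothetical.

```python
import math

def sigmoid(s):
    """Example activation function F (any other function could be passed)."""
    return 1.0 / (1.0 + math.exp(-s))

def layer_sums(layer_matrix, y_prev):
    """S_i = W_i x Y_{i-1}: column 0 of W_i holds the biases and Y_{i-1}
    is expanded with one leading element equal to 1."""
    y_ext = [1.0] + list(y_prev)
    return [sum(w * y for w, y in zip(row, y_ext)) for row in layer_matrix]

def forward(layer_matrices, x, activation=sigmoid):
    """Repeat 'matrix by vector product, then activation' for every layer."""
    y = list(x)
    for w_i in layer_matrices:
        s_i = layer_sums(w_i, y)             # sums of products of layer i
        y = [activation(s) for s in s_i]     # activation applied element-wise
    return y

# Hypothetical 2x2x1 network: each row is [bias, w_1, w_2] of one unit.
W1 = [[0.0, 1.0, -1.0],
      [0.5, 0.5, 0.5]]
W2 = [[-1.0, 2.0, 2.0]]
out = forward([W1, W2], [1.0, 1.0])
```

Because the bias is folded into the layer matrix, every layer is computed by exactly the same pair of operations, which is what allows a single hardware structure to be reused for all layers.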
In general, the matrix by vector multiplication $B \times A$ of a matrix $B$ (size $M \times N$) by a column vector $A$ (size $N$) can be parallelized in two forms:
• Multiplying in parallel all elements of $A$ by a matrix row vector $B_i$. This option requires $N$ multipliers and the process must be repeated for all $M$ rows of the matrix. All partial sums of a single unit are computed in one cycle, so the sums of all units are available after $M$ clock cycles.
• Multiplying in parallel a single element of the input vector $A$ by each element of a matrix column $B_j$. This option requires $M$ multipliers and computes one partial weighted sum of all units in a layer in the same clock cycle. Thus, after $N$ clock cycles, all partial sums are obtained.
These two possibilities differ from the hardware architecture perspective. To obtain all partial sums of an FFNN unit, the first option requires the simultaneous multiplication of different pairs of data $A_k \cdot B_{ik}$, whereas the second requires the simultaneous multiplication of one fixed argument $A_i$ by all elements $B_{ki}$ of a column, i.e. the problem is reduced to a scalar-by-vector computation since, in each clock cycle, one operand is the same for all multiplications.
The second method was selected for the algorithm in this work because it requires $N - 1$ fewer data paths in hardware and the input vector $A$ can be fed into the system element-wise instead of loading all elements at the same time. This feature is very useful when the FFNN works with a real-time data stream, i.e. when a new input value enters the FFNN every clock cycle. To illustrate this feature, Eq. 6 shows a computation example for an $N_0$-input FFNN with three units in a single hidden layer, where $X$ is the input vector and $W_1$ is the weight matrix in which each row holds the weights of one unit and the first column contains the bias value of each unit. After computation ($N_0 + 1$ clock cycles), the resulting vector $S_1$ contains the sum of products for each unit in the layer, and only the activation function evaluation is required to obtain the final output value of each unit (not included in this equation). Using as many multipliers as units in the layer working in parallel ($N_1 = 3$ in this case, $N_i$ with $i = 1, \ldots, N_L$ in general), the partial sums for all units are computed simultaneously for each new input $X_k$, every clock cycle (assuming that a multiplication takes one clock cycle). The bold elements marked in Eq. 6 illustrate the partial sum obtained for each unit in a single clock cycle when using three multipliers to process the input $X_1$. Using three arithmetic accumulators ($N_i$ in general), the complete sums of products for all neuron units in the layer are obtained simultaneously after $N_0 + 1$ clock cycles.
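The difference between the two parallelization options can be illustrated with a small behavioural sketch. It is a software model only: the inner loop of the streamed version stands for multipliers that would operate concurrently in hardware, and the function names and numeric values are assumptions made for the example.

```python
def matvec_row_parallel(B, A):
    """Option 1: each 'cycle' computes the complete dot product of one row
    (N multipliers working on N different operand pairs); M cycles in total."""
    return [sum(b * a for b, a in zip(row, A)) for row in B]

def matvec_column_streamed(B, A):
    """Option 2: each 'cycle' broadcasts one element A[k] to M multipliers,
    and M accumulators each add one partial product; N cycles in total."""
    M = len(B)
    acc = [0.0] * M
    cycles = 0
    for k, a_k in enumerate(A):      # one input element per clock cycle
        for j in range(M):           # concurrent in hardware, one per unit
            acc[j] += B[j][k] * a_k
        cycles += 1
    return acc, cycles

# Hypothetical layer with three units and N0 = 2 inputs.
# Column 0 holds the biases, so the input is expanded with a leading 1.
W1 = [[0.1, 1.0, 2.0],
      [0.2, 3.0, 4.0],
      [0.3, 5.0, 6.0]]
X = [1.0, 0.5, -1.0]                 # [1, X_1, X_2]

S_stream, n_cycles = matvec_column_streamed(W1, X)
S_row = matvec_row_parallel(W1, X)
```

Both functions return the same sums; the streamed version needs `len(X)` cycles (here $N_0 + 1 = 3$) but consumes only one input value per cycle, which matches a real-time data stream.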
Once the output values of the units in a layer are obtained, they are used as inputs for the next layer. The layer structure and computation scheme described above can be repeated for all layers using the same computation elements, i.e. the same hardware architecture. Fig. 4 graphically shows the proposed layer-wise parallel feedforward architecture for an FFNN. Initially, the inputs of a layer are serially processed by the layer computation blocks, feeding one input each clock cycle as described above. After obtaining the results of the units in one layer, its layer output values $S_i$ ($i = 1, \ldots, L$) are stored in a memory (inter-layer memory) and the same hardware can be iteratively reused to compute all layers. It is important to note that the stored values $S_i$ correspond to the sum of products result without activation function evaluation. Only before entering the next layer (or producing the final output) are they evaluated by the activation function. In other words, the layer output values are stored before being evaluated by the activation function, and they are only evaluated when they are needed. Applying this mechanism, together with the serial input processing, a single activation function block can serve the whole FFNN structure since only one value per clock cycle is evaluated, no matter the FFNN size.
In Fig. 4, the input vector $X$ is serially entered and processed (the orange arrow denotes sequentiality) by the array of multiply-accumulators to calculate the weighted sums of products. As the layer output values $S$ are generated, they enter the activation function block, generating $Y$ as the output of the layer. When one layer is finished, the computation is repeated for the next layer.
Since every unit in a layer of an FFNN is connected to all units in the next layer, layer-wise parallelism is very convenient: every layer output value depends on the output values of the previous layer and the dataflow always goes in one direction.
As the number of units varies from layer to layer, the number of required computational blocks and the associated control of computations change from one layer to another. So that the hardware architecture can fit all layers, the number of hardware processing units $N$ is given by the highest number of units in a layer, considering all layers in the FFNN ($N = \max(N_1, \ldots, N_L)$). Then, by careful design of the control flow, only the required number of computational units is used in each layer computation. This architecture allows all units in a layer to be processed in parallel, with inputs serially processed in a pipelined fashion.
Taking into account all the aforementioned factors, the proposed hardware implementation of this versatile and universal SYstolic Massive Parallel Architecture (SYMPA) for FFNN is based on computationally independent Neural Processing Elements (NPE) having a Multiply-Accumulate (MAC) unit, local weight memory, global data input, and command lines. To implement the inter-layer communication memory, all NPEs also include a scratchpad register connected in a daisy-chain, forming a scratchpad ring SR. In this architecture, all control signals but one are global and, thus, the system has excellent scalability, with a linear dependency between the size of the network layer and hardware occupation. The hardware implementation benefits from short point-to-point data lines and pipelined uniform operations. An additional feature of this architecture is the external availability of intermediate result values after each layer computation, which can be used for network training algorithms or for debugging purposes.
Algorithm 1 describes the sequence of operations required when computing an FFNN with $L$ layers. The loops involving the $k$ index are those performed in parallel by the NPEs, the loops over the $j$ index are serially computed, and the loops over the $i$ index reuse the hardware computation architecture. To demonstrate the scratchpad daisy-chaining, it is presented as a vector (SR), common for all units. The activation function is denoted as AFB.
Algorithm 1 Computation Process for a FFNN Network
L → Number of layers
N_i → Number of NPEs used for layer
...
Flush results into the SR
Processing layers i = {2, ..., L}
for i = 2 to L do (layers)
for j = 1 to N_i do (units)
...
The proposed algorithm consists of three main parts: accepting the network inputs, forward propagation of the signal through the layers in the network, and obtaining the outputs of the final layer. In the first part, the input data are taken externally from the FFNN inputs, which are sequentially introduced and concurrently multiplied by the corresponding weight $w^{1}_{jk}$ of each unit $j$ in each NPE. The multiplication result of each NPE is added to the accumulator register by a MAC operation. As the bias value is stored in weight index zero ($w^{i}_{j0}$) for each unit, it is loaded into the MAC unit of each NPE in the first clock cycle, before the inputs enter the accumulator. After $N_0 + 1$ cycles, where $N_0$ is the number of network inputs, the accumulator of each unit, Acc(j), contains the sum of products of inputs by weights. Then, the computation of the first layer is done and the values of the accumulators are stored in the scratchpad ring SR. Now, the NPEs are ready to compute the next layer. Note that, for layers with fewer units than $N$, not all the NPEs of the hardware structure are used.
FIGURE 5. Hardware structure of a 3×3×2 MLP neural network. NPEs are created and connected according to the number of units in the biggest layer. In this case, three NPEs are required; the blue dashed line shows one NPE with its BRAM memory, ALU, scratchpad register (SR) and all existing data and control lines. From top to bottom, starting with the first layer, each NPE memory contains the unit weights. As only two output units exist, the bottom half of the left-most NPE memory is empty since it stores the weights of only one unit. A single activation function block (AFB) is used for the whole FFNN.
The scratchpad ring SR is a serially connected line of registers, similar to a shift register with a parallel load. Once latched, data can be shifted out serially while the NPEs compute a new sum of products. The SR contains the computed sum-of-products values, which still have to be shifted through the AFB to calculate the final output values.
As can be seen in Algorithm 1, the input data are serially fed into the NPEs to compute the first layer; for the remaining layers, the layer input data are those obtained from the output of the previous layer. From the hardware perspective, the proposed architecture consists of a single layer of concurrent NPE units with a common input, where each layer is computed using the weights of the corresponding layer and only the number of units required for that layer.
After computing the sum of products of the units in the output layer, the network output can be accessed externally by serially reading the AFB output port, which finishes the FFNN computation. Since the AFB output port is externally accessible and intermediate unit output values go through this module, it is also possible to read internal unit output values.
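The three parts of the computation process described above can be summarized in a behavioural Python sketch. It is not the authors' Algorithm 1 or their hardware description: it is a sequential software model written from the prose, the names `sympa_forward`, `acc`, `sr` and `afb` are illustrative, and the overlap between SR shifting and the next layer computation is not modelled.

```python
import math

def sigmoid(s):
    """Example activation function used as the single AFB."""
    return 1.0 / (1.0 + math.exp(-s))

def sympa_forward(layer_matrices, x, afb=sigmoid):
    """Layer-wise model: NPEs accumulate concurrently, accumulators are latched
    into the scratchpad ring (SR), and SR values pass one by one through a
    single activation function block (AFB)."""
    n_npe = max(len(w_i) for w_i in layer_matrices)  # units in the largest layer
    layer_input = list(x)            # part 1: external network inputs
    sr = []
    for w_i in layer_matrices:       # part 2: same hardware reused per layer
        n_units = len(w_i)
        assert n_units <= n_npe      # smaller layers leave some NPEs unused
        # First cycle: each NPE loads its bias (weight index 0) into Acc.
        acc = [row[0] for row in w_i]
        # Following cycles: one input value is broadcast to all NPEs (MAC).
        for k, value in enumerate(layer_input, start=1):
            for j in range(n_units):             # concurrent in hardware
                acc[j] += w_i[j][k] * value
        sr = list(acc)               # parallel load of accumulators into the SR
        # SR contents shifted serially through the AFB; results feed next layer.
        layer_input = [afb(s) for s in sr]
    return layer_input, sr           # part 3: outputs of the final layer

# Hypothetical 2x2x1 network: each row is [bias, w_1, w_2] of one unit.
W1 = [[0.0, 1.0, -1.0],
      [0.5, 0.5, 0.5]]
W2 = [[-1.0, 2.0, 2.0]]
outputs, last_sums = sympa_forward([W1, W2], [1.0, 1.0])
```

In this model the activation function is evaluated on one value at a time, after the sums leave the SR, which is the property that lets a single AFB serve all units and all layers.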
IV. HARDWARE ARCHITECTURE
When developing the computation algorithm in the previous section, specific hardware blocks existing in FPGA were considered beforehand: distributed memories, arithmetic units, logic and different types of interconnections. By doing this, we guarantee that the proposed implementation uses standard blocks with the aim of obtaining an optimal and efficient implementation on existing commercial hardware. This favors the portability to any FPGA, regardless of the manufacturer, or even a VLSI device.
The proposed architecture, shown in Fig. 5, is a generic architecture that can be arbitrarily extended in the number of layers and units per layer as long as hardware resources are available. In fact, since layers reuse the hardware, the main limiting factors are the number of units in the biggest layer and the number of inputs. This is especially beneficial for deep multi-layer neural networks. Each NPE acts as a neuron unit of the FFNN in each layer computation. At most, one NPE is reused as many times as there are layers in the FFNN. The weight values of all the units computed by an NPE are stored in its internal local memory, i.e. each NPE contains the weight values of at most one unit per layer. When computing layers with fewer units than the biggest layer, some NPEs are unused and the corresponding local memory is not completely filled. The NPE architecture is designed to accumulate the partial sum of products of the current unit in the ALU accumulator register and, when finished, the resulting value is moved into the corresponding SR register so that the values can be shifted through the SR daisy-chain into the Activation Function Block (AFB). Simultaneously, the NPEs can compute the next layer values as the AFB output values are fed back to them, or the AFB provides the network output values in the case of the output layer. The order of operations is defined by a Finite State Machine (FSM) block. The FSM controls the weight loading into memories, the addressing of the local NPE memory depending on the layer being computed, the NPE usage depending on the layer, and the SR latching.
An important feature of this architecture is its modular structure and the minimization of connections: adding more NPEs gives linear growth in hardware occupation, with most of the connections being internal to the NPE. The only external connections of an NPE are the addressing, the memory write enable (we), and the SR register connections to the previous and next SR. As seen in Fig. 5, the units must be placed in the NPEs with alignment to the right, i.e. the right-most NPE must contain the last unit of each layer and the rest of the units are arranged in NPEs from right to left. As not all layers contain the same number of units, the left-most NPEs are not used for certain layers. Fig. 5 shows the placement for a 3×3×2 FFNN, where the left-most NPE only contains weights from the hidden layer since the output layer contains two units. Each NPE memory stores the weight values of at least one unit of some layer. When a local NPE memory contains weights for units from several layers, adding an offset to the NPE memory address is the only modification required to reuse the hardware architecture for different layers. Data are serially input to the system through the DIN port, and the control system allows weights (including bias values) to be updated through the same data port at any time. This feature is very useful when weight values need to be modified without device reprogramming once the hardware system is running.
The total number of weights in an FFNN is given by Eq. 7a, which corresponds to the memory size required for weights in the whole design. However, since weights are distributed over different memories, as many distributed memory blocks as NPEs are required. The size of each distributed memory block is the key factor for optimizing the resources in the NPE implementation. The maximum distributed memory size required for each NPE is given by Eq. 7b. Although some NPEs may require less memory, all NPEs are sized according to the value given in Eq. 7b in order to maintain a regular hardware structure, which eases the memory addressing.
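As a rough illustration of these two quantities, the following Python sketch counts the stored words under the assumption that every unit stores one bias word plus one weight per input of its layer. The function names are hypothetical, and the expressions are our reading of the text rather than a transcription of Eq. 7a and Eq. 7b.

```python
def total_weights(layer_sizes):
    """Words stored in the whole design (weights plus one bias per unit)."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))


def max_words_per_npe(layer_sizes):
    """Words needed by an NPE that hosts one unit in every layer.

    The right-most NPE is always in this situation, so this is the
    memory depth assumed for every NPE.
    """
    return sum(n_in + 1 for n_in in layer_sizes[:-1])


print(total_weights([4, 10, 3]))                 # 83
print(max_words_per_npe([784, 196, 784]))        # 982
print(max_words_per_npe([784, 600, 600, 10]))    # 1987
```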
For such a versatile architecture, it becomes very important to be able to customize the FFNN implementation according to parameters which can be easily modified to generate the hardware definition of any FFNN. Thus, a set of configuration parameters is defined. Using these parameters, the synthesized FFNN hardware will be obtained. The required information to properly generate the FFNN structure is the following:
• Number of NPEs. This is the maximum number of units in any layer of the FFNN, i.e. NPE# = max(N_i) with i = {1, ..., L}.
• Memory size per NPE. This is the maximum memory size required by any NPE in the FFNN, as indicated by Eq. 7b.
• Weight bit size. It is important for the estimation of memory requirements and must be determined according to the required accuracy.
• Fractional part bit size. Together with the total weight size, this value must be chosen according to the accuracy requirements.
• Number of layers L.
• Number of inputs N_0.
• Number of units in each hidden layer N_i (i = {1, ..., L − 1}).
• Number of outputs N_L.
A. NEURAL PROCESSING ELEMENT (NPE) DESCRIPTION
As shown in Fig. 5 , a single NPE consists of three blocks: a RAM block with single-cycle read/write access for weight storage, an ALU, and a scratchpad register SR.
Giving each NPE its own distributed RAM block allows for concurrent NPE operation. The RAM block of the i-th NPE must contain several weight banks, one for the i-th unit of each layer in the FFNN. In order to estimate the final RAM block size for the NPEs, the bit size of the data is also required: reducing the weight data size reduces the memory size. This issue is discussed in detail in Section V-D.
The ALU required for the arithmetic computations must perform the following operations: P = 0, P = A * B, P = A * B + P, P = A, where A and B are N-bit input data and P is the accumulator where resulting data are stored, also serving as data output of the ALU. The P = A operation serves as an ALU bypass from DIN to the memory block required in a weight update operation.
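The four ALU operations can be captured in a small behavioural model. The sketch below is illustrative only: the class `ALU`, the enumeration `Op` and its numeric opcode values are hypothetical, and pipeline latency and word-length limits are ignored.

```python
from enum import Enum


class Op(Enum):
    CLEAR = 0    # P = 0
    MUL = 1      # P = A * B
    MAC = 2      # P = A * B + P
    BYPASS = 3   # P = A (used to route DIN to the memory for weight updates)


class ALU:
    """Behavioural model of the NPE accumulator (no pipelining, no saturation)."""

    def __init__(self):
        self.p = 0

    def step(self, op, a=0, b=0):
        if op is Op.CLEAR:
            self.p = 0
        elif op is Op.MUL:
            self.p = a * b
        elif op is Op.MAC:
            self.p = a * b + self.p
        elif op is Op.BYPASS:
            self.p = a
        else:
            raise ValueError(f"unknown opcode: {op}")
        return self.p


alu = ALU()
alu.step(Op.MUL, a=1, b=5)    # load a bias of 5 by multiplying it by one
alu.step(Op.MAC, a=2, b=3)    # accumulate input 2 times weight 3
print(alu.p)                  # 11
```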
The scratchpad register SR is the third component of the NPE. Each NPE scratchpad register is connected to the adjacent NPE scratchpad register, forming a ring. The register contents can be loaded from the din(SR) input (through the ALU), or from the adjacent scratchpad register on the left (using the cin(SR) port connected to the previous NPE). The data input source is selected by the DSRC input signal, controlled by the FSM. The scratchpad register latches its content on the dout(SR) output port every clock cycle, acting as a rotating register when DSRC = 1. When the final sum of products of a unit has been computed, the scratchpad register value is updated from the ALU (DSRC = 0), which occurs simultaneously for all NPEs. After that, the FSM starts shifting the values out to the Activation Function Block (AFB). In turn, the AFB output serves as input to the processing array to compute the next layer, or provides the final FFNN output values.
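The load-then-shift behaviour of the scratchpad chain can be sketched as follows. This is a simplified model under two assumptions that are not stated explicitly in the text: the right-most SR is the one feeding the AFB, and the left-most SR receives zeros while shifting. The function name `shift_out` is hypothetical.

```python
def shift_out(alu_values):
    """Return the sequence of values presented to the AFB.

    alu_values[0] belongs to the left-most NPE, alu_values[-1] to the
    right-most NPE.
    """
    regs = list(alu_values)              # DSRC = 0: all SRs latch their ALU result at once
    stream = []
    for _ in range(len(regs)):           # DSRC = 1: one shift per clock cycle
        stream.append(regs[-1])          # assumed: right-most SR drives the AFB input
        regs = [0] + regs[:-1]           # each SR takes the value of its left neighbour
    return stream


print(shift_out([10, 20, 30]))   # [30, 20, 10]
```

Under these assumptions the value of the right-most NPE leaves first, so a right-aligned layer does not spend cycles shifting through idle NPEs.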
B. ACTIVATION FUNCTION BLOCK (AFB) IMPLEMENTATION
The systolic nature of the proposed architecture means that just one Activation Function Block (AFB) is necessary to perform the neural computations of the whole FFNN. Besides its obvious impact on resource usage reduction, this fact enables an easy modification of the AFB block with a low impact on resource usage, unlike hardware implementations where one activation function per unit is required. Thus, the hardware complexity of the single AFB block can be reasonably high, with the sole consideration that its performance must be high enough to work at the same clock frequency as the NPE blocks, to avoid bottlenecking. Additionally, it is possible to implement FFNNs using a different AFB for each layer by implementing as many AFBs as desired and multiplexing them during the computation process. The following activation functions were implemented in this work: ReLU, Logistic Sigmoid and Hyperbolic Tangent.
1) RELU
It is a relatively new type of activation function that has become a trend in the last 10 years. Networks using the ReLU activation function can be trained faster and have sparser activations. The ReLU output is defined in the range [0, ∞). As its implementation consists of a sign bit evaluation and one conditional signal assignment (Eq. 8), it is the simplest of the proposed activation functions and can be directly implemented in fixed-point arithmetic.
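The sign-bit evaluation can be illustrated on a raw 18-bit two's-complement word. The sketch below is a software analogy, not the VHDL implementation; the name `relu_fixed` and the constant `WORD_BITS` are chosen here for illustration.

```python
WORD_BITS = 18                      # word length used in this work
WORD_MASK = (1 << WORD_BITS) - 1


def relu_fixed(word):
    """ReLU on an 18-bit two's-complement word stored as an unsigned int.

    A set sign bit means the value is negative, so the output is zero;
    otherwise the word passes through unchanged.
    """
    word &= WORD_MASK
    sign = (word >> (WORD_BITS - 1)) & 1
    return 0 if sign else word


print(relu_fixed(0x00123))   # 291 (positive value is passed through)
print(relu_fixed(0x3FFFF))   # 0   (0x3FFFF encodes -1)
```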
2) LOGISTIC SIGMOID
It is a classic differentiable function, used in networks trained with gradient descent methods. The logistic sigmoid is defined in the range [0, 1]. Two approximations of this activation function were implemented:
• Classic piecewise-linear (PLA) [37]. The PLA calculation procedure is shown in Fig. 6. It is based on shift and add operations, where every segment is described by the line y = Sx + B, with the coefficients S chosen to be powers of two. As only comparison, addition and bitwise shift operations are used, the resulting implementation requires few hardware resources. A 9-line approximation was implemented due to its reduced MSE (as seen in Section V-C). It requires 16 stored constants: one array for comparison and one array for constant addition. Due to the symmetrical nature of the sigmoid, the 9-line approximation has the precision of an 18-line PLA. The implementation requires a 3 clock cycle pipeline.
• Zhang second-order approximation [38]. It requires a single multiplication, as described in Eq. 9, implemented using an ALU block and a bit shift (multiplications by a power of two are replaced by bit shift operations). This implementation uses a 4 clock cycle pipeline.
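To give an idea of how a second-order sigmoid approximation of this kind behaves, the sketch below uses the closed form usually associated with Zhang's method: one squaring, with the division by four and the halving being realizable as bit shifts. Since Eq. 9 is not reproduced here, both the expression and the saturation limit of ±4 should be read as assumptions; the function name is hypothetical and floating point is used for clarity.

```python
import math


def zhang_sigmoid(x):
    """Second-order piecewise approximation of the logistic sigmoid (assumed form)."""
    if x >= 4.0:
        return 1.0
    if x <= -4.0:
        return 0.0
    half_sq = 0.5 * (1.0 - abs(x) / 4.0) ** 2   # single multiplication (a square)
    return 1.0 - half_sq if x >= 0 else half_sq


# Largest deviation from the exact sigmoid on a grid over [-8, 8]
grid = [i / 100.0 for i in range(-800, 801)]
max_err = max(abs(zhang_sigmoid(x) - 1.0 / (1.0 + math.exp(-x))) for x in grid)
print(max_err)
```

On this grid the deviation stays in the region of two percent, which is of the same order as the maximum error reported in Section V-C.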
3) HYPERBOLIC TANGENT
Defined in the range [−1, 1], it is another classic differentiable function, used in networks trained with gradient descent methods. It was implemented using the second-order approximation function proposed by Kwan [39], based on the FPGA implementation by Rosado-Muñoz et al. [40]. The original expression for this approximation is described in Eq. 10, where V is the parameter controlling the slope of the approximation function.
To achieve maximum performance, this function has been slightly modified and parallelized using 2 ALU blocks with a 5-clock pipeline delay as depicted in Eq. 11.
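As a software illustration of a second-order tanh approximation in the spirit of Kwan's proposal, the sketch below uses one common parameterization: a parabola on |x| < 2 and saturation at ±1 outside. This choice is an assumption; the slope parameter V of Eq. 10 and the two-ALU parallelization of Eq. 11 are not modelled, and the function name is hypothetical.

```python
import math


def kwan_tanh(x):
    """Second-order approximation of tanh (assumed parameterization)."""
    if x >= 2.0:
        return 1.0
    if x <= -2.0:
        return -1.0
    return x * (1.0 - abs(x) / 4.0)     # division by four is a 2-bit shift in hardware


# Largest deviation from the exact tanh on a grid over [-4, 4]
grid = [i / 100.0 for i in range(-400, 401)]
max_err = max(abs(kwan_tanh(x) - math.tanh(x)) for x in grid)
print(max_err)
```

With this parameterization the deviation peaks slightly above four percent near |x| ≈ 1.8, which is compatible with the error bound quoted in Section V-C.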
C. NETWORK CONTROL SEQUENCES
According to the defined FFNN parameters (number of inputs N_0, number of layers L, number of neuron units in hidden layers N_i with i ∈ {1, ..., L − 1}, and number of outputs N_L), the Finite State Machine (FSM) was designed to automatically execute the required computation sequence.
Algorithm 2 Control Sequence for Weight Loading
A → ALU Global Input
ADDR → Address bus
Set P = A mode (All ALUs) using OPCODE
A (All NPEs) ← DIN    // weight from external input into A
Set ADDR    // address of the weight
Wait for weight propagation    // through ALU pipeline
we ← 1    // Write Enable NPE
Wait one clock cycle
(Repeat the sequence until all weights are stored in NPEs)
For the sake of replicability, the control sequence of the load of weights is described in Algorithm 2, and the control sequence of the FFNN output computation is described in Algorithm 3. Note that, in both algorithms, some adjacent pseudocode steps are executed in parallel. Fig. 5 can be used as support to illustrate the use of the lines and buses referenced in both algorithms.
Most FSM operations are performed cyclically using two counters: one for the address generation, and another for the cycle count. Given the simplicity of the FSM, with few continuously repeated states, the hardware occupation of the FSM is negligible compared to that of the NPEs.
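The output computation sequence can be summarized by a behavioural software model in which the inputs arrive serially while all NPEs accumulate in parallel. The Python sketch below is a functional illustration only: it ignores pipeline delays and fixed-point effects, assumes that each unit's memory holds its bias at the first address followed by its weights, and does not model the physical shift order of the SR chain. The function name `ffnn_forward` and the data layout are hypothetical.

```python
def ffnn_forward(layer_weights, x, afbs):
    """Functional model of the NPE array.

    layer_weights[i][j] = [bias, w_1, ..., w_n] for unit j of layer i+1
    x                   = list of N0 input values
    afbs[i]             = activation function applied after layer i+1
    """
    n_npe = max(len(units) for units in layer_weights)
    inputs = list(x)
    for units, afb in zip(layer_weights, afbs):
        first = n_npe - len(units)            # right alignment: left-most NPEs idle
        p = [0.0] * n_npe                     # one accumulator P per NPE
        for k, mem in enumerate(units):
            p[first + k] = 1.0 * mem[0]       # P = A * B with DIN = 1 loads the bias
        for addr, a in enumerate(inputs, start=1):    # one input per clock cycle
            for k, mem in enumerate(units):           # all NPEs operate in parallel
                p[first + k] += a * mem[addr]         # P = P + A * B
        sr = p[first:]                        # SRs latch the finished sums
        inputs = [afb(v) for v in sr]         # serial pass through the single AFB
    return inputs


relu = lambda v: max(0.0, v)
identity = lambda v: v
w1 = [[0.1, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [-0.5, 0.0, 0.0, 1.0]]
w2 = [[0.0, 1.0, 1.0, 1.0], [1.0, -1.0, 0.0, 0.0]]
print(ffnn_forward([w1, w2], [1.0, 2.0, 3.0], [relu, identity]))
```

For this 3×3×2 example the hidden layer yields 1.1, 2.0 and 2.5, and the two outputs are approximately 5.6 and −0.1, the same values a direct matrix computation gives.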
Algorithm 3 Output Computation Control Sequence
A → ALU Global Input
ADDR → Address bus
AFB → Activation Function Block output
N_i → Neuron units at layer i
Computation of first layer
Set P = A * B mode (All ALUs) using OPCODE
ADDR ← 0    // bias reading for all units in first layer
DIN ← 1    // all biases multiplied by one
P ← A * B (All ALUs)    // bias in B loaded into P
Set P = P + A * B mode (All ALUs) using OPCODE
for each input data X(n) do    // n = {1, ..., N_0}
    Increment ADDR    // adjust address to read weight
    A (All NPEs) ← DIN ← X(n)    // input to ALUs
D. NUMBER OF CLOCK CYCLES OF EXECUTION
From the point of view of the number of clock cycles for execution, the SYMPA architecture presents a very efficient and deterministic behavior. Its systolic nature and mixed serial-parallel architecture permit efficient pipelining: during the layer computation, input data are processed at a rate of one input per clock cycle, i.e. a layer with N_i inputs requires N_i + 1 clock cycles to be computed. Depending on the implementation form of the hardware blocks, some additional cycles are needed for the interlayer delay: ALU (T_ALU), SR (T_SR) and AFB (T_AFB). Thus, after the last input of a layer i has entered the core, the next layer i + 1 will be computed after T_ALU + T_SR + T_AFB clock cycles. As an example, for an FFNN with N_0 inputs, N_1 units in the hidden layer and N_2 units in the output layer, the total computing time of the FFNN output for one input data pattern would be (N_0 + 1) + (N_1 + 1) + N_2 + 2 · (T_ALU + T_SR + T_AFB). All layers account for the bias calculation time by adding one clock cycle, except the output layer, which does not need bias calculation. In general, the FFNN computation time can be described by the number of clock cycles, C, given in Eq. 12. The number of clock cycles required for weight loading is given by C_wload.
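The two-layer example can be generalized to any number of layers in a few lines of Python. This sketch extrapolates the pattern of the example rather than transcribing Eq. 12, and it takes the combined inter-layer delay T_ALU + T_SR + T_AFB as a single parameter. The value of 10 cycles used below is an assumption: it is the value that reproduces the cycle counts quoted elsewhere in the text (for instance, 1786 cycles for the 784×196×784 autoencoder). The function name is hypothetical.

```python
def output_cycles(layer_sizes, t_interlayer):
    """Clock cycles to compute one output pattern.

    layer_sizes  = [N0, N1, ..., NL]
    t_interlayer = T_ALU + T_SR + T_AFB (assumed identical for every layer)
    """
    *feeding, n_out = layer_sizes
    n_layers = len(layer_sizes) - 1
    serial_inputs = sum(n + 1 for n in feeding)   # inputs of each layer plus its bias cycle
    return serial_inputs + n_out + n_layers * t_interlayer


cycles = output_cycles([784, 196, 784], t_interlayer=10)
print(cycles)                        # 1786
print(cycles / 550e6 * 1e6)          # computation time in microseconds at 550 MHz
```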
V. RESULTS
Once defined and characterized as shown in previous sections, the architecture was coded in VHDL. As a particular implementation case, we used Xilinx ISE Design Suite 14.7 for synthesis and implementation, targeting the Xilinx XC7VX485T-2FFG1761 Virtex 7 device. The implementation used 18-bit word-length signed fixed-point arithmetic (DIN, DOUT and CIN), with a fractional part of 12 bits. Nevertheless, different bit sizes were also tested. In order to validate the architecture, four datasets were used: three standard datasets, and one additional dataset aimed at a real case application. The datasets are:
• Iris. Dataset using four parameters per input pattern and three output classes.
• Full MNIST. Dataset with 784 (28×28) grayscale 8-bit pixels per sample and 10 output classes.
• Reduced MNIST. Dataset with 400 (20×20) grayscale 8-bit pixels per sample and 10 output classes.
• MIT-BIH & AHA. The MIT-BIH Malignant Ventricular Arrhythmia [41] and AHA (2000 series) [42] databases were processed as in [43], [44] to obtain 15 features. A single output distinguishes two types of rhythm (normal and abnormal).
Each of the above classification problems was trained for different test topologies (MLP, AE and LR) with the scaled conjugate gradient descent algorithm (MLP used backpropagation) in Matlab R2017b using the Deep Learning Toolbox [45]. LR and MLP performance was calculated as recognition error and, in the case of the autoencoder, the cost function was the reconstruction error (MSE). Finally, the selected topologies were:
• Iris: 4×10×3 MLP (HT)
• Full MNIST: 784×196×784 AE (ReLU); 784×600×600×10 MLP (HT+ReLU)
• Reduced MNIST: 400×40×10 MLP (LS); 400×40×40×10 MLP (LS); 400×10 LR (LS)
• MIT-BIH & AHA: 15×20×20×1 MLP (HT)
The list above also indicates the activation function used in each implementation. The Logistic Sigmoid (LS), Hyperbolic Tangent (HT), and ReLU activation functions were used. The use of different activation functions aims to analyze their resource usage and how they affect the performance of the whole design.
The 784×600×600×10 MLP implementation of the Full MNIST classification problem was selected to illustrate the versatility of the proposed architecture for large FFNNs, which permits the use of a different activation function in each layer without impacting performance. In this case, the hyperbolic tangent was used for all layers except the output layer, which used the ReLU activation function.
A. HARDWARE RESOURCES
Table 1 shows the resource usage of six different FFNN implementations. The number of DSP48E blocks matches the number of NPEs (each NPE uses one DSP48E block in its internal ALU) plus the number of DSP48E blocks required for the AFB. The different DSP48E requirements of each AFB can be seen in the table. The table clearly shows that the number of LUTs and slice registers used is a linear function of the number of NPEs.
Concerning distributed RAM memory usage (BRAM36 memory blocks), the required number is NPE/2, since all implementations use an 18-bit word length and a BRAM36 can accommodate a 36-bit word width, which is shared by two NPEs. Depending on the word length and the distributed RAM block used (it varies from one device family to another, and between device manufacturers), this value could change. Using a word length above 18 bits would imply dedicating a single BRAM block to each NPE. Word lengths that are divisors of 36 are preferred, especially 18 and 9 bits, which are natively supported by manufacturer cores. In this case, as we chose an 18-bit word length for weights and the BRAM in the selected device has a depth of 1024, each NPE can accommodate up to 2048 weights.
It is also important to consider the number of weights to be stored in memory, since the size of the distributed RAM block also varies with device and manufacturer; a large number of weights stored in a single NPE can imply an extra block RAM per NPE. This is the case of the 784×600×600×10 FFNN in Table 1, where one BRAM block per NPE is required.
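The BRAM36 count can be estimated with a short calculation. The sketch below rests on our interpretation of the preceding paragraphs: two NPEs can share one BRAM36 as long as each needs at most 1024 18-bit words, and an NPE needing more is given whole blocks of 2048 words. These thresholds, the behaviour beyond 2048 words and the function name are assumptions; Table 1 remains the reference for the actual figures.

```python
import math


def bram36_blocks(n_npe, words_per_npe, shared_depth=1024, single_depth=2048):
    """Estimated number of BRAM36 blocks for 18-bit weights."""
    if words_per_npe <= shared_depth:
        return math.ceil(n_npe / 2)                  # one block shared by two NPEs
    blocks_each = math.ceil(words_per_npe / single_depth)
    return n_npe * blocks_each                       # dedicated block(s) per NPE


# 784x196x784 AE: 784 NPEs, (784 + 1) + (196 + 1) = 982 words per NPE
print(bram36_blocks(784, 982))     # 392
# 784x600x600x10 MLP: 600 NPEs, 785 + 601 + 601 = 1987 words per NPE
print(bram36_blocks(600, 1987))    # 600
```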
Considering multilayer FFNNs, the resource usage of the implemented architectures does not change significantly as long as new layers contain the same number of units as previous layers, or fewer. As an example, the 400×40×10 and 400×40×40×10 implementations show nearly the same hardware requirements in terms of DSP48E, LUT, and BRAM36 blocks.
B. MAXIMUM FREQUENCY OF OPERATION
Table 1 clearly shows that the proposed design achieves a very high frequency of operation across implementations. In fact, the core architecture can work at 550MHz, which is the limiting frequency of operation specified by Xilinx for DSP and BRAM slices in Virtex7. In other words, as the proposed architecture requires a reduced amount of logic and block slices, along with short delays in interconnections, the maximum frequency of operation of its core design is only limited by the frequency of operation of the FPGA device technology used. This is why the maximum frequency of operation for a large FFNN (784×196×784) implementation is 550MHz.
However, it can also be seen that implementations using the hyperbolic tangent activation function have a maximum frequency of 490MHz and those using the logistic sigmoid activation functions have a maximum frequency of 498MHz. This is because, although the core architecture frequency is only limited by the frequency limitation of its slices, the block limiting the clock speed of the whole architecture is the AFB. In the case of the 784×196×784 implementation, the maximum frequency of operation reaches 550MHz due to the simplicity of its ReLU activation function, which avoids the AFB bottleneck and lets the design match the maximum frequency of operation of the device. On the other hand, implementations using the hyperbolic tangent and logistic sigmoid activation functions present a lower maximum frequency of operation. This must be taken into consideration when maximum throughput is required.
With N being the number of NPEs, the peak performance of the whole design (using the ReLU activation function) is f_max · N synaptic Operations Per Second (OPS). Note that this benchmark depends on the selected topology; e.g. in the case of the 784×196×784 AE using the ReLU activation function, the total performance is 550 MHz × 784 NPEs = 431.2 GOPS. As the maximum estimated number of NPEs in a Virtex-7 family device is 3600 (the maximum number of DSP48E blocks included in a device), the proposed architecture can perform up to 1980 Giga operations per second (GOPS) on the biggest Virtex-7 FPGA device. Moreover, using a multi-chip approach, the size of the FFNN could be enlarged through a simple interface between devices, as few lines are required to connect NPEs with the FSM and with other NPEs.
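The peak performance figures quoted above follow directly from the product of clock frequency and NPE count, as the following short Python check illustrates (the function name is hypothetical):

```python
def peak_gops(f_max_mhz, n_npe):
    """Peak synaptic operations per second, in GOPS: one MAC per NPE per cycle."""
    return f_max_mhz * n_npe / 1000.0


print(peak_gops(550, 784))     # 431.2
print(peak_gops(550, 3600))    # 1980.0
```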
C. ACTIVATION FUNCTION
Four Activation Function Blocks (AFB) have been implemented according to the approximations described in Section IV-B:
• PLA LS: Classic piecewise-linear approximation of the Logistic Sigmoid activation function.
• Zhang LS: Zhang's 2nd order approximation of the Logistic Sigmoid activation function.
• Kwan HT: Kwan's 2nd order approximation of the Hyperbolic Tangent activation function.
• ReLU: Rectified linear unit.
It is important to analyze the impact of the approximations in the accuracy results. Fig. 7 (top left and top right) shows the similarity in shape of the real-valued non-approximated function and the proposed approximations for Hyperbolic Tangent (Kwan approximation) and Logistic Sigmoid (PLA and Zhang approximations). The bottom left and bottom right show the absolute error for the approximated functions, which is always under 4.3% in case of Kwan approximation, and 2.1% in case of PLA and Zhang approximations.
All four approximations were independently implemented in hardware in order to verify the required hardware resources. The summary of the FPGA implementations is presented in Table 2. The upper half of the table reports the resource usage, i.e., the number of Look Up Tables (LUTs), registers and DSP blocks (DSP48E); no memory is used in any implementation. As can be seen, the PLA LS approach uses three times as many LUTs as the Zhang LS approximation, but avoids using DSP48E slices (because only shifts and adds are used). The ReLU implementation, in contrast, shows a very low use of resources because of its simplicity. Nevertheless, taking into account that only one AFB is needed for the whole implementation, the resource usage of the Activation Function Block (AFB) can be considered negligibly small in all cases, and thus does not influence the design complexity. However, its performance is important in order to obtain the fastest possible clock operation. The bottom half of Table 2 shows the maximum clock frequency (f_max), which is clearly dependent on the pipeline design (except for ReLU): fewer pipeline stages decrease the clock frequency and increase the amount of logic used. The table also shows the MSE and maximum error for the approximated functions. The number of pipeline stages T_AFB, which depends on the AFB used, must be considered in the design of the FSM for proper data synchronization, as described in Section IV-D.
Given the simplicity of the ReLU implementation, its clock frequency limitation comes from the delay in logic resources, which is 550MHz. On the other hand, Zhang LS and Kwan HT implementations show similar complexity, achieving an f max around 490MHz. Finally, an f max of 270MHz reveals that the PLA LS implementation is a hard bottleneck for the whole performance of the architecture. This illustrates the paramount importance of the AFB block design for achieving good performance in the proposed architecture. As a result, the PLA implementation was discarded for further analysis and not included in accuracy results.
D. ACCURACY
In addition to performance and resource occupation, a very important issue is the accuracy of the proposed computing architecture, since fixed-point arithmetic is used. An analysis of six different FFNN implementations was carried out, comparing the output of the neural network implementations with their 64-bit floating point PC Matlab-based counterparts. Using Matlab, it was found that the Iris MLP implementation obtained 98.67% classification accuracy, both the MLP and LR implementations of the Reduced MNIST dataset obtained 95.2% classification accuracy, and the Full MNIST autoencoder reconstruction MSE was 0.21. Table 3 shows the MSE error of the reached classification accuracy for different FPGA implementations when compared to Matlab.
TABLE 3. Accuracy results for several FFNN classifiers with variable data size and using three different datasets. The MSE was calculated against the 64-bit floating point Matlab implementation. Default format is Q6.12 unless another fractional part is stated.
In order to evaluate the influence of fixed-point arithmetic, the Full MNIST dataset using both a 784×196×784 AE and a 784×600×600×10 MLP were implemented with 18-bit word-length and three different fractional part sizes: Q6.12, Q9.9 and Q12.6. As expected, the MSE error increases when bit size decreases (Table 3) but the MSE error is still negligible and thus, it can be considered that fixed-point arithmetic is not affecting the neural network results.
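The effect of the fractional part size can be explored in software with a simple quantization helper. The sketch below converts a real value to an 18-bit signed fixed-point word with a configurable number of fractional bits. Round-to-nearest and saturation on overflow are assumptions made here, as the rounding mode of the hardware is not specified in this section, and the function names are hypothetical.

```python
def to_fixed(value, frac_bits=12, word_bits=18):
    """Quantize a real value to a signed fixed-point integer (default Q6.12)."""
    scaled = round(value * (1 << frac_bits))        # assumed: round to nearest
    lo = -(1 << (word_bits - 1))
    hi = (1 << (word_bits - 1)) - 1
    return max(lo, min(hi, scaled))                 # assumed: saturate on overflow


def from_fixed(word, frac_bits=12):
    """Convert a fixed-point integer back to a real value."""
    return word / (1 << frac_bits)


for frac in (12, 9, 6):                             # Q6.12, Q9.9 and Q12.6
    w = to_fixed(0.3, frac_bits=frac)
    print(frac, w, abs(from_fixed(w, frac) - 0.3))
```

The printed error grows as the fractional part shrinks, which is the trend reported in Table 3.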
Note that the largest FFNN with ReLU units has a low MSE due to the ReLU activation function which provides mathematically exact results regardless of its fixed/floating point representation. This network was trained with regularization and dropout to obtain small weights ([−1, 1]) and avoid overflow problems with easy fixed-point implementation.
Concerning the Logistic Sigmoid, it is a very efficient data range limiter, limiting data amongst the layers into the [−1, 1] range. Thus, when the implementation uses saturated arithmetic, the numeric data overflow does not become a problem and fixed-point arithmetic is valid. In fact, Table 3 shows that MSE error is mostly linked to the AFB implementation approximations done, rather than the fixed-point implementation versus floating-point implementation.
One of the interesting and useful properties of neural networks is their robustness to weight rounding. This fact can be used to optimize the hardware resources by reducing the memory size. Table 4 gathers the classification accuracy for different fractional part sizes using three implementations, including the large 3-layered 784×600×600×10 MLP for the Full MNIST dataset. The results show that FFNNs using between 6 and 10 bits of fractional part have performance comparable to FFNNs using double precision floating-point weights. This is in line with some studies [46], [47] showing that weight precision can be drastically reduced without compromising the network accuracy.
VI. REAL CASE APPLICATION
To test the performance achieved by the proposed architecture on the FPGA against other platforms, a real case application is proposed. The aim of this application is to discern between the normal function of the heart and several pathologies such as Ventricular Tachycardia (VT) and Ventricular Fibrillation (VF), amongst others. To feed the classifier, the input data (ECG signal) were preprocessed in several stages [43], [44]. The first step consisted of baseline wandering removal (denoising), Fig. 8, using an 8th order IIR Butterworth bandpass filter with a passband from 1 Hz to 45 Hz. In the following stage, prior to a time-frequency Pseudo Wigner-Ville representation, a window signal alignment was required. The result was a two-dimensional matrix image, whose dimensionality was reduced with a kernel average. Finally, the smoothed image was subsampled, obtaining 15 values used as input data to the FFNN. Thus, the classification phase executed by the neural network is the last step in identifying the normal/non-normal behavior of a human heart. For comparison purposes, we only analyzed the neural network.
A multilayer perceptron (MLP) was proposed, using the MIT-BIH Malignant Ventricular Arrhythmia [41] and AHA (2000 series) [42] databases for training and testing. Two-thirds of the data were randomly chosen for training, whereas the rest of the data were used for testing. The classifier was designed and trained with back-propagation using the Matlab 'Neural Networks' Toolbox. The hyperbolic tangent was selected as activation function. Finally, an MLP with 15 inputs, 20 neurons in the 1st hidden layer, 20 neurons in the 2nd hidden layer and one single output was obtained (15×20×20×1).
The neural network was implemented in the FPGA using the 2nd order Kwan approximation of the hyperbolic tangent as activation function [43]. The complete architecture, including the AFB block, was configured to operate in Q6.12 fixed-point format.
The resource usage and performance are detailed in Table 5, which also shows the number of cycles required to process all 15 inputs. The results in the table show a reduced use of memory, high performance (490MHz) and a low number of clock cycles (84) for processing each 15-input pattern, which means that 5.83 Msamples/second could be processed (171ns processing time).
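The processing time and throughput follow from the cycle count and the clock frequency. A short Python check of the figures above (function names are hypothetical):

```python
def latency_ns(cycles, f_mhz):
    """Processing time of one input pattern, in nanoseconds."""
    return cycles / f_mhz * 1000.0


def throughput_msps(cycles, f_mhz):
    """Patterns processed per second, in millions, with one pattern in flight."""
    return f_mhz / cycles


print(latency_ns(84, 490))         # about 171.4 ns
print(throughput_msps(84, 490))    # about 5.83 Msamples/s
```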
To evaluate the results of the FPGA-implemented FFNN classifier, Table 6 compares its processing time with that of the same computation on other platforms, including a 32-bit ARM Cortex-M4 microcontroller [48]. In this case, the processing time would be 3.3 µs @120 MHz. In turn, the same computation running in Matlab needs an average of 43.79 µs per input pattern. In any case, the FPGA implementation is able to perform the computation in a much shorter time, using reduced hardware.
VII. DISCUSSION
The proposed architecture provides great versatility: it can implement an arbitrary number of layers with no hardware increase other than the RAM for weight storage, which is extremely useful for deep multi-layer neural networks. The outputs of intermediate layers can be accessed externally, because the output of the AFB block, where the outputs of all units are evaluated, is connected to an external port. This feature can be used for network training algorithms or for debugging. Additionally, the weight values can be modified during execution by writing to the RAM memories, without device reprogramming. Another relevant characteristic of this architecture is the use of a single Activation Function Block (AFB) for the whole FFNN. Fed serially, this AFB block evaluates the non-linear neuron output function for the sum of products generated in each FFNN neuron unit. At first glance, it may appear that having a single AFB block for the whole neural network would limit performance, but in fact it helps to maximize it. As only one block is necessary, the resources required for this block are negligibly small (Section V-C) compared with approaches using one activation function per neuron unit. Moreover, it is possible to implement several AFB blocks that are switched between layer computations; as an example, Table 1 shows the results of implementing the 784×600×600×10 MLP using the Kwan HT activation function in all layers except the output layer, which uses the ReLU activation function. Table 2 illustrates the paramount importance of the AFB design in this architecture. Here, the differences in the maximum frequency of operation are exclusively due to the AFB implementation. The ReLU implementation is the fastest, enabling the FFNN to work at 550MHz (in this case, the maximum frequency of operation of the FPGA modules: DSP48E and BRAM).
On the other hand, the PLA implementation of the Logistic Sigmoid makes the overall speed drop to 270MHz, which means that this block bottlenecks the system. In turn, the Zhang LS and Kwan HT implementations show better performance, achieving around 490MHz: both still bottleneck the system, but they maintain a high clock frequency. It is also important to consider the pipeline delay of each AFB (T_AFB). The number of cycles of execution is fully deterministic in this architecture. It is described by Eq. 12, where T_ALU, T_SR and T_AFB are additional cycles due to the propagation time in the ALU, Scratchpad Register and AFB block, respectively. As an example, Table 7 shows the number of cycles achieved for four different implementations, with T_ALU + T_SR = 8. The output computation time also depends on the clock frequency. As an example, in the case of the 784×196×784 AE using the ReLU activation function, the total performance is 550 MOPS per NPE × 784 NPEs = 431.2 GOPS and the output computation time is 1786 / (550 · 10^6) s ≈ 3.24 µs.
The comparison amongst hardware platforms reports a considerable acceleration when using the FPGA implementation of the proposed architecture. This study was conducted using a real case application and the same computation algorithm on all platforms (Table 6). When the FFNN is implemented in an FPGA, the output (the result of classifying the inputs) is generated after 84 clock cycles, which is 171ns @490MHz. In turn, a careful assembler coding of the algorithm in an LPC4337 32-bit ARM Cortex-M4 microcontroller requires 3.3 µs @120MHz. In this case, the sequential nature of the execution and the Von Neumann architecture restrict the efficiency of the computation compared with the parallel FPGA computation. Finally, the execution of the FFNN in Matlab running on a PC (Intel Core i7-7700HQ CPU) needs an average computation time of 43.79 µs @2.80GHz. As can be seen, the FPGA accelerates the computation by roughly ×20 with respect to the LPC4337 MCU and by ×256 with respect to the PC, not considering power consumption, which is generally much lower in an FPGA than in an MCU or CPU.
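The acceleration factors can be recomputed from the processing times stated above; the short Python check below uses only those three figures (the function name is hypothetical):

```python
def speedup(t_reference_ns, t_fpga_ns):
    """How many times faster the FPGA is than the reference platform."""
    return t_reference_ns / t_fpga_ns


T_FPGA_NS = 171.0       # 84 cycles @490MHz
T_MCU_NS = 3300.0       # LPC4337 @120MHz
T_PC_NS = 43790.0       # Matlab on PC, average per input pattern

print(speedup(T_MCU_NS, T_FPGA_NS))    # about 19.3
print(speedup(T_PC_NS, T_FPGA_NS))     # about 256.1
```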
To enable hosting very large FFNNs with reduced memory occupation, the proposed architecture uses fixed-point for arithmetic and weight storage. Nevertheless, it has been demonstrated (Section V-D) that using an 18-bit word length for weights achieves a classification performance comparable to double precision floating-point weight values and computation (Table 6). Furthermore, a fractional part of between 6 and 10 bits proves to be enough to achieve classification accuracy similar to that of floating point. This fact is very important due to the high number of weights in large FFNNs, which requires a large memory for storage. The reduced weight size allows this architecture to hold more weights in the BRAM of the same number of NPEs.
A direct comparison of the proposed architecture to other works in the bibliography is difficult. Different works take different approaches to the architectural solutions and authors tend to use relative metrics. In a search for similar works, Table 8 shows three implementations using the architecture proposed in this work and eight approaches by other authors. In order to have comparable values, the hardware resources used in each implementation are normalized to the total number of units in the FFNN, e.g. a 4×8×3×3 MLP has 14 neuron units, which means that the reported total resource values are divided by 14 in order to obtain the normalized resources per unit.
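The normalization can be expressed in a few lines of Python. The unit count excludes the inputs, as in the 4×8×3×3 example; the function names and the resource total used in the usage lines are purely illustrative and are not taken from Table 8.

```python
def neuron_units(layer_sizes):
    """Number of neuron units: all layers except the input layer."""
    return sum(layer_sizes[1:])


def per_unit(total_resources, layer_sizes):
    """Resource figure normalized to the number of neuron units."""
    return total_resources / neuron_units(layer_sizes)


print(neuron_units([4, 8, 3, 3]))        # 14
print(neuron_units([4, 10, 3]))          # 13
print(per_unit(2800, [4, 8, 3, 3]))      # 200.0 (2800 is an illustrative total, not a measured value)
```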
Ferreira and Barros [49] achieved a ×36 speedup over a GCC compilation on a Linux PC with an Intel Xeon @1.6GHz, for a 4×8×3×3 MLP (14 units). Their use of memory, LUTs and registers is significantly higher than that of our equivalent implementation of a 4×10×3 MLP (13 units).
Vranjkovic and Struharik [50] reported a coarse-grained accelerator on the same Virtex 7 platform as this work. They propose several FFNN implementations and provide average resource occupation results. They report a maximum operating frequency of 113MHz and an average ×48 speedup over a Weka/PC software implementation. Note that our proposal works at 490MHz using the logistic sigmoid activation function, and accelerates by ×256. The proposal uses more DSP blocks and registers, but slightly less memory.
Suzuki et al. [51] showed a 4×2×4 autoencoder architecture achieving a maximum clock frequency of 231MHz. In this case, all reported occupation values are remarkably higher than those of our proposal. Furthermore, we use a single-clock data processing rate for the whole system, whereas [51] uses several clock signals.
Nedjah et al. [52] computed a 220×24×10 MLP (34 units) in 356 clock cycles, whereas our implementation would solve it in 276 clock cycles (33% fewer clock cycles). The normalized memory usage for our similar 15×20×20×1 MLP (41 units) is also lower.
Oliveira et al. [53] implemented a 4×8×3×3 MLP for the Iris problem using 90 clock cycles at 77.8MHz, whereas our architecture would do it using 51 cycles at 490MHz with lower memory usage.
Huynh [18] proposes different FFNN implementations. The approach is similar to ours: all neuron units of the largest layer are implemented and processing is serial. For 784×126×126×126×10 and 784×40×40×40×10 MLPs for Full-MNIST dataset classification, i.e. 388 and 130 neuron units, respectively, a much lower clock frequency and clock cycle number are obtained. Compared to our 784×600×600×10 MLP (1230 units), these designs are slightly better in memory and DSP usage but use more registers and LUTs.
Finally, Zhai et al. [54] propose a 12×3×1 MLP (4 units) in a Xilinx Zynq SoC to detect and classify gas sensor data with a processing time of 540ns. Our architecture uses fewer resources and achieves a ×7 speedup.
In summary, the obtained results show remarkable benefits of using the proposed architecture to accelerate FFNN computation. It provides a high-end computing platform with superior speed performance, able to compute the output of large FFNNs much faster than other works in the literature. Furthermore, FPGA devices typically require less power than a PC or MCU and need only a small board to work, which enables the integration of online FFNN computation in multiple applications. This is especially important in our proposal, where a single-chip solution is given, without any external memory or additional device that may act as a bottleneck.
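Clock cycle counts and clock frequencies can be combined into a processing time per network output, which makes comparisons such as the one above easier to read. The short Python sketch below does this for the two figures quoted for the 4×8×3×3 MLP; the function name `latency_seconds` is hypothetical, and the resulting ratio is plain arithmetic on the quoted numbers, not a value reported in Table 8.

```python
def latency_seconds(clock_cycles, clock_hz):
    """Time to produce one network output: cycles divided by clock frequency."""
    return clock_cycles / clock_hz


t_ref = latency_seconds(90, 77.8e6)    # 90 cycles at 77.8 MHz, as quoted for [53]
t_ours = latency_seconds(51, 490e6)    # 51 cycles at 490 MHz, as quoted for this work

print(f"[53]:      {t_ref * 1e9:.0f} ns")
print(f"this work: {t_ours * 1e9:.0f} ns")
print(f"ratio:     {t_ref / t_ours:.1f}")
```

With these inputs the processing time drops from roughly 1.16 µs to roughly 0.10 µs, a ratio of about 11.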
VIII. CONCLUSIONS
The proposed SYMPA architecture exploits the fact that different types of FeedForward Neural Networks (FFNN) differ only in their interconnection schemes. The computational procedure can therefore be defined generally, regardless of the number of inputs, outputs, hidden layers or hidden neurons per layer. It provides a modular procedure for the single-chip FPGA implementation of any fully connected FFNN (MLP, AR, LR), limited only by the resources available in a device. Its systolic nature and pipelined design make it possible to obtain linear scalability in resource occupation when increasing the number of units in the FFNN, with no resource increase when adding more layers, except for weight memory storage. The architecture uses a single activation function block for the whole FFNN. Apart from the obvious resource savings, this allows customization of the activation function model and readout of intermediate unit outputs. It is also possible to update weights during normal operation (no device reprogramming required). By using a mixed serial-parallel architecture based on Neural Processing Elements (NPE), each containing an ALU and a RAM block, the resulting computation time is a linear function of the number of layers and the number of inputs. The maximum clock speed, however, is fixed and independent of the FFNN size; only the number of clock cycles changes for different FFNNs. Thus, combining versatility, simplicity and high performance, the proposed architecture is a clearly viable candidate for practical implementations with standard off-the-shelf hardware.
The hardware architecture combines concepts from matrix computation fundamentals, mixed serial-parallel computer architecture, and specific hardware available in current FPGA devices, such as ALUs and distributed RAM. The architecture presents excellent scalability by replicating NPEs, providing local interconnection among adjacent NPEs and reduced global control signals, thus reducing delays and optimizing the operating clock frequency. The resource usage depends linearly on the size of the largest network layer, i.e. the number of NPEs (section V-A). Thus, the system can be easily scaled by adding or removing NPEs connected to the systolic ring, with an adequate modification of the FSM. Scalability is limited only by the availability of hardware resources, although it is possible to create a multi-chip FFNN by simple inter-chip connections.
A practical implementation in a Xilinx Virtex 7 FPGA device can host multiple-layer FFNNs with up to 3600 units per layer without using external memory, obtaining high concurrency in computation and reaching up to 1980 Giga Operations Per Second (GOPS). The maximum clock frequency achieved is 550MHz using the ReLU activation function, and 490MHz using logistic sigmoid or hyperbolic tangent. It is important to remark that the maximum clock frequency depends only on the activation function used, since NPEs work at the maximum possible device speed independently of the neural network size. This is not the usual situation in other designs, where increasing the complexity means a decrease in clock speed. An important result of this work is that a reduced word length for weights is sufficient for proper FFNN operation. Since memory size can be a limiting factor in large FFNNs, using a reduced bit size for weights allows more weight values to be stored in the same memory. Other authors use external memory to store weights. By using on-chip memory instead, there is no RAM interface bottleneck, which accelerates the whole design. In general, the architecture proposed in this work is significantly different from the aforementioned approaches due to the combination of matrix algebra and resource optimization.
Current research on similar architectures for matrix operations [55] suggests that the proposed design can be easily adapted for Recurrent Neural Networks and Restricted Boltzmann Machines and, additionally, used in combination with backpropagation-based on-chip learning methods.
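The mixed serial-parallel scheme summarized above can be illustrated with a small behavioral model in Python. This is only a sketch under stated assumptions: inputs are fed one per step, every NPE performs one multiply-accumulate per step using weights from its own local memory, and a single shared activation function processes the accumulated results one at a time. The function names, the bias handling and the example weights are hypothetical, and neither the ring transfers between NPEs nor the clock-cycle timing are modeled.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def layer_serial_parallel(weights, biases, inputs, activation):
    """One layer: all NPEs work in parallel while inputs arrive one per step.

    weights[j][i] is the weight from input i to unit j, held in the local
    memory of NPE j. Behavioral model only, not cycle-accurate.
    """
    accumulators = list(biases)                 # one accumulator per NPE
    for i, x in enumerate(inputs):              # serial over the inputs
        for j in range(len(accumulators)):      # parallel in hardware
            accumulators[j] += weights[j][i] * x
    # A single shared activation block handles the NPE results one by one.
    return [activation(a) for a in accumulators]


def ffnn_forward(layers, inputs, activation=sigmoid):
    """Reuse the same set of NPEs for every layer in turn."""
    values = inputs
    for weights, biases in layers:
        values = layer_serial_parallel(weights, biases, values, activation)
    return values


# Hypothetical 2x2x1 network, for illustration only.
layers = [
    ([[0.5, -0.25], [1.0, 0.75]], [0.0, -0.5]),   # hidden layer: 2 units
    ([[1.5, -2.0]], [0.25]),                      # output layer: 1 unit
]
print(ffnn_forward(layers, [1.0, 2.0]))
```

In this model the number of serial steps per layer equals the number of inputs to that layer, and the number of accumulators needed is set by the widest layer, which mirrors the scaling behavior described in the text.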
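The peak throughput figure is consistent with a simple product of the number of NPEs and the clock frequency. The Python check below assumes one operation per NPE per clock cycle; that per-cycle operation count is an assumption made here to reproduce the quoted figure, not a definition taken from the text.

```python
npes = 3600                    # maximum units per layer quoted for the Virtex 7 device
clock_hz = 550e6               # maximum clock frequency quoted for ReLU
ops_per_npe_per_cycle = 1      # assumption: one operation per NPE per clock cycle

peak_gops = npes * clock_hz * ops_per_npe_per_cycle / 1e9
print(peak_gops)               # 1980.0
```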
One area of future work will be the adaptation of the architecture to work with layers containing more units than the number of available NPEs. It can be done by storing partial layer results in additional BRAM block memory and repeating the input feeding to the NPEs so that the result is obtained after several iterations (as a 'time vs. size' trade-off). It is also straightforward to adapt the architecture for very large FFNN by using multiple devices since the communication between chips would be very simple, expanding the application to any deep learning application where FFNN contain multiple layers with a large number of units per layer. Since the speed of operation is limited by the maximum chip frequency, future FFNN implementations in other devices would increase the performance.
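The 'time vs. size' trade-off mentioned for layers wider than the available NPEs can be sketched as follows. This Python model is an assumption about how such an extension might behave, not a description of implemented hardware: the layer is split into chunks of at most `available_npes` units, the same inputs are fed again for each chunk, and the partial results are collected in a list that stands in for the additional BRAM. All names and example values are hypothetical.

```python
import math


def passes_needed(layer_units, available_npes):
    """Number of times the inputs must be fed to cover the whole layer."""
    return math.ceil(layer_units / available_npes)


def layer_in_passes(weights, biases, inputs, available_npes):
    """Pre-activation sums of a layer that is wider than the NPE count."""
    results = []                                   # stands in for the extra BRAM
    for start in range(0, len(weights), available_npes):
        stop = min(start + available_npes, len(weights))
        # The same inputs are fed again for every chunk of units.
        for j in range(start, stop):
            total = biases[j]
            for w, x in zip(weights[j], inputs):
                total += w * x
            results.append(total)
    return results


# Hypothetical layer with 5 units and 2 inputs, computed on 2 NPEs.
weights = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0], [-2.0, 1.0], [0.25, 0.25]]
biases = [0.0, 1.0, -1.0, 0.5, 0.0]
print(passes_needed(len(weights), 2))
print(layer_in_passes(weights, biases, [2.0, 4.0], 2))
```

In this sketch the computation time grows with the number of passes, while the number of NPEs stays fixed.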
