Abstract. Training large-scale neural networks, such as those used in Deep Learning schemes, requires long computation times or high-performance computing solutions such as clusters or GPU boards. As a possible alternative, in this work the Back-Propagation learning algorithm is implemented on an FPGA board using a layer multiplexing scheme, in which a single layer of neurons is physically implemented in parallel but can be reused any number of times in order to simulate multi-layer architectures. An on-chip implementation of the algorithm is carried out using a training/validation scheme in order to avoid overfitting effects. The hardware implementation is tested on several configurations, allowing the simulation of architectures comprising up to 127 hidden layers with a maximum of 60 neurons in each layer. We confirmed the correct implementation of the algorithm and compared the computational times against C and Matlab code executed on a multicore supercomputer, observing a clear advantage of the proposed FPGA scheme. The layer multiplexing scheme provides a simple and flexible approach in comparison to standard implementations of the Back-Propagation algorithm, representing an important step towards the FPGA implementation of deep neural networks, one of the most novel and successful existing models for prediction problems.
Introduction
Artificial Neural Networks (ANN) [12] are mathematical models inspired by the functioning of the brain that have been successfully applied to clustering and classification problems in several domains. The Back-Propagation algorithm (BP), introduced by Werbos in 1974 [46] and popularized through the work of Rumelhart et al. [38], is the most widely used learning procedure for training feed-forward neural network (FFNN) architectures in classification and regression problems. It is a gradient-descent-based method that minimizes the error between targets and network outputs, computing the derivatives of the error in an efficient way [25, 36]. As with any gradient descent algorithm, the search for a solution can get stuck in local minima, but in practice the algorithm is quite efficient, and as such it has been applied to a wide range of areas such as pattern recognition [23], medical diagnosis [37] and stock market prediction [35].
Even if current computational power makes it possible to train neural network models relatively fast, large architectures and/or large pattern data sets may require parallel strategies to speed up the training process. In particular, a recently popularized model known as Deep Learning, usually applied to large training data sets, relies on a training process that may take several days or even weeks to complete [5, 17]. In this sense, alternatives based on cluster computing, GPUs and FPGAs are sensible strategies, each with its own benefits and drawbacks [10, 29, 43, 44]. In particular, Field Programmable Gate Arrays (FPGA) [18] are reprogrammable silicon chips that use prebuilt logic blocks and programmable routing resources that can be configured to implement custom hardware functionality. Neuro-inspired models of computation process information with a very large degree of parallelism, and as such one of the main advantages of FPGAs over the previously mentioned alternatives (cluster computing and GPUs) for implementing them is their intrinsic parallelism. On the other hand, programming FPGAs is relatively more complex than programming the other platforms, which might explain why they have not yet been much used for Deep Learning.
Several studies have analyzed the implementation of neural network models in FPGAs [19-21, 24, 32], applying one of the two existing alternatives for their implementation: off-chip and on-chip learning. In off-chip learning implementations [13, 30] the training of the neural network model is performed externally, usually on a personal computer (PC) to which the FPGA is attached, and only the synaptic weights are transmitted to the FPGA, which acts as a hardware accelerator. On-chip learning implementations, on the other hand, include both the training and execution phases of the algorithm [6, 32, 41], permitting the whole process to be carried out on the FPGA board independently of an external device. Existing implementations of the Back-Propagation algorithm on FPGA boards include the works of [9, 29, 31, 39]. In all of these works the neural network architecture is fixed in advance by the designer, as the number of neurons and hidden layers is limited by the available FPGA resources. Although recent advances in the computational power of these boards have permitted an increase in the size of the architectures, the number of layers that can be implemented is still limited and, as said above, this number must also be fixed before the network is applied.
For this reason, this work introduces a layer multiplexing scheme for the on-chip implementation of the BP algorithm on a Virtex-5 XC5VLX110T FPGA board. This scheme consists of physically implementing a single layer of neurons that can be reused any number of times in order to simulate architectures with any number of hidden layers [13]. The number of hidden layers that can be used in a neural architecture is limited only by the time constraint related to the execution time and by a maximum of 127 imposed by the memory resource design, although it has been observed previously [3, 8] and confirmed in the present work that the performance of the BP algorithm is drastically reduced when the number of layers is too large. In this respect, a promising new field named Deep Learning has attracted the attention of several researchers and companies in recent years, due to the great success of deep neural network architectures in several pattern recognition contests [14, 22, 40]. Deep learning schemes require additional strategies or modifications to the standard BP algorithm in order to be applied successfully [45], but so far all existing alternatives require heavy computational resources. The aim of this work is to build a simple and flexible implementation of the BP algorithm that permits the efficient simulation of deep Back-Propagation neural networks, contributing to their study and application, and also opening new strategies towards the simulation of deep learning neural networks.
The organization of the present work is as follows: the next section includes relevant implementation details about the BP algorithm. The FPGA implementation is described in Section 3, which contains the technical details of the implementation. The work continues with a results section where the implementation is tested and characterized, and finishes with the discussion and conclusions.
The Back-Propagation algorithm
The Back-Propagation algorithm is a supervised learning method for training multilayer artificial neural networks. Even though the algorithm is very well known, we summarize in this section the main equations involved in its implementation, as they are important in order to understand the current work.
Let's consider a neural network architecture comprising several hidden layers. If we consider the neurons belonging to a hidden or output layer, the activation of these units, denoted by y_i, can be written as:
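The displayed equation is missing from this copy. Assuming the standard form implied by the symbols defined in the sentence that follows, Eq. (1) would read:

```latex
y_i = g(h_i) = g\Big(\sum_{j} w_{ij}\, s_j\Big) \tag{1}
```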
where w_ij are the synaptic weights between neuron i in the current layer and the neurons of the previous layer with activation s_j. In the previous equation, we have introduced h as the synaptic potential of a neuron. The activation function used, g, is the logistic function given by the following equation:
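Eq. (2) is not reproduced in this copy either. A logistic function with unit slope is assumed here, which agrees with the derivative S(1 − S) used later in the δ block; the original may include an explicit slope parameter:

```latex
g(x) = \frac{1}{1 + e^{-x}} \tag{2}
```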
The objective of the BP supervised learning algorithm is to minimize the difference between given outputs (targets) for a set of input data and the output of the network. This error depends on the values of the synaptic weights, and so these should be adjusted in order to minimize the error. The error function computed for all output neurons can be defined as:
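The error function itself is missing here. A sum-of-squares form matching the description that follows is sketched below; the factor 1/2 is an assumption, chosen so that the gradient yields the update rule without extra constants:

```latex
E = \frac{1}{2} \sum_{k=1}^{p} \sum_{i=1}^{M} \big(z_i(k) - y_i(k)\big)^2 \tag{3}
```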
where the first sum is over the p patterns of the data set and the second sum is over the M output neurons. z_i(k) is the target value for output neuron i for pattern k, and y_i(k) is the corresponding output of the network. Using gradient descent, BP attempts to minimize this error in an iterative process by updating the synaptic weights upon the presentation of a given pattern. The synaptic weights between the last two layers of neurons are updated as:
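The update rule is not shown in this copy. Under the usual gradient-descent derivation from a sum-of-squares error (an assumption, written with the symbols defined in the surrounding text), Eq. (4) would be:

```latex
\Delta w_{ij}(k) = -\eta\, \frac{\partial E}{\partial w_{ij}}
                 = \eta\, \big[z_i(k) - y_i(k)\big]\, g'(h_i)\, s_j(k) \tag{4}
```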
where η is the learning rate that has to be set in advance (a parameter of the algorithm), g′ is the derivative of the sigmoid function and h is the synaptic potential previously defined. The rest of the weights are modified according to similar equations by introducing a set of values called the "deltas" (δ), which propagate the error from the last layer into the inner ones and are computed according to Eqs (5) and (6).
The delta values for the neurons of the last of the N hidden layers are computed as:
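Eq. (5) is missing from this copy. A form consistent with standard back-propagation is given below as an assumption, where the sum runs over the M output neurons i and w_ij connects hidden neuron j to output neuron i; the original may absorb the output derivative differently:

```latex
\delta_j^{N} = g'\big(h_j^{N}\big) \sum_{i=1}^{M} w_{ij}\, \big[z_i - y_i\big]\, g'(h_i) \tag{5}
```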
The delta values for the rest of the hidden layer neurons are computed according to:
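Eq. (6) is likewise absent. The usual recursion, in which each hidden layer l takes its deltas from the next deeper layer l + 1, is assumed:

```latex
\delta_j^{l} = g'\big(h_j^{l}\big) \sum_{i} w_{ij}\, \delta_i^{l+1},
\qquad l = 1, \dots, N-1 \tag{6}
```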
Training and validation processes
The training procedure is executed a certain number of times (epochs) using the training patterns. In one epoch, all training patterns are presented once in random order, adjusting the synaptic weights in an online manner. A well-known and severe problem affecting all predictive algorithms is overfitting, caused by an overspecialization of the learning procedure on the training set of patterns [11]. In order to alleviate this effect, one straightforward alternative is to split the set of available patterns into training, validation and test sets. Applying Eq. (3) to these sets yields the training, validation and generalization error measures, denoted E_tr, E_val and E_gen respectively. The training set is then used to adjust the synaptic weights according to Eq. (4), while the validation set is used to control overfitting: the values of the synaptic weights that have so far led to the lowest validation error are stored in memory, so that when the training procedure ends the algorithm returns the stored set of weights. The test set is used to estimate the performance of the algorithm on unseen data patterns. The generalization ability (Gen), defined as Gen = 1 − E_gen, is a standard measure of the prediction accuracy of an algorithm, reaching its optimal value Gen = 1 when E_gen = 0.
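The training/validation procedure described above can be illustrated with a small software sketch. It is not the on-chip implementation: a single logistic neuron stands in for the network, the error is normalised as a mean (an assumption), and all function names and the toy data are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(w, x):
    # Single logistic neuron: w[0] is a bias, the remaining weights follow.
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))

def mean_error(w, patterns):
    # Mean squared error over a pattern set (the normalisation is an assumption).
    return sum((z - predict(w, x)) ** 2 for x, z in patterns) / len(patterns)

def train_with_validation(train, val, epochs=200, eta=0.2, seed=0):
    rng = random.Random(seed)
    w = [rng.uniform(-0.5, 0.5) for _ in range(len(train[0][0]) + 1)]
    best_w, best_val = list(w), mean_error(w, val)
    for _ in range(epochs):
        order = list(train)
        rng.shuffle(order)                 # random presentation order per epoch
        for x, z in order:                 # online update after every pattern
            y = predict(w, x)
            delta = (z - y) * y * (1.0 - y)
            w[0] += eta * delta
            for j, xj in enumerate(x):
                w[j + 1] += eta * delta * xj
        e_val = mean_error(w, val)
        if e_val < best_val:               # store the weights with the lowest E_val
            best_val, best_w = e_val, list(w)
    return best_w, best_val

# Toy data (hypothetical): a logical-OR-like problem.
train = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
val = [([0.1, 0.0], 0.0), ([0.9, 1.0], 1.0)]
test = [([0.0, 0.1], 0.0), ([1.0, 0.9], 1.0)]

best_w, best_val = train_with_validation(train, val)
gen = 1.0 - mean_error(best_w, test)       # Gen = 1 - E_gen on unseen patterns
print(best_val, gen)
```

The weights returned are those stored at the validation minimum, not those of the final epoch, which is the point of the scheme.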
FPGA layer multiplexing scheme implementation of the BP algorithm
The hardware implementation of the Back-Propagation algorithm is divided in 3 different processes: the computation of the output of the neurons (S values), the calculation of the deltas of each neuron (δ), and the synaptic weight updating procedure. Given the logic of the Back-Propagation algorithm, in which the S values are obtained in a forward manner (from the input towards the output) while the deltas are computed backwards, and that finally the weights updating is executed with the values previously obtained, the three processes are sequentially implemented. It is possible to perform the weight updating phase at the same time that the deltas are computed but we have preferred to separate all three processes to obtain a clearer design.
The S values of every layer are obtained as a function of the S values of the previous layer's neurons, except for those of the first hidden layer, which processes the information of the current input pattern. In contrast, the δ values are computed backwards, i.e., the δ values associated with a neuron belonging to a hidden layer are computed as a function of the δ values of the next deeper hidden layer, except for the last hidden layer, which computes its δ values as a function of the error committed on the current input pattern (cf. Eqs 5-6). The updating process is carried out with the S and δ values of every layer, so these values must be stored when they are computed so that the system can use them when required. Thus, the structure of the Back-Propagation algorithm allows the whole process to be implemented using a layer multiplexing scheme, noting that forward and backward phases should be considered separately, as S and δ values cannot be computed in a single forward phase.
The deep design of the Back-Propagation algorithm is based on a layer multiplexing scheme in which only one layer is physically implemented being reused 3 × N times in order to simulate a whole neural network architecture containing N hidden layers. Figure 1(a) shows the standard design of a feed-forward neural network architecture where it can be observed how the flow of information goes from the input towards the output, in a way that the input of a layer of neurons is the output of the previous layer. In a layer multiplexing scheme, as shown in Fig. 1(b) the same whole process is carried out but by reusing the structure of the single implemented layer.
The implementation of the layer multiplexing scheme requires precise control of which layer is simulated at every moment, and for this reason a register called "CurrentLayer" is used. For each pattern, the process starts with the forward phase, in which the outputs of the neurons are computed in response to the input pattern. This first phase starts by introducing an input pattern in the single multiplexing layer and setting the variable "CurrentLayer" to 1. Then the neurons' outputs are computed, stored in the distributed RAM memory and transmitted back to the input to calculate the outputs of the following layer, and the variable "CurrentLayer" is increased. The same process is repeated sequentially until the "CurrentLayer" value is equal to the maximum number of layers, previously defined by the user and stored in the "MaxLayer" register. When the last layer is reached, the neurons' output is computed together with the error committed in the estimation of the pattern target, and these error values are stored in a register for use in the second phase. The second phase involves the backward computation of the delta values, starting with the delta values of the last layer. Once these values are obtained, they are transmitted backwards to the previous layer in order to compute the delta values for this set of neurons, according to Eq. (6). With these delta values a recurrent process is used to obtain the delta values of the rest of the layers until the input layer values are obtained ("CurrentLayer = 1"). At this point the third phase is carried out in order to update the synaptic weights, finishing one pattern iteration of the process.
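The three phases can be modelled in software as follows. This is a minimal sketch rather than the hardware design: a single `layer` function plays the role of the one physically implemented layer, `current_layer` mirrors the "CurrentLayer" register, and the output layer is treated as one more pass through the same layer, which is an assumption of the sketch. The names `train_pattern`, `weight_mem`, `s_mem` and `delta_mem` are hypothetical.

```python
import math
import random

def g(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(weights, inputs):
    """The single 'physical' layer: one output per weight row."""
    return [g(sum(w * s for w, s in zip(row, inputs))) for row in weights]

def train_pattern(weight_mem, pattern, target, eta=0.2):
    max_layer = len(weight_mem)            # plays the role of "MaxLayer"
    reuses = 0
    # Phase 1: forward, CurrentLayer = 1 .. MaxLayer, storing every S vector.
    s_mem = [pattern]
    for current_layer in range(1, max_layer + 1):
        s_mem.append(layer(weight_mem[current_layer - 1], s_mem[-1]))
        reuses += 1
    # Phase 2: backward, CurrentLayer = MaxLayer .. 1, storing every delta vector.
    out = s_mem[-1]
    delta_mem = [None] * max_layer
    delta_mem[-1] = [(z - y) * y * (1.0 - y) for z, y in zip(target, out)]
    reuses += 1
    for current_layer in range(max_layer - 1, 0, -1):
        s = s_mem[current_layer]
        next_w = weight_mem[current_layer]      # weights of the deeper layer
        next_d = delta_mem[current_layer]       # deltas of the deeper layer
        delta_mem[current_layer - 1] = [
            s[j] * (1.0 - s[j]) * sum(next_w[i][j] * next_d[i] for i in range(len(next_d)))
            for j in range(len(s))
        ]
        reuses += 1
    # Phase 3: weight update, one pass per layer.
    for current_layer in range(1, max_layer + 1):
        rows, prev = weight_mem[current_layer - 1], s_mem[current_layer - 1]
        for i, row in enumerate(rows):
            for j in range(len(row)):
                row[j] += eta * delta_mem[current_layer - 1][i] * prev[j]
        reuses += 1
    error = sum((z - y) ** 2 for z, y in zip(target, out))
    return error, reuses

rng = random.Random(1)
sizes = [4, 5, 5, 3]                       # 4 inputs, two hidden layers of 5, 3 outputs
weight_mem = [[[rng.uniform(-0.5, 0.5) for _ in range(a)] for _ in range(b)]
              for a, b in zip(sizes, sizes[1:])]
x, z = [0.1, 0.4, 0.2, 0.7], [1.0, 0.0, 0.0]
first_error, reuses = train_pattern(weight_mem, x, z)
for step in range(200):
    last_error, reuses = train_pattern(weight_mem, x, z)
print(first_error, last_error, reuses)
```

Each call passes through the layer three times per simulated layer, once per phase, which corresponds to the reuse count of the multiplexing scheme.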
In the following subsections details of the hardware implementation are given. Figure 2 shows the general structure of the hardware implementation of the Back-Propagation algorithm. The whole structure has been divided in two main modules in order to separate the communication protocol and the memory resources of the FPGA (external block) from the module where the implementation of the algorithm is carried out composed by the Architecture and Control blocks.
Hardware implementation design
The External block logic depends specifically on the type of communication protocol chosen for receiving the pattern data set and transmitting the synaptic weights of the resulting model. Implementing the present algorithm in different applications and boards requires the use of a different external block, so instead of giving specific details about it, we have preferred to describe its functionality, which might be more helpful for future implementations.
In our case, the communication between the PC and the FPGA was handled using a serial communication protocol through the RS-232 port of the board in order to manage the exchange of information. The reason for this choice is that it can be implemented in VHDL and ported to other architectures quite easily in comparison to other possibilities.
The functionality of the external block is separated in two different processes: The first one was in charge of storing and managing the input training data set in order to present a different pattern every time that the control block requires it; while the second process was used for storing the synaptic weights once the learning process finishes (see Fig. 2 ). The hardware implementation of these two processes involves taking into account a series of signals between the blocks that are described in the appendix.
The internal module computes the neural model output and modifies the synaptic weights according to the training data presented to the network architecture. This module carries out the whole process of the algorithm and is composed by the control and architecture blocks described below in Sections 3.1 and 3.2 respectively.
Control block
The control block organizes the whole information flow process within the FPGA board by sending and processing the information from the architecture and pattern blocks. The structure of this block is organized around two main processes: i) Network Training: the main function of the control block is to manage two activation signals that indicate whether a training or a validation pattern should be sent to the architecture block. In order to perform this action the control block receives a signal value from the pattern block that indicates the total number of training (#Train) and validation (#Val) patterns set for the training procedure. ii) Validation: a secondary process of the control block regards the use of a validation set for monitoring the training error, in order to control overfitting effects. In essence, this process computes an error value using the validation set of patterns to store the synaptic weight values that have led to the smallest validation error while the training of the network proceeds. At the end of the training phase, this module retrieves the set of weights that had led to the minimum validation error. The implementation of the whole validation process in the FPGA is detailed in Section 2.
When the computations start, the set of training patterns is loaded into the external block, which sends a signal to the control block in order to start the execution of the algorithm. Figure 3 shows a flowchart of the control block operations. At the beginning of the process, a set of counters related to the number of training and validation patterns and the number of epochs are initialized to zero. While the number of training patterns presented in a given epoch is lower than the set number of training patterns (#Train), the training procedure keeps sending a signal to the pattern block indicating that a randomly chosen training pattern should be sent to the architecture block. The architecture block then trains the network, sending back a signal (Ready_Train) to the control block when the training of this pattern finishes, which increases the trained pattern counter Count1. When the value of this counter equals the total number of training patterns, the validation process starts. The previous steps belong to a loop, so they are repeated until the maximum number of epochs (#Epoch) is reached.
Architecture block
Figure 4 shows a scheme of the architecture block, which performs the layer multiplexing procedure by physically implementing a single layer of neurons. This single layer is composed of A neuron blocks implemented in parallel in order to compute the neurons' outputs (S) and the δ values, which will later be used for the update of the synaptic weights. The value of A (limited by the board resources) is the maximum number of neurons for any hidden layer. The neuron blocks manage their own synaptic weights independently of the rest of the architecture, and thus they require a RAM block attached to them. Details of the neuron blocks are described below in Section 3.2. The architecture block also includes memory blocks to store the S and δ values computed for every layer, and also for the different input and output signals that are described below.
The input signals are the pattern to be learned, the signal that indicates a new pattern is introduced (New_pattern), the configuration and control data sets, including also the S and δ values. The configuration data set includes the parameters set by the user to specify the neural network architecture, including the number of hidden layers, the number of neurons in each of these layers, learning parameters, etc. The control data set are signals that the control block needs for managing the process of the algorithm to activate the right procedure in every moment. The output signals comprise the output (S) and the δ values for every layer, the training error of the current pattern, and the ready signals for the validation and training processes which are integrated in the control data set.
The maximum number of layers is determined by the size of the bus used to address the multiplexing layer scheme. In order to reach a compromise between the number of layers and the resources used, we have employed 7 bits in this bus, so the maximum number of layers is 2^7 − 1 = 127. The maximum number of layers is also limited by the resources needed to store the synaptic weights, according to the next equation:
where N_i is the number of inputs, N_1 is the number of neurons in the first layer, N_2 the number in the second layer, and so on.
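The inequality itself is missing from this copy. Assuming it bounds the total number of stored weights, N_i·N_1 + N_1·N_2 + ..., by the available weight memory, both limits can be checked with a short sketch. The names below and the `weight_capacity` argument are hypothetical, since the actual capacity is not stated in this text.

```python
ADDRESS_BITS = 7  # width of the layer-select bus given in the text

def max_layers(address_bits=ADDRESS_BITS):
    # Largest layer count addressable by the bus: 2^bits - 1.
    return 2 ** address_bits - 1

def weights_required(n_inputs, layer_sizes):
    # One weight per connection between consecutive layers.
    sizes = [n_inputs] + list(layer_sizes)
    return sum(a * b for a, b in zip(sizes, sizes[1:]))

def fits(n_inputs, layer_sizes, weight_capacity):
    # weight_capacity is a hypothetical figure for the available weight storage.
    return (len(layer_sizes) <= max_layers()
            and weights_required(n_inputs, layer_sizes) <= weight_capacity)

# Iris-like example: 4 inputs, two hidden layers of 5 neurons, 3 outputs.
print(max_layers())                      # 127
print(weights_required(4, [5, 5, 3]))    # 4*5 + 5*5 + 5*3 = 60
```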
Neuron block
Each of the A neuron blocks manages its own synaptic weights, computing the S and δ values involved in the Back-Propagation pattern updating procedure. The word length to be chosen for representing the synaptic weights would depend on the available resources, taking into account that obtaining a higher accuracy requires a larger representation, which will imply an increase in the number of LUTs per neuron (consequently a reduced number of available neurons) and a decrease in the maximum operation frequency of the board. A synaptic weight is represented by a bit array Table 2 , while the maximum number of neurons is 60 (see Section 5) .
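The trade-off between word length and accuracy can be illustrated with a generic signed fixed-point conversion. The bit widths used below are purely hypothetical and are not the ones adopted in the design; the function names are likewise illustrative.

```python
def to_fixed(value, frac_bits, total_bits):
    """Round to a signed fixed-point integer code, saturating at the range limits."""
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    return max(lo, min(hi, round(value * scale)))

def from_fixed(code, frac_bits):
    """Convert an integer code back to the real value it represents."""
    return code / (1 << frac_bits)

weight = 0.3
for total_bits, frac_bits in [(8, 4), (16, 12)]:   # hypothetical word lengths
    code = to_fixed(weight, frac_bits, total_bits)
    print(total_bits, code, abs(from_fixed(code, frac_bits) - weight))
```

A longer word reduces the rounding error of each stored weight, at the cost of more logic per neuron, as noted above.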
The implementation of the neuron blocks has been performed by dividing all the involved processes into five main sub-blocks (Multiplier, Weight, S, δ and Update blocks). We describe below the detailed implementation of each of these sub-blocks.
Multiplier block
The multiplier block computes the multiplication operations involved in the Back-Propagation algorithm, mainly between neuron activations and synaptic weight values (see Eqs (1)-(6)). An efficient implementation of this operation is crucial in order to optimize the board resources. A time-division multiplexing scheme has been developed for an efficient use of the resources, using only one multiplier per neuron and thus performing the computation of several products sequentially [34]. Multipliers can be implemented with shifters and adders, following the approach presented in [4], or with the specific DSP cores available in the FPGA. The DSP-based strategy has been selected because the system frequency in the FPGA can be up to four times faster. The DSP runs at twice the frequency of the neuron block, so that a product operation can be completed in one operation cycle of the FPGA. Figure 5 shows the multiplier block. A "state" signal indicates which of the processes (S, δ or updating) is being executed at a given moment, and two multiplexers select the correct values for a DSP multiplier which, synchronized with a clock signal, sends the multiplication result to the rest of the blocks.
Weight block
Each of the weight blocks attached to every neuron is in charge of writing and reading the synaptic weights using a single distributed RAM memory module. The memory module has three inputs (W/R, Addr, and Value) managed by two multiplexers controlled by the signal "state". The first input (W/R) decides which action to carry out (write or read), the second input (Addr) specifies the memory address, and the third input is the value to store in case of a write operation. The first multiplexer (the bottom one in the figure) allows writing (W/R = 1) only when the "state" signal is Update; otherwise only reading (W/R = 0) is possible. The second multiplexer (the one on top) selects the address used for the read/write operation according to the "state" signal (S, δ, U). The memory module also runs at twice the frequency of the neuron block in order to complete the operation in one cycle of the FPGA.
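The behaviour of the two multiplexers can be modelled as below. This is a behavioural sketch only: the class name, the method signature and the state labels "S", "delta" and "U" are illustrative choices, not signals taken from the design.

```python
class WeightBlock:
    """Behavioural model of one neuron's weight memory (hypothetical names)."""

    def __init__(self, size):
        self.ram = [0.0] * size

    def access(self, state, addr_s, addr_delta, addr_u, value=None):
        # Address multiplexer: "state" selects which process drives Addr.
        addr = {"S": addr_s, "delta": addr_delta, "U": addr_u}[state]
        # W/R multiplexer: writing is enabled only in the Update state.
        write_enable = 1 if state == "U" else 0
        if write_enable:
            self.ram[addr] = value
        return self.ram[addr]

block = WeightBlock(4)
ignored = block.access("S", 1, 0, 0, value=9.0)   # read only: the value is not stored
written = block.access("U", 0, 0, 2, value=0.5)   # Update state: write at address 2
read_back = block.access("delta", 0, 2, 0)        # the delta process reads address 2
print(ignored, written, read_back)
```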
S block
The S block (see the right part of Fig. 5) computes the output of a neuron as a function of the outputs of the previous layer, which are introduced by the signal "Vec_S". The FSM (Finite State Machine) of the block manages the steps required by the process (see the top of the figure of the S block). When the S process starts, the control block activates the signal "Enable_S" and the FSM changes from its stand-by state (A = 0) to the A = 1 state.
In state A = 1 the block is in charge of computing the sum of the products of synaptic weights and input values (Eq. (1)). To perform this operation, two synchronous counters (indicated by #1 and #2 in the figure) are used together with a set of logic elements for performing the summation of product values. The signal "Index" of the adder #2 selects the corresponding S_j value and the memory address ("MemAddr_S") of the synaptic weight w_ij ("Weight"). These selected values are sent in each cycle to the Multiplier block using the signals "Inp1Mul_S" and "Inp2Mul_S" for the w_ij and S_j values respectively. The result of the multiplication is returned through the signal "OutMul", and the counter #1 computes the summation of the multiplications until the "Index" signal is equal to the number of neurons in the previous layer (Index = NuexLayer). At this point the "h" value (synaptic potential) has been computed and the FSM changes to the A = 2 state.
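The accumulation performed in state A = 1, together with the table-based sigmoid that the S block applies in the following state, can be sketched in software. This is an illustrative model, not the VHDL design: the table range, the number of entries and all names are assumptions, and the actual interpolation scheme is the one detailed in Ref. [34].

```python
import math

# Hypothetical table layout: 65 samples of the logistic function on [-8, 8].
TABLE_MIN, TABLE_MAX, TABLE_STEPS = -8.0, 8.0, 64
STEP = (TABLE_MAX - TABLE_MIN) / TABLE_STEPS
SIGMOID_LUT = [1.0 / (1.0 + math.exp(-(TABLE_MIN + k * STEP)))
               for k in range(TABLE_STEPS + 1)]

def synaptic_potential(weights, vec_s):
    """State A = 1: one product per cycle, accumulated over the previous layer."""
    h = 0.0
    for index in range(len(vec_s)):                 # "Index" selects w_ij and S_j
        out_mul = weights[index] * vec_s[index]     # single shared multiplier
        h += out_mul
    return h

def sigmoid_lut(h):
    """State A = 2: table lookup plus linear interpolation between neighbours."""
    if h <= TABLE_MIN:
        return SIGMOID_LUT[0]
    if h >= TABLE_MAX:
        return SIGMOID_LUT[-1]
    pos = (h - TABLE_MIN) / STEP
    k = int(pos)                                    # index of the lower table entry
    frac = pos - k                                  # position between the two entries
    return SIGMOID_LUT[k] + frac * (SIGMOID_LUT[k + 1] - SIGMOID_LUT[k])

h = synaptic_potential([0.5, -1.0], [1.0, 0.25])
print(h, sigmoid_lut(h), 1.0 / (1.0 + math.exp(-h)))
```

Printing the interpolated and exact values side by side shows the approximation error introduced by the table for this choice of step.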
When the FSM is in the A = 2 state, the S block computes every neuron's output by applying the transfer function (a sigmoid function in this case) to the synaptic potential previously obtained. This procedure, which on an FPGA is not as straightforward as on a standard PC, is carried out using a lookup table containing tabulated values of the function plus a linear interpolation scheme. Further details of this procedure are given in Ref. [34]. Once the S value is obtained, the error for the estimation of the current pattern is computed, a ready signal ("Ready_Tf") is activated and the FSM changes its state to A = 3. In the last state (A = 3), the output signals of the block ("Error", "S" and "Ready_S") are stored in three registers for their further use by other processes.
δ block
Figure 6 shows the δ block, which is in charge of computing the δ values of the current layer. The δ process begins with the activation of the signal "Enable_δ", and the FSM of the δ block switches from the initial inactive state (A = 0) to A = 1. In this state, the summation involved in the delta process (Eqs (5) and (6)) is computed. This process requires a crossed memory access, since the δ block of a neuron needs the synaptic weight values of other neurons, and for this reason a decreasing counter (#1) is used. A second counter (#2) selects the memory address ("MemAddr_δ") using the "Index" signal, in a similar way to that indicated for the S block.
The corresponding δ values (δ_i^{l+1}) and the synaptic weights (w_ij) (see Eqs (5) and (6)) are chosen according to the value of the "Index_δ" signal of the counter #1 using the signals "Vec_δ" and "Vec_weight". The δ and synaptic weight values are sent to the multiplier block by the "Inp1Mul_S" and "Inp2Mul_S" signals respectively in order to compute the required multiplication for the sum of products, which is carried out in the synchronous counter #3 by the "OutMul" signal. This operation finishes when the "Index" signal is equal to the identifier of the neuron ("N"), and at this moment the FSM changes its state from A = 1 to A = 2.
When the FSM is in state A = 2, the δ block computes the derivative of the function (S′_j = S_j · (1 − S_j)). The right value of S_j is selected using the N signal value from the "Vec_S" signals, which contain the S values of all neurons. The computations in this state can be done in only one clock cycle, and thus in the next cycle the FSM switches its state to A = 3.
The state A = 3 calculates the multiplication between the result of the summation and the derivative. This procedure is carried out by the multiplier block through the "Inp1Mul_δ" and "Inp2Mul_δ" signals, and the result is returned by the "OutMul" signal. In the last state (A = 4), the output signals of the block ("δ" and "Ready_δ") are stored in three registers for further usage.
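The states of the δ block for one hidden neuron can be summarised in a few lines of software. The sketch assumes that `vec_weight_next[i][n]` holds the weight from neuron n of the current layer to neuron i of the deeper layer, which is one way to model the crossed memory access; the function and argument names are hypothetical.

```python
def delta_block(n, vec_s, vec_delta_next, vec_weight_next):
    """Delta of neuron n in the current layer (behavioural sketch)."""
    # State A = 1: sum of products over the neurons i of the deeper layer.
    acc = 0.0
    for index in range(len(vec_delta_next)):
        acc += vec_weight_next[index][n] * vec_delta_next[index]
    # State A = 2: derivative of the logistic function from the stored output.
    s_prime = vec_s[n] * (1.0 - vec_s[n])
    # State A = 3: product of the summation and the derivative.
    return acc * s_prime

# One neuron with output 0.5 feeding two deeper neurons.
d = delta_block(0, [0.5], [0.1, -0.2], [[1.0], [2.0]])
print(d)   # (1.0*0.1 + 2.0*(-0.2)) * 0.25, approximately -0.075
```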
Update block
The update block is in charge of modifying the synaptic weights for the whole architecture after an iteration of the algorithm has been carried out. Figure 6 shows the Update block, in which a 4-state FSM can be observed. The state A = 0 is the resting state, which is left when the signal "Enable_U" is active. Inside a loop, the states A = 1 and A = 2 of the FSM are used for reading the current synaptic weights and writing the updated ones for all simulated hidden layers (Layer = NumLayer). The FSM switches to state A = 3 when the whole process is finished.
In state A = 1, the Update block requests every synaptic weight using a counter that generates the signal "Index" for reading the stored synaptic weights, in an analogous way to that explained before for the S block. Also, in this state the multiplication between S_i and δ_i (see Eq. (4)) is performed. Both values are selected from their respective vectors "Vec_S" and "Vec_δ" using two multiplexers controlled by the "Index" signal.
When the FSM is in state A = 2, the multiplication between the learning rate (η) and the result of the multiplication obtained in the previous state is computed. This result is added to the current synaptic weight value to obtain the updated value. In order to store these values, the Update block must activate a write/read signal ("W/R_U") to activate the input W/R of the memory module. In the last FSM state (A = 3), the Update block activates a signal indicating that the whole process is completed ("Ready_U").
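The read-multiply-write loop of states A = 1 and A = 2 can be sketched for the weights of a single neuron as follows. The function name and arguments are hypothetical, and floating-point values stand in for the fixed-point words used on the board.

```python
def update_block(weights, vec_s, delta, eta):
    """Apply one update to every weight of a neuron (behavioural sketch)."""
    for index in range(len(weights)):
        product = vec_s[index] * delta                 # state A = 1: S times delta
        increment = eta * product                      # state A = 2: times learning rate
        weights[index] = weights[index] + increment    # write back (W/R_U active)
    return weights

updated = update_block([0.0, 1.0], [1.0, 0.5], delta=0.5, eta=0.2)
print(updated)   # approximately [0.1, 1.05]
```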
Results
We present in this section results from the implementation of the Back-Propagation algorithm on a Xilinx Virtex-5 board. Table 2 shows some characteristics of the Virtex-5 XC5VLX110T FPGA, indicating its main logic resources. VHDL (VHSIC Hardware Description Language) [2, 4] was used for programming the FPGA, under the "Xilinx ISE Design Suite 12.4" environment using the "ISim M.81d" simulator. The system operating frequency was increased from the 100 MHz board oscillator frequency to 200 MHz through the use of a PLL, as the efficiency of the code allowed this configuration.
To verify the correct FPGA implementation of the model, several test cases were analyzed, comparing the results with those obtained from C and Matlab implementations and with previously published results. In particular, to assess the advantages of using an FPGA board, we compared the results for several network architectures against C and Matlab code executed on the Picasso cluster, which belongs to the Spanish Supercomputing Network. The cluster is formed by a set of computation nodes unified behind a single Slurm queue system, consisting mainly of 7 HP DL980 nodes with 80 cores and 2 TB of RAM, 32 HP SL230 nodes with 16 cores and 64 GB of RAM, 42 HP DL165 nodes with 24 cores and 96 GB of RAM, and 16 HP SL250 nodes with 2 GPUs each, totalling 63 TFLOP/s. Our own C and Matlab code was used for the comparison, noting that C is considered among the fastest programming languages that can be used on a PC [7, 16], while Matlab is a language optimized for operations involving matrices and vectors, which are useful for neural network implementations [27, 42]. All the tests were carried out using a 50-20-30 split for the training, validation and generalization sets respectively, with the learning rate (η) fixed to 0.2, and using data from the well-known Iris set [28]. The generalization set contains only patterns not used during the learning process, and it is used to test the prediction capacity of the algorithm, known as Generalization ability (Gen).
Figure 7 shows the evolution of the training (E_tr) and validation (E_val) errors for the FPGA and the multicore (MC) cluster based implementations. The architectures used contained one (a), two (b) and three (c) hidden layers, with five neurons in every hidden layer and three neurons in the output layer, corresponding to the three classes of the Iris problem.
In all three graphs, two vertical lines indicate the time at which the minimum of the validation error is obtained, which is the point at which the Generalization ability (Gen) is measured for both implementations (the obtained values are also indicated in the graph). The error curves show somewhat larger oscillations for the FPGA implementation, which are due to rounding effects caused by the size of the fixed-point representation used. These oscillations do not degrade the level of prediction accuracy obtained; on the contrary, in some cases they even lead to larger values, as has been observed previously in FPGA implementations [31, 33] and in several works concluding that a certain level of noise might be beneficial for improving learning times, fault tolerance and prediction accuracy [1, 15, 26]. Table 3 shows the generalization ability obtained for several architectures with different numbers of hidden layers for the FPGA and MC implementations. The first column indicates the number of hidden layers present in the architecture, and the second column shows the generalization obtained using the MC implementation (mean and standard deviation computed over 100 independent runs using C code), while the third and fourth columns show the results for two different FPGA implementations: the layer multiplexing scheme proposed in this work and the fixed layer scheme utilized in Ref. [31] (only available for architectures with one and two hidden layers). The number of neurons in each of the hidden layers was fixed to five and the number of epochs was set to 1000. The results clearly show that for architectures with 15 or more hidden layers the generalization ability is greatly reduced, reaching the value expected from random guessing in a problem with three classes.
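The rule of measuring Gen at the minimum of the validation error can be sketched as follows. This is a minimal Python illustration and not the on-chip procedure itself; the function name and the per-epoch values are hypothetical and are not taken from Fig. 7.

```python
def generalization_at_validation_minimum(val_errors, gen_accuracies):
    """Return (epoch, Gen) at the first epoch where the validation
    error reaches its minimum. Both lists hold one value per epoch."""
    best_epoch = min(range(len(val_errors)), key=val_errors.__getitem__)
    return best_epoch, gen_accuracies[best_epoch]

# Hypothetical per-epoch values, for illustration only.
e_val = [0.40, 0.25, 0.18, 0.21, 0.30]
gen = [0.60, 0.85, 0.93, 0.91, 0.88]
print(generalization_at_validation_minimum(e_val, gen))  # (2, 0.93)
```

Because the selection depends only on the validation error, the generalization set plays no role in choosing the stopping point.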
From the results shown in Table 3 it can be seen that the generalization values obtained are similar for the three implementations considered. Regarding the number of hidden layers present in the neural architectures, the performance of the BP algorithm is relatively stable for architectures with up to 5 hidden layers, the point from which the generalization accuracy starts to decrease, reaching the level expected for random choices when the number of layers equals 15. Figure 8 shows the computation times (in μs, on a logarithmic scale) and the number of clock cycles (#cc) involved in the three processes related to the operation of the BP algorithm for the FPGA, MC-C, and MC-Matlab implementations: the Output, Delta and Updating processes when learning a single pattern. The graph shows the results for one, ten and twenty hidden layers with five neurons per layer. The expressions shown on top of the figure for the number of clock cycles [...]. Figures 9(a) and 9(b) show execution times for the complete BP learning procedure (in seconds, on a logarithmic scale, for 1000 epochs) for the FPGA implementation against MC-C (Fig. 9(a)) and MC-Matlab (Fig. 9(b)) as a function of the number of neurons for architectures with one and ten hidden layers (this number of neurons is the same for all layers). Both graphs also show the number of times that the FPGA implementation is faster than the MC one for C and Matlab code respectively (see the right y-axis scale). We also show in Fig. 10 the number of times that the FPGA implementation is faster than the MC-C (Fig. 10(a)) and MC-Matlab (Fig. 10(b)) codes as a function of the number of layers in the deep architectures, for different numbers of neurons in each of these layers.
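The two quantities plotted in Figs. 8 to 10 are related by simple arithmetic, sketched below in Python. The 200 MHz clock is the operating frequency stated earlier, so one clock cycle lasts 5 ns; the function names and the example inputs are hypothetical and do not reproduce measured values.

```python
CLOCK_HZ = 200e6  # system operating frequency after the PLL (200 MHz)

def cycles_to_microseconds(n_cycles, clock_hz=CLOCK_HZ):
    """Convert a clock-cycle count (#cc) into a time in microseconds."""
    return n_cycles * 1e6 / clock_hz

def speedup(reference_seconds, fpga_seconds):
    """Number of times the FPGA is faster than a reference implementation."""
    return reference_seconds / fpga_seconds

# Illustrative inputs only: 1000 cycles at 200 MHz take 5 microseconds.
print(cycles_to_microseconds(1000))
print(speedup(reference_seconds=2.7, fpga_seconds=0.1))
```

A speedup defined this way is dimensionless, which is why it can share a graph with the execution times through a second y-axis.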
To obtain a fair comparison between the FPGA and the PC, the computation time of the FPGA was measured without taking into account the communication time, since this time can change depending on the communication protocol used. Thus, the computation time is measured from the moment the input data have been completely sent until the neural model has been computed in the FPGA.
From the results shown in Figs. 9(a) and 10(a), it can be seen that the number of times that the FPGA implementation is faster than the MC-C one increases linearly with the number of neurons in each of the layers, reaching 27 times when 60 neurons are used in each layer; these values remain constant for different numbers of hidden layers. With respect to the MC-Matlab implementation, the advantage of using the FPGA decreases as the number of neurons in each layer increases. This effect can be explained because Matlab uses matrix-based computations that are more efficient for heavier computations, although the number of times that the FPGA is faster than MC-Matlab converges asymptotically to approximately 60.
To test the correct implementation of the deep learning scheme of the BP algorithm in the FPGA board, we measured training, validation and test errors on a set of benchmark problems from the UCI database [28] that are frequently used in the literature. Table 4 shows the accuracy (generalization ability) values obtained for both implementations of the algorithm on eleven benchmark problems. The first three columns indicate the data set name, number of inputs and number of outputs respectively, while the last two columns show the generalization ability obtained using neural network architectures with 5 neurons in a single hidden layer. This choice of number of neurons permits the comparison with published results [31]. For the simulations, the data were split into training, validation and test sets following a 50-20-30% scheme, in which the validation set was used to find the number of epochs at which the test error is evaluated; the maximum number of epochs was set to 1000, and the learning rate was equal to 0.2. A total of 100 independent runs were computed for each benchmark data set, and the average and standard deviation of the obtained results are reported in the table. The results indicate a correct functioning of the algorithm, noting that the small observed differences can be related to the methodology of computation and to the different number representations used in the two analyzed cases.
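The reporting of an average and standard deviation over independent runs can be sketched in Python as follows. The accuracies below are synthetic placeholders and not results from Table 4, and the use of the sample standard deviation is an assumption, since the text does not specify which estimator was used.

```python
import random
import statistics

def summarize_runs(accuracies):
    """Return (mean, sample standard deviation) of per-run accuracies."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)

# Placeholder data: 100 synthetic accuracies, NOT results from the paper.
rng = random.Random(0)
synthetic_runs = [rng.uniform(0.90, 0.98) for _ in range(100)]
mean_acc, std_acc = summarize_runs(synthetic_runs)
print(f"{100 * mean_acc:.2f} +/- {100 * std_acc:.2f} %")
```

In an actual experiment each entry would be the Gen value of one independent run, taken at that run's validation minimum.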
Discussion and conclusions
We have successfully implemented the Back-Propagation algorithm in an FPGA board using a novel layer multiplexing on-chip learning scheme that includes a validation procedure in order to prevent overfitting effects. The layer multiplexing scheme makes it possible to simulate a neural network with several hidden layers while physically implementing only a single hidden layer of neurons. The main advantage of this approach is that very deep neural network architectures can be analyzed through a simple and flexible framework with very efficient resource utilization. A modular design has been used in a hardware implementation that incorporates strategies such as multiplexing of the multipliers, optimized memory access and efficient data type representation, with the aim of producing a flexible and resource-efficient tool for the study of multi-layer neural architectures.
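The idea behind layer multiplexing can be illustrated in software with a minimal Python sketch of the forward (Output) pass only. It is not the VHDL design: the sigmoid activation, the function names, the reuse of the same layer for the output, and the all-zero weights are assumptions chosen so that the example is easy to check by hand.

```python
import math

def physical_layer(inputs, weights, biases):
    """The single layer of neurons that exists 'physically'.
    Each neuron applies a sigmoid (assumed here) to its weighted sum."""
    return [
        1.0 / (1.0 + math.exp(-(b + sum(w * x for w, x in zip(row, inputs)))))
        for row, b in zip(weights, biases)
    ]

def multiplexed_forward(inputs, layer_params):
    """Reuse the same physical layer once per logical layer, loading a
    different set of stored weights on each pass."""
    activations = inputs
    for weights, biases in layer_params:
        activations = physical_layer(activations, weights, biases)
    return activations

def zero_layer(n_inputs, n_neurons):
    """Hypothetical helper: all-zero weights and biases for one logical layer."""
    return [[0.0] * n_inputs for _ in range(n_neurons)], [0.0] * n_neurons

# Iris-like shape: 4 inputs, three hidden layers of 5 neurons, 3 outputs.
params = [zero_layer(4, 5), zero_layer(5, 5), zero_layer(5, 5), zero_layer(5, 3)]
outputs = multiplexed_forward([5.1, 3.5, 1.4, 0.2], params)
print(outputs)  # [0.5, 0.5, 0.5], since sigmoid(0) = 0.5
```

In this sketch, adding depth only adds entries to `params`, that is, stored weights, while the code that evaluates a layer stays the same; this mirrors the resource argument made above.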
In terms of computational times, the implementation has been tested and compared to multicore (MC) C and Matlab codes executed in a 97-node supercomputer. In comparison to the MC-C code, the number of times that the FPGA implementation is faster increases linearly with the number of neurons in each of the layers, while remaining almost constant for different numbers of hidden layers, and reaches a value of 27 when 60 neurons are included in each of the hidden layers. The same comparison between the FPGA and MC-Matlab implementations shows a different behaviour, as the advantage of the FPGA decreases as the number of neurons in each layer increases. This advantage also decreases slightly as the number of layers is increased, but in all analyzed cases it remains larger than 58.8 times.
The layer multiplexing scheme used permits in principle the simulation of networks with any number of hidden layers, but due to the design of the memory resources the maximum number of layers in the current implementation is 127. Regarding the maximum number of neurons allowed in each of the hidden layers, the hardware resources of the FPGA board used (Virtex-5 XC5VLX110T) impose a limit of 60 neurons. The results obtained confirm the degradation of the Back-Propagation algorithm for very deep architectures comprising 15 or more hidden layers, as almost random behavior is obtained for deeper networks (see Table 3). Understanding and improving the training of deep architectures is a major current challenge, and we believe that the present work may contribute to their understanding, as we have introduced a flexible tool for carrying out this analysis, which we plan to tackle in the near future. Further, it is worth noting that so far FPGAs have not been widely applied to Deep Learning approaches, and we believe that long development times are to blame. In this sense, we hope that this work can help other researchers in the application of FPGA based approaches, as the intrinsic parallelism of these devices makes them a suitable technology for implementing neuro-inspired models.
- Enable_Store: A pulse when the synaptic weights of the neurons must be stored in the external block.
- End: Active when the process is finished.
- Ready_Sent: A pulse that is sent when the external block has finished sending the synaptic weights to the external device.
- Ready_Store: A pulse that is sent when the external block has finished storing the synaptic weights of every neuron.
