Abstract: This work presents on-chip learning of artificial neural networks on an FPGA multiprocessor system, in which each neuron is implemented on a soft-core processor. To take maximum advantage of the distributed architecture, a pipelined version of the on-line back-propagation algorithm is used, providing a high degree of parallelism between neuron layers and, hence, a higher speed-up relative to a sequential implementation.
INTRODUCTION
Hardware implementations of Artificial Neural Network (ANN) learning can achieve a high degree of parallelism, since weights within the same neuron layer can be updated independently. Although most approaches to learning rely on sequential software implementations, the demand arising from recent higher-dimensional problems, such as those related to Bioinformatics and the Internet, has motivated a new wave of interest in high-performance physical implementations in the last decade [1, 2]. The well-known back-propagation algorithm [3] has been widely used for benchmarking on-chip learning, since it allows the inherent parallelism of the neural network structure to be exploited. One possible approach to parallel implementation is to pipeline updates and information flow between layers in the forward and backward phases [4, 5].
This work presents an FPGA implementation of a multi-layer ANN on a multiprocessor architecture in which each neuron is implemented on a Nios II soft-core processor [6] and shared VHDL components provide communication and data transfer among them. To take maximum advantage of the distributed architecture, the Pipelined On-line Back-propagation (PBP) algorithm [4] is used in such a way that all the neurons of the network work simultaneously, which provides a higher speed-up relative to a sequential implementation.
The algorithm: Without loss of generality, the algorithm is described for a network with one output and one hidden layer. To obtain parallelism among the layers, while the "output-neuron" processes the outputs $h_i[t]$ provided by the "hidden-neurons", the hidden-neurons themselves process the next input sample $x_k[t+1]$. At the beginning of each iteration $t$, the output-neuron feeds back the errors $\delta_i[t-1]$, calculated in the recently completed iteration, and also receives the outputs from the hidden-neurons. This exchange of data requires a synchronization mechanism between layers. Fig. 1 presents a sequence diagram detailing the tasks of each layer during the learning process.
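The timing described above can be sketched as a simple schedule. This is only one reading of the description, not the authors' Nios II code: the function name `pipeline_schedule`, the task labels, and the assumption that $h[0]$ is produced in a fill step before iteration 0 are all illustrative.

```python
def pipeline_schedule(num_samples):
    """List the tasks of each layer for iterations t = 0 .. num_samples - 1.

    Assumed timing: at the start of iteration t the layers exchange h[t] and
    delta[t-1]; the output neuron then works on h[t] while the hidden neurons
    apply delta[t-1] and process the next sample x[t+1]. h[0] is assumed to
    be computed in a fill step before iteration 0.
    """
    schedule = []
    for t in range(num_samples):
        # Synchronization point between the two layers.
        exchange = [f"hidden -> output: h[{t}]"]
        if t > 0:
            exchange.append(f"output -> hidden: delta[{t - 1}]")
        # Work done by the hidden neurons after the exchange.
        hidden = []
        if t > 0:
            hidden.append(f"update w_ik with delta[{t - 1}]")
        if t + 1 < num_samples:
            hidden.append(f"forward x[{t + 1}]")
        # Work done by the output neuron, concurrently with the hidden layer.
        output = [f"forward h[{t}]", f"compute delta[{t}]", "update w_ji"]
        schedule.append(
            {"t": t, "exchange": exchange, "hidden": hidden, "output": output}
        )
    return schedule


for step in pipeline_schedule(3):
    print(step)
```

In every iteration except the first and last, both layers have work to do after the exchange, which is where the parallelism between layers comes from.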
According to the algorithm, the weights $w_{ik}[t]$ of the hidden-neurons are therefore updated with a delay of one iteration, according to equation (1), where $e$ is the error, $f'(\cdot)$ is the derivative of the activation function $f(\cdot)$, $w_{ji}$ are the output-neuron weights, $\eta$ is the learning rate, and $u_i$ and $u_j$ are, respectively, the linear outputs of the hidden and output neurons.
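A minimal sketch of a one-iteration-delayed hidden update for a 2-2-1 network follows. It assumes the hidden update has the usual back-propagation form, $\eta \, e \, f'(u_j) \, w_{ji} \, f'(u_i) \, x_k$, with every factor taken from iteration $t-1$. The logistic activation, the bias terms, the class name `DelayedBackprop` and the initial weights are all assumptions made for illustration, not details from the text.

```python
import math


def f(u):
    """Logistic activation (an assumption; the text does not name f)."""
    return 1.0 / (1.0 + math.exp(-u))


def f_prime(u):
    """Derivative of the logistic activation."""
    s = f(u)
    return s * (1.0 - s)


class DelayedBackprop:
    """2-2-1 network whose hidden weights are updated one iteration late."""

    def __init__(self, w_hidden, w_out, eta=0.5):
        self.w_hidden = [row[:] for row in w_hidden]  # w_ik; last column is a bias
        self.w_out = list(w_out)                      # w_ji; last entry is a bias
        self.eta = eta
        self.pending = None  # (delta_i, inputs) computed in iteration t-1

    def iterate(self, x, target):
        inputs = list(x) + [1.0]
        # Hidden forward pass with weights that do not yet include delta[t-1].
        u_i = [sum(w * v for w, v in zip(row, inputs)) for row in self.w_hidden]
        h = [f(u) for u in u_i] + [1.0]
        # Delayed hidden update: apply the deltas fed back from iteration t-1.
        if self.pending is not None:
            delta_prev, inputs_prev = self.pending
            for i, row in enumerate(self.w_hidden):
                for k in range(len(row)):
                    row[k] += self.eta * delta_prev[i] * inputs_prev[k]
        # Output neuron: forward pass, error, hidden deltas, then its own update.
        u_j = sum(w * v for w, v in zip(self.w_out, h))
        e = target - f(u_j)
        delta_j = e * f_prime(u_j)
        delta_i = [delta_j * self.w_out[i] * f_prime(u_i[i]) for i in range(len(u_i))]
        self.w_out = [w + self.eta * delta_j * v for w, v in zip(self.w_out, h)]
        self.pending = (delta_i, inputs)  # fed back at the next iteration
        return e


# Hypothetical initial weights, chosen only to make the example deterministic.
net = DelayedBackprop([[0.5, -0.4, 0.1], [-0.3, 0.6, -0.2]], [0.7, -0.5, 0.1])
before = [row[:] for row in net.w_hidden]
net.iterate((0.0, 1.0), 1.0)  # no delta available yet: hidden weights untouched
unchanged_after_first = net.w_hidden == before
net.iterate((1.0, 1.0), 0.0)  # delta from the first sample is applied now
changed_after_second = net.w_hidden != before
print(unchanged_after_first, changed_after_second)
```

The sketch runs sequentially, so it reproduces only the delay, not the concurrency: the hidden forward pass for sample $t$ is computed before the deltas of sample $t-1$ are applied, as it would be if both layers worked at the same time.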
THE MULTIPROCESSOR ARCHITECTURE
To demonstrate the performance of the pipelined multiprocessor neural network, the classical XOR problem was chosen. The selected network structure has 2 inputs, 2 hidden neurons and one output neuron. Fig. 2 shows a schematic view of the implemented hardware system. Each neuron is implemented on a Nios II embedded processor [6] and, since these are soft-core processors, each one is configured to meet the needs of the application.
To exchange data, the hidden-neurons (CPU 1 and CPU 2) are connected, independently, to the output-neuron (CPU 0) by means of shared VHDL components (forward and backward "memory components"), and semaphore techniques provide the necessary mutual exclusion on their access. An alternative architecture that relies on mutex-protected on-chip memory blocks for data transfer was also implemented, but it resulted in lower performance while demanding more chip space. The shared components are organized in pairs, one pair for each hidden-neuron, in order to distribute the flow of data among processors and reduce transfer overhead. Only hidden-neurons write data to the forward memory components and, likewise, only the output-neuron writes to the backward memory components.
Processors and shared components are embedded on an Altera Cyclone II 2C35 FPGA. However, the instructions, data, stack and heap of each CPU are stored in off-chip memories. To improve parallelism, the hidden-neurons share a DDR SDRAM memory, whereas a faster SSRAM chip is dedicated to CPU 0, since this processor carries a higher processing load. On-chip instruction and data caches (icache and dcache, respectively) are also included in each processor in order to minimize the access time to these external memories, which also reduces conflicts among hidden-neurons.
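The forward and backward components can be modelled in software as single-slot buffers guarded by semaphores. The sketch below uses Python threads in place of the three CPUs. The `SharedComponent` class, the two-semaphore hand-off and the message tuples are assumptions made for illustration; the actual VHDL components and their access protocol are not described in enough detail to reproduce them.

```python
import threading


class SharedComponent:
    """Single-slot buffer guarded by two semaphores (one writer, one reader)."""

    def __init__(self):
        self._empty = threading.Semaphore(1)  # slot is free to be written
        self._full = threading.Semaphore(0)   # slot holds unread data
        self._value = None

    def write(self, value):
        self._empty.acquire()  # wait until the previous value has been read
        self._value = value
        self._full.release()

    def read(self):
        self._full.acquire()   # wait until a value is available
        value = self._value
        self._empty.release()
        return value


def run_exchange(num_hidden=2, iterations=4):
    """Run the data exchange only (no learning) and log what each side receives."""
    forward = [SharedComponent() for _ in range(num_hidden)]   # hidden -> output
    backward = [SharedComponent() for _ in range(num_hidden)]  # output -> hidden
    received = {"output": [], **{f"hidden{i}": [] for i in range(num_hidden)}}

    def hidden_neuron(i):
        for t in range(iterations):
            forward[i].write(("h", i, t))  # publish h_i[t]
            if t > 0:
                received[f"hidden{i}"].append(backward[i].read())  # delta_i[t-1]

    def output_neuron():
        for t in range(iterations):
            for i in range(num_hidden):
                if t > 0:
                    backward[i].write(("delta", i, t - 1))
                received["output"].append(forward[i].read())

    threads = [
        threading.Thread(target=hidden_neuron, args=(i,)) for i in range(num_hidden)
    ]
    threads.append(threading.Thread(target=output_neuron))
    for th in threads:
        th.start()
    for th in threads:
        th.join(timeout=5)
    return received


log = run_exchange()
print(log["hidden0"])
print(log["output"])
```

Giving each hidden-neuron its own pair of buffers means the two hidden threads never contend with each other; each one only waits on the output thread.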
System code and configuration settings are downloaded from a host computer to the FPGA and memory chips, located on a Nios II Development Kit [6], by means of a JTAG UART connection. As soon as training is finished, the same channel is used to save the final weights and the network error history (mean square error per epoch) on the host computer, so that the performance of the system can be evaluated.
RESULTS
The PBP implementation was compared with a standard sequential back-propagation (SBP) running on CPU 0, the most powerful of the three CPUs. Two different metrics were used to compare the pipelined and the sequential implementations. The first is the convergence speed $C_s$, given by the number of epochs completed per unit of time, or $C_s = n_e / t$. The second metric is the degree of parallelism $P_d$ of the implementation, which is derived from the amount of time a hidden neuron is on hold, within one iteration, waiting for the output neuron at the synchronization point, or:
$$P_d = 1 - \frac{t_h}{t}$$
where $t_h$ is the time on hold and $t$ is the total time of the epoch. All the metric results presented were averaged over 10 trials.
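The two metrics can be computed as below. This is a sketch under one assumption: $P_d$ is taken to be the fraction of time a hidden neuron is not on hold, which matches the results reported for this system. The function names and the timing values are made up for illustration.

```python
def convergence_speed(num_epochs, elapsed_seconds):
    """C_s: number of epochs completed per unit of time."""
    return num_epochs / elapsed_seconds


def parallelism_degree(time_on_hold, total_time):
    """P_d, assumed here to be the fraction of time NOT spent on hold."""
    return 1.0 - time_on_hold / total_time


# Made-up timings, for illustration only.
print(convergence_speed(num_epochs=900, elapsed_seconds=5.0))  # 180.0
print(parallelism_degree(time_on_hold=1.25, total_time=5.0))   # 0.75
```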
The best speed-up, i.e. the ratio between the convergence speeds of the distributed and sequential implementations, was measured for a system running at 100 MHz (demanding 16210 logic elements and 34560 bytes of RAM blocks), in which CPU 0 differs from the others only in its dcache size (2x larger). The degree of parallelism obtained was $P_d$ = 0.75 ± 0.01%, which indicates that the hidden neurons were on hold for only 25% of the time. This degree of parallelism resulted in a speed-up:
$$S_{up} = \frac{C_s^{\mathrm{PBP}}}{C_s^{\mathrm{SBP}}}$$
of 2.03 ± 0.03, i.e. the PBP (with a convergence speed of 180.16 ± 1.74 epochs/sec) was about 2x faster than the SBP for the current problem. $S_{up}$ is, of course, expected to rise as the number of neurons in the hidden layer increases to solve more complex problems. Convergence behaviour can be observed for both PBP and SBP in Fig. 3. Although the delay between updates, given by equation (2), caused a non-smooth error curve, it did not affect the overall convergence performance.
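As a quick arithmetic check, the reported speed-up and PBP convergence speed imply an SBP convergence speed that the text does not state directly. The sketch below derives it from the mean values only and ignores the reported uncertainties; the function and variable names are illustrative.

```python
def speed_up(c_s_pbp, c_s_sbp):
    """Ratio between the pipelined and sequential convergence speeds."""
    return c_s_pbp / c_s_sbp


c_s_pbp = 180.16  # epochs/sec, reported for the PBP
s_up = 2.03       # reported speed-up

# Implied (not reported) convergence speed of the sequential implementation.
implied_c_s_sbp = c_s_pbp / s_up
print(round(implied_c_s_sbp, 1))  # 88.7
```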
CONCLUSIONS
The implementation presented provides an alternative co-design topology for embedding adaptive neural-based systems. A multiprocessor architecture allows a better distribution of tasks and an efficient implementation of the PBP algorithm. In addition, working with soft-cores makes system description easier and faster. By providing on-chip learning, the solution is useful for adaptive control and system modeling in real-time applications. Performance and parallelism can be further improved by distributing the output-neuron tasks among co-processors as more hidden-neurons become necessary. The flexibility of the Nios II architecture and its built-in facilities for dealing with concurrent processes pave the way for more complex solutions as more chip space becomes available.
