Abstract - The paper presents a method for the FPGA implementation of Self-Organizing Map (SOM) artificial neural networks with an on-chip learning algorithm. The method builds up a specific neural network using generic blocks designed in the MathWorks Simulink environment. The main characteristics of this original solution are: on-chip learning algorithm implementation, high reconfiguration capability and operation under real-time constraints. An extended analysis has been carried out on the hardware resources used to implement the whole SOM network, as well as each individual component block.
Library design approach
The strategy used to design the neural network library relies on building up a Simulink-integrated hardware-software platform, using Simulink Xilinx blocks, MCode blocks (Matlab files) and Black Box blocks (VHDL files).
The SOM algorithm
The SOM network belongs to the intelligent pattern analysis techniques capable of coping with non-linear data, with advantages over conventional methods (such as linear discriminant analysis, principal component analysis or cluster analysis) that include learning capability, self-organisation, generalisation and noise tolerance. An SOM network consists of a one- or two-dimensional output layer of neurons (multidimensional layers are rare) in addition to an input layer of branched nodes, mimicking the neural structures of the human brain more closely than other neural networks. The basic principle in training the SOM network is to adjust the weights of only those neurons within a specific neighbourhood area, given by the neighbourhood function (Gaussian or Mexican hat, Fig. 1). In this way, the network constructs an internal representation of the environment (application), [17].
Fig. 1. Interaction around the winning neuron as a function of distance
The algorithms followed to develop the appropriate block set of the SOM neural network are described in this section. The SOM algorithm used is based on a modified Gaussian neighbourhood, Euclidean distance and a rectangular topology. The mathematical model of the SOM behaviour is shown schematically below and comprises roughly two major steps, [6]:
1. identification of the winning neuron, i.e. the neuron c whose weight vector lies at the smallest Euclidean distance d from the presented pattern:

d = ‖p(t) − w_c(t)‖₂ = min_i ‖p(t) − w_i(t)‖₂   (1)

2. followed by the weight updating of those neurons which are counted as neighbours of the winning neuron:

w_i(t+1) = w_i(t) + α(t)·h(t)·[p(t) − w_i(t)]   (2)
The neighbour selection is made using the following functions:
a. the neighbourhood function, h(t):

h(t) = exp(−d² / (2β(t)²))   (3)

b. the learning rate function, α(t)   (4)
c. the neighbourhood size computing function, β(t)   (5)
where: p(t) is the presented pattern at epoch t, w i (t) is the weight vector of neuron i at epoch t, 2  is the 2-dimensional Euclidian distance, d is the smallest Euclidian distance between the pattern and neurons,  is the number of epochs in phase 1 (default: 1000), (0) is predefined (default: 0.9) and (0) is calculated based on the network size (for two dimensional case (0) is given by (6)).  is predefined (default: 0.02) and is 1.00001.
In (6), which computes β(0) from the map dimensions, row is the number of neurons in the horizontal direction and col is the number of neurons in the vertical direction.
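To make the two steps concrete, the following is a minimal behavioural MATLAB sketch of one training epoch. It assumes the Gaussian factor of (3) is computed once per epoch from the winner's distance and broadcast as a single α·h value, with a circular neighbourhood of radius Dmax gating the updates, as the hardware blocks described later do; the function name and argument packaging are illustrative, not part of the library.

```matlab
% One training epoch of the implemented SOM (behavioural sketch).
% W: N x M weights (N = kmax^2 neurons, M inputs); p: 1 x M pattern.
% alpha_t, beta_t, Dmax come from the "alpha beta gen" block at epoch t.
function W = som_epoch(W, p, alpha_t, beta_t, Dmax, kmax)
    dist2 = sum((W - p).^2, 2);           % per-neuron squared distances
    [d2, c] = min(dist2);                 % winner and its distance, eq. (1)
    h = exp(-d2 / (2 * beta_t^2));        % neighbourhood factor, eq. (3)
    [ci, cj] = ind2sub([kmax kmax], c);   % winner map coordinates
    for k = 1:size(W, 1)
        [ki, kj] = ind2sub([kmax kmax], k);
        if (ki - ci)^2 + (kj - cj)^2 <= Dmax^2   % inside the circle
            W(k, :) = W(k, :) + alpha_t * h * (p - W(k, :));  % eq. (2)
        end
    end
end
```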
SOM library's blockset design
Choosing the optimal network parameters and training procedures for a given application is always a challenge, and developing customizable blocks for a neural network capable of working under real-time constraints could lead to auto-reconfigurable intelligent systems. The blocks used to build up the libraries are designed in the Simulink environment and use basic computing Xilinx blocks and VHDL (Very High Speed Integrated Circuit Hardware Description Language) code implemented as "black box HDL". The implemented SOM consists of 7 neurons in the input layer and a bi-dimensional map of 25 neurons. The parallelism adopted is node parallelism, which requires managing all the neurons at the same time (Fig. 2).
In order to emulate the SOM architecture in hardware, the following computing blocks have been developed:
- a map control block, designed to control the neurons' behaviour;
- a customizable neuron map block, comprising the neurons of the bi-dimensional map;
- a winner select block, which selects the winning neuron;
- a neighbour select block, which decides whether a neuron lies in the neighbourhood of the winning neuron;
- a we generator block, which sets which neurons' weights will be updated;
- an alpha beta gen block, which generates the learning rate and the neighbourhood size coefficient according to the current epoch.
Fig. 2. The hardware architecture of the neural network
The "map control" block
The block is written in VHDL and implemented through a Black Box block. Its role is to manage the signals that control the neurons' processing elements and the winner select block. It is a customizable block, able to reconfigure itself according to the number of neurons in a layer. The number of neurons is set as a generic parameter in the VHDL code and determines the upper value of the counter that commands, in each controlled processing element, the accumulator (reset and enable control) and the RAM block (address control). These processing elements are comprised in each neuron's structure.
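As an illustration of this sequencing, a behavioural MATLAB sketch follows; the exact cycle-by-cycle timing is an assumption, and n_addr stands for the counter's upper value set by the VHDL generic parameter.

```matlab
% Behavioural sketch of the "map control" sequencing (assumed timing):
% the counter sweeps the weight-RAM addresses once per presented
% pattern, resetting the accumulator at the start of the sweep and
% enabling it while the weights are being read.
function [rst_acc, en_acc, addrW] = map_control(n_addr)
    n_cyc   = n_addr + 1;        % one extra cycle to flush the MAC
    rst_acc = false(1, n_cyc);
    en_acc  = false(1, n_cyc);
    addrW   = zeros(1, n_cyc);
    rst_acc(1) = true;           % clear the accumulator first
    for clk = 1:n_addr
        en_acc(clk) = true;      % accumulate while addressing the RAM
        addrW(clk)  = clk - 1;   % weight address driven this cycle
    end
end
```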
The neuron's architecture
The architecture of the neurons has been chosen to cover both types of neuron behaviour: the learning phase (in which the weights are updated according to the presented pattern) and the propagation phase (in which the neural network acts as a pattern recognition machine). It consists of a RAM block used to store the updated weights, a multiply and accumulate unit (MAC), Fig. 3, or a DSP block, Fig. 4, two multiplexers and an adder block. The multipliers or the DSP modules are implemented either using dedicated blocks, XtremeDSP slices (where these units are available: Spartan 3 and 6, Virtex 4 to 7) or dedicated multipliers, or using distributed logic, [24], [27]. Therefore, the architecture adopted will depend on the hardware resources available in the targeted FPGA.
In order to make the SOM design more user-friendly, each neuron's parameters, such as its weights, memory depth and rank in the map, are initialized using a user-created pop-up window (Fig. 5). Consequently, for a specific FPGA circuit, if the needed resources exceed the number of available XtremeDSP/dedicated multiplier blocks, the neurons will be implemented using the XtremeDSP/dedicated multiplier architecture up to that limit (192 DSPs for the 4VSX35), and the remaining neurons will be implemented using distributed logic resources. To determine the necessary resources for each type of architecture/implementation, neurons have been implemented using multipliers, using XtremeDSP blocks, using distributed logic instead of the DSP blocks, and using distributed logic instead of the RAMB16 and DSP blocks (Table 1). It can be seen that using distributed logic instead of the XtremeDSP increases slice usage by more than 300% and decreases the computation speed by 40 MHz. Therefore, in the following, if the available resources permit, the neuron architecture with XtremeDSP blocks will be used.
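A behavioural MATLAB sketch of one neuron's datapath over a single address sweep is given below; the arithmetic follows (1) and (2), with alpha_h denoting the broadcast alpha*h factor and weW the write enable produced by the we generator block. The packaging into a single function is illustrative only.

```matlab
% One neuron over one address sweep (behavioural sketch):
% the MAC accumulates the squared Euclidean distance between the
% pattern x and the stored weights wRAM; when weW is asserted during
% learning, each weight is nudged towards x by the alpha*h factor.
function [dist2, wRAM] = neuron_step(x, wRAM, alpha_h, weW)
    dist2 = sum((x - wRAM).^2);              % MAC output, to winner select
    if weW                                   % neuron is in the neighbourhood
        wRAM = wRAM + alpha_h * (x - wRAM);  % weight adjustment, eq. (2)
    end
end
```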
The "neuron map" block
The neuron map block consists of a user-customizable number of neurons operating in a node parallelism processing mode (25 neurons in this case). These are arranged in a square bi-dimensional structure of identical neurons (Fig. 6). The commands necessary for neuron management are:
- rst_acc: accumulator reset; resets the accumulator when a new computation starts;
- en_acc: accumulator enable; enables the accumulator while in use;
- addrW: RAM address; provides the address of the weight being updated;
- x: input data; the applied input pattern;
- weW: RAM write enable; asserted for those neurons which are in the winner's neighbourhood;
- alpha*h: the weight adjustment factor;
- learn/propag: the ANN behaviour setting signal; selects the learning or the pattern recognition behaviour.
Fig. 6. Node parallelism processing mode of the neuron map
The resources utilized by the neuron map block as a function of its size (number of neurons) and precision (number of bits used to represent data) are presented in Table 2.
The "winner select" block
Another important block is the "winner select" block. It finds the neuron whose weights are most similar to the presented input, where similarity is measured as the Euclidean distance between the input vector and the weight vector. The "winner select" block consists of a "winner finder" block and a block named "1to2dim" that transforms the rank of the winning neuron from a one-dimensional index into a two-dimensional one. The conversion is necessary because the neighbourhood area is defined in a planar space, while the computations are made with the neurons arranged in a single dimension. The "winner finder" block compares all the Euclidean distances between the input pattern and the neuron weights in order to find the smallest one (Fig. 7). The "1to2dim" block is implemented using Xilinx Blockset components and its role is to determine the bi-dimensional indexes of the winning neuron; the proposed architecture is presented in Fig. 8.
Fig. 7. The architecture of the "winner select" block
Fig. 8. The architecture of the "1to2dim" block
The power consumed, the maximum processing frequency and the resources used to implement the "winner finder" block in FPGA, considering different numbers of neurons and precisions, are presented in Table 3.
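A compact MATLAB sketch of this behaviour, assuming the row-major ordering implied by (9):

```matlab
% Sketch of "winner select": find the smallest distance, then convert
% the winner's linear (zero-based) rank to bi-dimensional map indexes.
function [i, j] = winner_select(dist2, kmax)
    [~, k] = min(dist2);      % "winner finder": smallest Euclidean distance
    k = k - 1;                % zero-based rank, as in the hardware
    i = floor(k / kmax);      % "1to2dim": row index of the winning neuron
    j = mod(k, kmax);         % column index of the winning neuron
end
```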
The "neighbours select" block
This block deals with finding all the neurons within the winning neuron's neighbourhood. The neighbourhood area is calculated as a circle centred on the winning neuron, meaning that the distance from each neuron within the circle to the winning neuron is smaller than the circle radius (D1, ..., Dk ≤ Dmax), Fig. 9. The block calculates the n(ki, kj) indices of the neurons that belong to the winning neuron's neighbourhood, meaning:

ki² + kj² ≤ Dmax²   (7)

where ki and kj are the row and column indices of a neuron's position relative to the circle centre in which the winning neuron is situated.
Fig. 9. Setting the neighborhood area
Because the square map is highly symmetrical, the distance is calculated in only one quarter of the circle. The radius value is set by another block (alpha beta gen) and decreases as the epochs increase. The neighbourhood selection algorithm is implemented in hardware with Xilinx logic blocks and is presented in Fig. 10. The "neighbours select" block also provides the values of "j-kj", "j+kj", "i-ki" and "i+ki", to be compared with the number of neurons on the square map side, i.e. "kmax = (total number of neurons)^1/2", as well as the truth state of the following arithmetic relational operations: "ki=kj=0", "j>=kj", "kj+j<=kmax", "ki^2 + kj^2 <= Dmax^2", "ki+i<=kmax" and "i>=ki". All these values are used in the next block, the we generator block, which finally allows or blocks the updating of a neuron's weights.
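The membership test can be sketched in MATLAB as follows; the packaging of the comparator outputs into a four-element vector (one per mirrored quadrant position) is an assumption for illustration.

```matlab
% Neighbourhood test for one quarter-circle offset (ki, kj) from the
% winner at (i, j): the circle condition (7) is evaluated once and the
% four mirrored positions are validated against the map bounds [0, kmax].
function ok = in_neighbourhood(i, j, ki, kj, Dmax, kmax)
    inside = ki^2 + kj^2 <= Dmax^2;            % within the circle radius
    ok = [ inside && i - ki >= 0    && j - kj >= 0, ...
           inside && i - ki >= 0    && j + kj <= kmax, ...
           inside && i + ki <= kmax && j - kj >= 0, ...
           inside && i + ki <= kmax && j + kj <= kmax ];
end
```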
The power consumed, the maximum processing frequency and the resources used to implement in FPGA the neighbours select block are presented in Table 4 . 
The "alpha beta gen" block
The alpha beta gen block generates the learning rate and the neighbourhood radius as functions of the current epoch. For an efficient implementation of equation (4), the 1/t function is approximated with a piecewise linear function (Fig. 11). The errors introduced by the adopted approximation are related to the number of pieces considered for a certain number of epochs (Fig. 12). As seen in Table 5, the error decreases with the number of linear pieces used to approximate the function, but at the cost of increased hardware resources. Table 5 presents the errors as a function of the number of pieces at 10,000 epochs. Therefore, for 10,000 epochs, the chosen piecewise linear approximation consists of 5 pieces, and the learning rate function for t > τ can be described using equation (8), leading to the hardware-implementable structure described in Fig. 13. The resources used to implement the learning rate block do not depend on the number of neurons but on the number of pieces used to approximate the function, as presented in Table 6.
The learning rate and the neighbourhood size functions, (4) and (5), are implemented by means of customizable computing blocks related to the epoch number. As the SOM network learns in two phases (the ordering and the tuning phase), the architecture uses a comparator block to compare the current epoch with epoch τ (the epoch at which the learning switches from the ordering phase to the tuning phase, with a predefined value of 1,000), Fig. 14. In order to implement (5), the neighbourhood size function is realized by means of two computing blocks that treat each of the following cases: t > τ and t ≤ τ, as shown in Fig. 15. Equation (6) is implemented using Xilinx blocks and its architecture is presented in Fig. 16.
Fig. 13. The learning rate block for t > τ
Fig. 14. Learning rate and the neighborhood computing blocks
Fig. 15. The learning rate computing blocks
Fig. 16. Hardware implementation of the "beta block"
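A MATLAB sketch of such an approximation is given below; the uniform breakpoint placement and the assumed tuning-phase form α(t) = α(τ)·τ/t are illustrative, since equation (8) fixes the actual pieces.

```matlab
% Piecewise linear approximation of the 1/t learning-rate decay used
% for t > tau (valid for t in [tau, t_max]). Breakpoints are placed
% uniformly here for illustration; the paper selects 5 pieces for
% 10,000 epochs (Table 5).
function a = alpha_pwl(t, tau, alpha_tau, t_max, n_pieces)
    tb = linspace(tau, t_max, n_pieces + 1);  % breakpoint epochs
    ab = alpha_tau * tau ./ tb;               % exact 1/t value at each breakpoint
    a  = interp1(tb, ab, t, 'linear');        % linear segments in between
end
```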
The hardware resources used to implement the alpha beta gen block are presented as a function of the network's number of neurons and the maximum number of epochs in Table 7. Analyzing the results, we can say that the hardware resources used to implement the alpha beta gen block depend on neither the number of neurons nor the number of epochs.
The "we generator" block
The we generator block compares the distances calculated by the neighbours select block with the current neighbourhood size and sets or resets the write enable of the corresponding weight RAM addresses according to the position of each neuron in the map (the write enable is set, we = '1', if the neuron is in the winning neuron's neighbourhood and reset if it is not). In order to perform the calculation in a one-dimensional way, the planar indexes (ki, kj) are converted into a linear one, k, using the formula stated in (9):

k = i·kmax + j   (9)

where i is the row index, j is the column index and kmax is the number of neurons per row (or column). The block verifies whether the control signals are logic 0 or logic 1 (logic 1 meaning that the corresponding index is in the winner's neighbourhood). The outcome is a signal bus n bits wide, where n is equal to the number of neurons in the square map (Fig. 17). An ISE report on the utilized resources is presented in Table 8. These values are used for the overall resource estimation given in Table 9.
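A MATLAB sketch of this conversion and bus generation follows; the representation of the accepted neighbour positions as a list of (i, j) pairs is an assumption for illustration.

```matlab
% Sketch of the "we generator": accepted planar positions are mapped
% to linear ranks via eq. (9) and the matching bits of the n-wide
% write-enable bus are raised.
function we = we_generator(neigh_ij, kmax)
    we = false(1, kmax * kmax);        % one write-enable bit per neuron
    for r = 1:size(neigh_ij, 1)        % accepted (i, j) positions, zero-based
        i = neigh_ij(r, 1);
        j = neigh_ij(r, 2);
        k = i * kmax + j + 1;          % eq. (9), +1 for MATLAB indexing
        we(k) = true;                  % this neuron's weights will update
    end
end
```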
The overall hardware implementation
The SOM network with 7 input neurons and 25 output neurons (SOM 7_25) was implemented on the DS-KIT-4VSX35MB-G development board featuring the Xilinx Virtex-4 SX XC4VSX35-FF668 FPGA. The report of the overall device utilization was generated with the Xilinx Integrated Software Environment (ISE) software and is presented in Table 9. By analyzing the hardware implementation reports (Table 2 and Table 9), resource-estimation equations have been developed.
These equations make it possible to estimate a priori the utilized resources, in terms of RAMs, LUTs and DSP blocks, for a specific topology, or to estimate the maximum number of neurons that can be implemented in a specific FPGA circuit. For example, the 4VSX35 FPGA circuit, with 15,360 slices, 30,720 LUTs, 192 RAMB16s and 192 DSPs, can accommodate approximately 135 neurons: 95 neurons implemented with the DSP architecture, utilizing 3,965 of the 30,720 available LUTs, and 42 further neurons implemented using distributed logic, using the remaining 26,755 LUTs. The created Simulink library is presented in Fig. 18. Using the created blocks, a customized, hardware-implementable, on-chip learning SOM network can be developed in a very short time and in a flexible manner.
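The budgeting logic behind this example can be sketched in MATLAB as follows; the per-neuron costs are back-computed from the example figures above, and the two-DSPs-per-neuron assumption is illustrative, not the paper's fitted equations.

```matlab
% Illustrative resource budgeting for the 4VSX35 (assumed costs):
luts_total = 30720;  dsps_total = 192;       % device capacities
luts_per_dsp_neuron  = ceil(3965 / 95);      % ~42 LUTs per DSP-based neuron
luts_per_dist_neuron = ceil(26755 / 42);     % ~637 LUTs per distributed one
dsps_per_neuron = 2;                         % assumed MAC usage per neuron

n_dsp = min(floor(dsps_total / dsps_per_neuron), ...
            floor(luts_total / luts_per_dsp_neuron));
luts_left = luts_total - n_dsp * luts_per_dsp_neuron;
n_dist = floor(luts_left / luts_per_dist_neuron);
fprintf('DSP-based neurons: %d, distributed-logic neurons: %d\n', ...
        n_dsp, n_dist);                      % roughly 96 + 41 neurons
```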
Application
The created ANN is targeted to be used as the pattern recognition module of an artificial olfactory system capable of recognizing four different brands of coffee. The system used for data acquisition includes: seven gas sensors chosen to react to a wide spectrum of odours (TGS family), a temperature and humidity sensor (LM35, SY-HS-230), a test chamber (where all the sensors are mounted), three gas pumps, circuits for sensor conditioning and pump control, a data acquisition board (PCI-MIO-16E-1), the pattern recognition module implemented in an FPGA (Virtex-4 SX MB - 4VSX35) and a user interface developed in LabVIEW 8.2, Fig. 19.
Fig. 19. The architecture of the adopted artificial olfaction system
The data used for training the ANN has been obtained in three stages: i) the baseline calculation (the average voltage drop on the sensors' resistance when the reference gas, air, is applied); ii) and iii) a regular absorption/desorption operation measured at a fixed sampling frequency over a defined time while the odorant is applied (Fig. 20). After the raw data extraction, the data was dimensionally reduced using a feature extraction technique. Usually, this is performed by extracting a single parameter from each sensor (e.g. the steady-state, final or maximum response), disregarding the initial transient response, which may be affected by the dynamics of the odour delivery system. In this case, the extracted features for each gas sensor are: the average value, the maximum value, the maximum slope of the desorption function, the time at the maximum value, the function integral, the integral over the absorption time, the maximum slope of the absorption function, the time at which the maximum slope of the absorption function occurs and the time at which the maximum slope of the desorption function occurs. The resulting matrix is presented in Fig. 21 and constitutes the batch data for the pattern recognition system.
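A MATLAB sketch of this feature extraction for one baseline-corrected sensor response is shown below; the signal layout (a single absorption segment followed by desorption, split at t_des) and the field names are assumptions for illustration.

```matlab
% Extract the listed features from one sensor response v, sampled at
% fs Hz; t_des marks the absorption/desorption boundary.
function f = extract_features(v, fs, t_des)
    t  = (0:numel(v)-1) / fs;                 % time axis
    dv = gradient(v, 1/fs);                   % response slope
    n_des = round(t_des * fs);                % boundary sample index
    f.avg  = mean(v);                         % average value
    [f.peak, ip] = max(v);                    % maximum value ...
    f.t_peak = t(ip);                         % ... and the time it occurs
    f.area     = trapz(t, v);                 % function integral
    f.area_abs = trapz(t(1:n_des), v(1:n_des));      % absorption integral
    [f.slope_abs, ia] = max(dv(1:n_des));     % max absorption slope
    [f.slope_des, id] = min(dv(n_des+1:end)); % steepest desorption slope
    f.t_slope_abs = t(ia);                    % time of max absorption slope
    f.t_slope_des = t(n_des + id);            % time of max desorption slope
end
```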
Conclusions
A novel Simulink-based neural network design strategy has been developed, which benefits from a reduced design time compared with classical design approaches, leading to a low-complexity and easy to implement pattern recognition module. The hardware architecture of a SOM network with on-chip learning, controlled by a generic control unit described in VHDL, has been presented.
The method uses minimal hardware resources for the SOM implementation and confers high modularity and versatility in neural network design. The designed network is generic and can be used to create neural networks with the following features: on-line training, on-chip learning and a user-defined SOM network topology.
As opposed to classical methods, where the network training is performed offline and the implemented designs therefore cannot be changed in real time, a pattern recognition platform implemented in FPGA with learning capability, self-organization and noise tolerance will be able to reconfigure itself in order to obtain the best performance for a given application. A sample case study for an implemented design, based on artificial olfaction systems, is presented in [22], [23].
A possible drawback of the method is its relatively limited portability. Given that the SOM library is embedded into the System Generator/Simulink/Matlab environment, a design can be implemented only on Xilinx FPGA families. In order to overcome this impediment, all library components would have to be redesigned in stand-alone languages such as VHDL, Handel-C or C. However, in such cases the main benefit of a user-friendly design environment could be lost.
