Abstract: A new hardware implementation of the triangular neighborhood function (TF) for ultra-low-power self-organizing maps (SOMs) is presented. Simulations carried out with a software model of this network show that even low signal resolutions (3-6 bits) do not degrade the performance of the network. The resolution of the signal at the output of this block has a dominant influence on both the circuit complexity and the energy consumption. The proposed mechanism is very fast: for a neighborhood range of 15 the delay in the circuit equals 20 ns, which allows for data rates of 20-40 MHz, even for large maps with several hundred neurons.
I. INTRODUCTION
Self-organizing maps (SOMs) have been broadly described in the literature. Different networks of this type, with different learning rules, have been proposed. One of them is the Kohonen SOM, which is referred to as the classical approach [1]. This type of SOM is trained according to the following formula:

$$W_j(l+1) = W_j(l) + \eta(k)\, G(R, d(i,j))\, [X(l) - W_j(l)] \qquad (1)$$

where η(k) is the learning rate in the k-th training epoch, W_j are the weight vectors of particular neurons in the map, and X is the input training pattern in the l-th cycle. The neurons that belong to the winner's neighborhood are trained with different intensities, which depend on the neighborhood function G() of the topological distance, d, between the winning neuron and the other neurons in the map. In the classical approach a simple rectangular function is used, defined as [1, 2]:

$$G(R, d(i,j)) = \begin{cases} 1 & \text{for } d(i,j) \le R \\ 0 & \text{for } d(i,j) > R \end{cases} \qquad (2)$$
where d(i, j) is the topological distance between the winning (i-th) neuron and any other (j-th) neuron in the map, while R is the range of the neighborhood, which is decreased after each epoch.
The common opinion is that better results can be achieved if the Gaussian function is used instead of the rectangular one [3]. The Gaussian function is defined as follows:

$$G(R, d(i,j)) = \exp\left(-\frac{d^2(i,j)}{2R^2}\right) \qquad (3)$$
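The training rule and the two neighborhood functions discussed so far can be sketched in a few lines of Python. This is only an illustrative software model, not the authors' implementation: the function names are hypothetical, and the Gaussian is written in its commonly used form with the radius R as the width parameter, which is an assumption here.

```python
import math

def rect_neighborhood(d, R):
    """Rectangular function: full strength inside the radius R, zero outside."""
    return 1.0 if d <= R else 0.0

def gauss_neighborhood(d, R):
    """Gaussian function of the topological distance d (assumes R > 0)."""
    return math.exp(-(d * d) / (2.0 * R * R))

def update_weights(w, x, eta, g):
    """One Kohonen step for a single neuron: w <- w + eta * g * (x - w)."""
    return [wi + eta * g * (xi - wi) for wi, xi in zip(w, x)]

# A neuron at topological distance 1 from the winner, radius R = 2.
g = rect_neighborhood(1, 2)
print(update_weights([0.0, 0.0], [1.0, 2.0], eta=0.5, g=g))  # [0.5, 1.0]
```

With the rectangular function every neuron inside the radius moves by the same fraction toward the pattern, whereas the Gaussian scales the step down smoothly with distance.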
Different hardware solutions for the Gaussian function have been proposed [4, 5, 6], most of which are analog circuits. On the other hand, in maps with large numbers of neurons, in which the neighborhood mechanism must be distributed over a large area, digital solutions seem to be much more suitable. The problem in this case is that, owing to its large circuit complexity, the Gaussian function is not easy to implement in digital networks with low chip area and low power [4].
To avoid this problem the authors have recently focused on the triangle function, proposing an efficient digital realization of it and demonstrating that for many neural network (NN) parameters this function offers similar performance [7]. This function is defined as:
where a() is the assumed steepness of this function, η_0 is the winning neuron's learning rate, and c is the bias value. All these parameters decrease toward zero after each (k-th) epoch.
A comparative study of the three functions described above, carried out by means of a software model of the Kohonen SOM, has been presented in [7]. This study shows that the triangle function is a very good approximation of the Gaussian one. From the software implementation point of view alone, this conclusion is of secondary importance. Since the majority of realizations of such SOMs are based on software platforms, such a study has, to our knowledge, not been undertaken so far. In ultra-low-power devices, on the other hand, this is one of the key aspects that must be considered.
In this paper we treat the triangle function as the basis for further investigations, focusing on another important aspect: the influence of the resolution of the output signal of the function block on the learning quality of the SOM. Such investigations are important, since the number of transistors used in the chip (the overall chip area), as well as the energy consumption, depend linearly on the signal resolution.
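The effectiveness of the learning process of the SOM is evaluated in this paper by means of the quantization error, which is a commonly used criterion in such investigations:

$$Q_{err} = \frac{1}{m} \sum_{j=1}^{m} \sqrt{\sum_{l=1}^{n} \left(x_{j,l} - w_{i,l}\right)^2} \qquad (5)$$

where m is the number of learning patterns, X, in the input data set, n is the number of network inputs, and w_i denotes the weights of the neuron that wins for the j-th pattern.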
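A quantization error of this kind can be computed with a short helper. The sketch below assumes the usual definition: the mean Euclidean distance between each learning pattern and the weight vector of its winning (nearest) neuron. The function name and the toy data are illustrative only.

```python
import math

def quantization_error(patterns, weights):
    """Mean distance between each pattern and its nearest weight vector."""
    total = 0.0
    for x in patterns:
        # The winner is taken to be the neuron closest to x (Euclidean metric).
        total += min(math.dist(x, w) for w in weights)
    return total / len(patterns)

patterns = [(0.0, 0.0), (1.0, 1.0)]
weights = [(0.0, 0.0), (1.0, 0.0)]
print(quantization_error(patterns, weights))  # 0.5
```

In the toy case the first pattern coincides with a neuron (distance 0) and the second lies at distance 1 from its winner, so the mean is 0.5.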
One of the important parameters is the network topology, which is defined as a grid of neurons. This parameter determines which neurons belong to the winner's neighborhood for a given value of the distance d [1, 2, 3]. The most frequently used topologies are the rectangular grid with four or eight neighbors and the hexagonal grid [1]. We refer to them as rect4, rect8 and hex, respectively. Our previous work shows that different topologies are the most suitable for different network parameters, and therefore we have proposed a programmable grid that can work in all three modes [8].
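How the topology determines the topological distance can be illustrated for the two rectangular grids. The sketch below assumes that in rect4 only horizontal and vertical steps count (Manhattan distance), while in rect8 diagonal steps are also allowed (Chebyshev distance). The hex grid is omitted because its distance depends on the coordinate convention; the function name is hypothetical.

```python
def topo_distance(a, b, mode):
    """Topological distance between grid positions a and b, given as (row, col)."""
    dr = abs(a[0] - b[0])
    dc = abs(a[1] - b[1])
    if mode == "rect4":
        return dr + dc        # only horizontal and vertical steps
    if mode == "rect8":
        return max(dr, dc)    # diagonal steps allowed as well
    raise ValueError("unsupported topology: " + mode)

# Opposite corners of an 8x8 map.
print(topo_distance((1, 1), (8, 8), "rect4"))  # 14
print(topo_distance((1, 1), (8, 8), "rect8"))  # 7
```

For the same pair of neurons the rect4 distance is never smaller than the rect8 distance, which is why rect4 gives the longest propagation chains.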
II. AN INFLUENCE OF THE RESOLUTION OF THE TF OUTPUT SIGNAL ON THE QUALITY OF THE LEARNING PROCESS
In SOMs realized as digital circuits the signal resolution at the output of the triangular function (TF) block has to be minimized in order to reduce the circuit complexity as well as the energy consumption. On the other hand, the system-level performance of the network must not be significantly degraded as a result.
In this section we present selected simulation results of the software model of the network for different signal resolutions. The network was trained with data either regularly or randomly distributed in the input data space. The number of training patterns was matched to the map size. For example, the map with 16x16 neurons was trained with either 1280 or 2560 patterns, while the map with 10x10 neurons was trained with either 500 or 1000 patterns.
The results in Figs. 1-3 are shown versus the initial value, R_max, of the neighborhood radius R. The R_max parameter is the radius in the first epoch of the learning process. The influence of R_max on the quantization error for the rectangular neighborhood function, given by (2), has been studied earlier in [9]. It has been demonstrated there that for different input data and different network parameters the optimal values of R_max are usually small, even for large maps with hundreds of neurons. This finding contrasts with the very common opinion that R_max should be large enough to cover at least half of the map at the beginning of learning. It is very important, as low values of R_max allow the circuit complexity to be reduced [9].
The results shown in Figs. 1 and 2 are for two inputs and data regularly spread in the input data space, for maps with 16x16 and 10x10 neurons, respectively. Fig. 3 shows example results for 16x16 neurons, three inputs and randomly distributed data. The top plots in the figures are for the rect8 grid, while the bottom plots are for the rect4 topology. The results for the hex grid are not presented, as they are similar to those for the rect8 grid.
Some conclusions can be drawn from the presented results. In the case of regular data it is possible to find values of R_max for which the map becomes properly organized for all topologies, even for a resolution of 3 bits, as shown in Fig. 4a. The rect8 topology is more robust, as this optimal case occurs for a larger range of R_max. For smaller maps, e.g. with 10x10 neurons, a low resolution of only 3 bits does not disrupt the learning process, while for larger maps a degradation is visible, as shown in Fig. 4b. Nevertheless, even in this case a proper ordering of the map is achievable for selected values of R_max. In the optimal case the quantization error equals 16.18e-3. The nonzero value of the error in the optimal case results from the arrangement of the data in the input data space.
In the case shown in Fig. 4b, 29 of the 256 neurons in the map are not properly placed, which increases Q_err by 25%.
A different situation can be observed in the case shown in Fig. 3. Here, for the rect4 topology the best solution has been achieved for a resolution of 6 bits, although for 3 bits a comparable Q_err is achievable for selected values of R_max. For the rect8 topology the best results have been achieved for a resolution of 10 bits, although for 3 bits Q_err is only 13% larger.
The conclusions presented above are of great importance from the hardware implementation point of view. The number of transistors in the case of a resolution of 3 bits and R_max = 8 (also 3
, while the power dissipation will be reduced by 30 % in this case. This issue is discussed further in the paper.
III. THE PROPOSED TRIANGULAR FUNCTION BLOCK
Two important aspects should be clearly distinguished. The first is the neighborhood function block, which calculates the factor η·G() in (1) for particular neurons using the neighborhood distance d(i, j) as an input parameter. The second is the circuit that determines this distance for particular neurons, which has to be realized separately. The majority of the reported solutions concern only the first block, while only a few papers present an implementation of the second mechanism. One such solution is the programmable, clockless, parallel circuit proposed by the authors of this paper in [8]. In that circuit all neighboring neurons were adapted with equal intensities, so there was no need to calculate the factor η·G(). In the work presented in this paper the authors focus on the realization and optimization of the function block.
The idea of the proposed TF block implemented as a digital circuit, as well as a performance comparison with the other two functions, has been presented in [7]. Here only a short overview of this solution is provided. In this paper new results, obtained by means of transistor-level simulations in the Hspice environment, are presented for the complete programmable neighborhood mechanism, for an example SOM with 64 neurons (8x8 grid). Since all these neurons operate in parallel, each neuron needs its own TF block. The proposed TF circuit is described as follows:
The new variables C, D and E determine the shape of the triangular function. The R·E multiplication is performed using a typical shift-and-add circuit. If, for example, the binary number 1001 is multiplied by 110, then a series of add operations is performed: 0·1001 + 1·10010 + 1·100100. The summation is performed by means of typical multi-bit adders. Since the allowed values of D are limited to successive powers of 2, i.e. {1, 2, 4, 8, …}, division is realized quite simply by shifting all bits of the R·E product to the right, using the circuit shown in Fig. 6.
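The shift-and-add multiplication can be modeled in software as follows. This is a behavioral sketch of the general technique, not a description of the transistor-level circuit; the function name is illustrative.

```python
def shift_and_add(r, e):
    """Multiply two non-negative integers by summing shifted copies of r."""
    product = 0
    shift = 0
    while e:
        if e & 1:                  # this bit of e selects a shifted copy of r
            product += r << shift
        e >>= 1
        shift += 1
    return product

# The example from the text: 1001 x 110 = 0*1001 + 1*10010 + 1*100100.
print(bin(shift_and_add(0b1001, 0b110)))  # 0b110110
```

In decimal terms this is 9 x 6 = 54. Each set bit of the second operand contributes one partial product, so the number of additions grows with the resolution of that operand.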
To illustrate how the proposed TF circuit operates, several example cases are shown in
The number of bits in the C, D, E and R variables influences the circuit complexity, as described above. The number of multi-bit adders in a single multiplier is linearly proportional to the resolution of E. On the other hand, the resolution of R determines the number of 1-bit full adders in particular multi-bit adders. In the proposed TF block a 1-bit adder composed of 26 transistors [11] has been used. Several other solutions with even fewer transistors were also considered, but in simulations the solution of [11] was the most efficient in terms of speed and power dissipation.
Figs. 1-3 show that quite good learning quality is possible for R_max < 9, i.e. for a resolution of only 3 bits. In the worst case, shown in Fig. 3, the optimal resolution of the η·G() factor is 6 bits. As a result, the resolution of E should also equal 6 bits.
IV. HARDWARE REALIZATION OF THE TRIANGULAR FUNCTION
In the proposed circuit all neurons in the map operate in parallel. Simulations in the CMOS 0.18 µm technology show that the input data rate can be as high as 20-40 MHz, depending on the number of network inputs and the type of topology. For a single learning pattern X each neuron performs about twenty arithmetic operations (for n = 3). As a result, the map with 64 neurons performs 50e9 operations/s at a power dissipation of 50-100 mW. A larger map with 1000 neurons would achieve as much as 1e12 operations/s. One operation means, e.g., an addition, a multiplication or a step of the search for the winning neuron.
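The throughput figures follow from simple arithmetic, which can be checked as below. The sketch assumes exactly twenty operations per neuron per pattern and the upper data rate of 40 MHz; both are rounded values taken from the text, and the helper name is illustrative.

```python
def ops_per_second(neurons, ops_per_pattern, data_rate_hz):
    """Total operations per second when all neurons work in parallel."""
    return neurons * ops_per_pattern * data_rate_hz

small_map = ops_per_second(64, 20, 40_000_000)
large_map = ops_per_second(1000, 20, 40_000_000)
print(small_map)  # 51200000000, i.e. about 50e9 operations/s
print(large_map)  # 800000000000, i.e. on the order of 1e12 operations/s
```

Under these assumptions the 1000-neuron map comes out at 0.8e12 operations/s, the same order of magnitude as the quoted value.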
Since all neurons in the map are composed of identical blocks, any reduction of the complexity of a block in one neuron affects the complexity of the entire map. The most complex block in the proposed triangle function is the multi-bit multiplier, which is composed of several multi-bit adders. In this approach the shift-and-add multiplier has been realized using an asynchronous binary tree concept. In the first layer of the tree the partial products for bits 0 and 1, 2 and 3, and so on, are added in parallel. In the next layer the result for the pair 0-1 is added to that for 2-3, 4-5 to 6-7, and so on. The number of adders in each layer is half that of the previous layer. In the binary tree approach, for a resolution of κ bits the delay introduced by the multiplier equals T_add·log2(κ), where T_add is the delay of a single multi-bit adder.
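The binary-tree summation and its logarithmic depth can be illustrated with a small behavioral model. This is a sketch of the general pairwise-reduction idea, with hypothetical names; it assumes a non-empty list of partial products and passes an unpaired term through to the next layer unchanged.

```python
def tree_sum(terms):
    """Pairwise (binary-tree) reduction; returns (sum, number_of_layers)."""
    level = list(terms)
    layers = 0
    while len(level) > 1:
        # All additions in one layer are independent and can run in parallel.
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd term waits for the next layer
        level = nxt
        layers += 1
    return level[0], layers

def tree_delay(num_terms, t_add):
    """Delay of the adder tree: one adder delay per layer."""
    return tree_sum([0] * num_terms)[1] * t_add

print(tree_sum([1, 2, 3, 4, 5, 6, 7, 8]))  # (36, 3)
```

Eight partial products need three layers (log2 of 8), whereas a sequential chain of adders would need seven adder delays.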
Figure 6. The structure of the bits-shift block, which shifts the bits to the right, thus dividing the signal by D, which is always equal to a power of 2.
Division operations usually require complex circuits. In this solution the R·E product is divided by selected values only, and therefore the circuit becomes very simple, as shown in Fig. 6. The bits-shift operation is performed by a set of switches directly controlled by particular bits of the D variable:
Only one bit of this variable is allowed to be equal to 1, thus the division is based on the following scheme:
One additional circuit must be used in the bits-shift block. Shifting the bits to the right by p positions leaves the terminals that correspond to the p most significant bits floating, so they have to be connected to ground to avoid ambiguity at these terminals. This is realized by additional switches (one per terminal), which are controlled by signals that also depend on particular bits of the D parameter. Instead of the switches, realized here as transmission gates, a series of AND gates could be used as well, but at the expense of a larger number of transistors and slightly larger power dissipation.
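The behavior of such a bits-shift block can be modeled at the bit level. The sketch below is an assumption-laden illustration rather than the actual circuit: the control word is represented as a one-hot list in which position p set to 1 means D = 2^p, each output terminal is connected to the input terminal p positions higher, and terminals with no source are tied to 0 (ground). The function name and the argument layout are hypothetical.

```python
def bits_shift(value, d_bits, width):
    """Divide value by D = 2**p, where d_bits is one-hot with d_bits[p] == 1."""
    if d_bits.count(1) != 1 or d_bits.count(0) != len(d_bits) - 1:
        raise ValueError("exactly one control bit must be set")
    p = d_bits.index(1)
    in_bits = [(value >> i) & 1 for i in range(width)]  # LSB first
    out_bits = []
    for i in range(width):
        src = i + p
        # Terminals left without a source after the shift are grounded.
        out_bits.append(in_bits[src] if src < width else 0)
    return sum(bit << i for i, bit in enumerate(out_bits))

# 465 / 32 with a 9-bit product: shift right by 5 positions.
print(bits_shift(465, [0, 0, 0, 0, 0, 1], width=9))  # 14
```

The result is the integer part of the quotient, since the bits shifted out on the right are simply discarded.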
The performance of the TF block is presented in Fig. 7. In this simulation a series of multiplications and divisions is performed for R decreasing from 15 to 0 and E decreasing from 31 to 0, i.e. for 512 combinations. The R·E products are then divided by 32 (shifting by 5 bits). The sampling period equals 20 ns in this case, but the circuit also works properly for 6 ns. Fig. 7 also shows the supply current. The height of the current spikes varies between 1 and 25 mA, while the width of these spikes is less than 1 ns. The average energy consumption does not exceed a few pJ per single operation.
Fig. 8 illustrates transistor-level simulations of the entire neighborhood mechanism with 8x8 neurons operating in the rect4 mode. This mode leads to the largest distances, as only the horizontal and vertical directions are allowed. From the data rate point of view this is the worst case. The top diagram illustrates the enable signals, EN, in the first column of the map. These signals trigger the adaptation process in particular neurons. Once the EN signal arrives at the bottom row of the map, the propagation starts in this row, as shown in the second plot. The delay between the EN signals at the first (1, 1) and the last (8, 8) neurons in the chain equals only 14 ns. Since the delay of a single TF block equals 6 ns, the entire map is ready for adaptation after 20 ns. For the rect8 and hex modes this time is even shorter, as the diagonal directions are also allowed in these cases. The other operations performed by the map take 20-30 ns, depending on the number of inputs.
V. CONCLUSIONS
A new, very fast and power-efficient triangular neighborhood function (TF) for hardware-realized Kohonen SOMs has been proposed. The proposed circuit is a digital programmable block that is robust against process, voltage and temperature (PVT) variations, so it can be used in commercial applications. PVT simulations were performed for a wide range of parameters; the results will be presented in future papers.
The presented results show that even low signal resolutions at the output of the TF block allow for proper performance of the SOM, while enabling a significant reduction of both the circuit complexity and the power dissipation. In the proposed SOM all neurons operate in parallel. As a result, large neural networks can achieve a computational throughput as high as 1e12 operations/s in the CMOS 0.18 µm technology.
