The hardware implementation of 
1: Introduction
During the last decade it has been demonstrated that neural networks are capable of providing solutions to many problems in the areas of pattern recognition, signal processing, time series analysis, and many others. While software simulations are very useful for investigating the capabilities of neural network models, they cannot fulfill the need for real-time processing that is necessary for a successful application of neural networks to most real-world problems. To fully profit from the inherent parallelism of neural networks, hardware implementations are essential. However, the mapping of existing neural network algorithms or their resultant networks onto fast, compact, and reliable hardware is a difficult task. Therefore, learning rules which are better suited for hardware implementation have been proposed. These hardware-friendly learning algorithms can be divided into two subclasses, namely:
1. Adaptations of existing neural network learning rules that facilitate their hardware implementation and lead to a better exploitation of chip area and parallelism.
2. Learning algorithms that are by their conception suitable for hardware implementation.
An example of the first class are the perturbation algorithms, which eliminate the calculation of the derivative of the activation function and the need for separate circuitry for the backward pass of the widely-used backpropagation algorithm [20] [43]. An example of the second class are cellular neural networks, which represent a general class of networks whose original definition was even given in terms of analog circuitry [9] and which, due to their local connectivity, are suited for VLSI implementation. In this paper an overview is given of hardware-friendly algorithms for various neural network models. However, before presenting the remedies, some of the typical problems encountered in the hardware implementation of neural networks are outlined.
2: Hardware implementations of neural networks: problems and constraints
Any kind of implementation of neural networks, be it analog electronic, digital electronic, optical, or their hybrid, brings along various constraints:
Accuracy
As compared to the ideal neural network models, hardware implementations can only offer a limited accuracy. Examples of this phenomenon are: (i) the representation of weight values with a small number of bits as opposed to the real-valued weights in the model, (ii) non-uniformity of circuit components which are ideally supposed to be identical, (iii) non-linearity effects in components such as multipliers.
Area
The design of hardware implementations requires a constant interplay between the accuracy required, the (chip) area available, and the degree of parallelism. Reliable elements are often available but their incorporation comes at the price of an area penalty or a reduction of the degree of parallelism.
One can envisage two different approaches to solve these hardware-related problems. Firstly, an improvement of the hardware required for the implementation of neural networks is of crucial importance; examples are the use of pulse modulation techniques, which combine the advantages of digital and analog electronics [16], and the design of compact and reliable components. The second approach is to overcome these problems by adapting existing learning algorithms or by designing new hardware-friendly algorithms, which is the focus of this paper.
3: Hardware-friendly learning algorithms
3.1: Multilayer feedforward networks
The most popular algorithm for training multilayer feedforward networks is doubtlessly the backpropagation algorithm, popularized by Rumelhart [35]. Its realization in analog hardware, however, poses some serious problems because of the need for an accurate derivative of the activation function in the calculation of the gradients on the error surface. Another disadvantage is the need for separate circuitry for the backward pass of the algorithm.
Perturbation algorithms
The general idea behind perturbation algorithms is to obtain a direct estimate of the gradients by slightly perturbing the parameters of the network and using the forward path to measure the change in the network error; this also implies that the circuitry for the backward pass can be dispensed with. Another advantage is that no a priori knowledge of the non-linearity is used and hence that the implementation is likely to be more robust to hardware non-linearities.
The first perturbation algorithm was the Madaline-III rule [43], which is based on neuron perturbation, that is, each gradient is estimated by perturbing the input value of a neuron. However, since the node perturbations have to be done serially, the computational complexity increases considerably as compared with standard backpropagation. Moreover, it requires some extra circuitry for the addressing and selection of neurons. Perturbation of the weights (see figure 1) eliminates some of this extra circuitry needed to implement node perturbation. It also performs better when limited precision weights are used [20]. This comes at the price of an even higher computational complexity, which stems from the fact that all weights (except the weights to the output layer) have to be perturbed serially.
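The serial weight perturbation idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the circuit-level procedure of [20]: the single tanh neuron, the squared error, and the names `forward_error` and `weight_perturbation_gradient` are hypothetical choices made for the example. Only forward evaluations of the error are used, and the loop over weights makes the serial cost visible.

```python
import numpy as np

def forward_error(weights, x, target):
    """Squared error of a single tanh neuron; stands in for the forward path."""
    y = np.tanh(weights @ x)
    return 0.5 * (y - target) ** 2

def weight_perturbation_gradient(weights, x, target, delta=1e-4):
    """Estimate dE/dw_i by perturbing one weight at a time (serially)."""
    base = forward_error(weights, x, target)
    grad = np.zeros_like(weights)
    for i in range(weights.size):          # one forward pass per weight
        perturbed = weights.copy()
        perturbed[i] += delta
        grad[i] = (forward_error(perturbed, x, target) - base) / delta
    return grad

w = np.array([0.3, -0.2, 0.5])
x = np.array([1.0, 0.5, -1.0])
estimate = weight_perturbation_gradient(w, x, target=0.8)

# Analytic gradient of the toy neuron, used here only as a reference.
y = np.tanh(w @ x)
analytic = (y - 0.8) * (1.0 - y ** 2) * x
```

The estimate needs no derivative of the activation function; the analytic gradient is computed only to check the sketch.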
The complexity of the weight perturbation algorithm is addressed and reduced by viewing a perturbation of the weights incoming to a neuron as a summed weight perturbation of that neuron. The result is a weight perturbation method that improves upon the Madaline-III rule, since it does not require access to hidden neurons and has the same computational complexity [13]. This scheme has actually been implemented in hardware and shows good behaviour on some small benchmarks.
The loss of parallelism in the weight perturbation scheme can also be overcome by perturbing all weights simultaneously and using the resulting error to update the weights [2]. For a reliable estimate of a gradient, multiple perturbations should be performed, but this number is normally quite small compared to the total number of weights in the network. A similar approach has been pursued by Cauwenberghs [8], who studies its convergence properties in more detail.
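A minimal sketch of simultaneous perturbation is given below. It assumes random sign patterns of fixed size `delta` and a toy single-neuron error; these choices and the function names are illustrative assumptions and are not taken from [2] or [8]. Each trial perturbs all weights at once and correlates the measured error change with the sign applied to each weight.

```python
import numpy as np

def error(weights, x, target):
    """Squared error of a single tanh neuron (a toy stand-in for the network)."""
    return 0.5 * (np.tanh(weights @ x) - target) ** 2

def parallel_perturbation_gradient(weights, x, target, rng, delta=1e-3, repeats=20):
    """Estimate the gradient from perturbations applied to all weights at once."""
    base = error(weights, x, target)
    grad = np.zeros_like(weights)
    for _ in range(repeats):
        signs = rng.choice([-1.0, 1.0], size=weights.shape)  # random +/- pattern
        change = error(weights + delta * signs, x, target) - base
        grad += (change / delta) * signs  # correlate error change with each sign
    return grad / repeats

rng = np.random.default_rng(0)
w = np.array([0.3, -0.2, 0.5])
x = np.array([1.0, 0.5, -1.0])
estimate = parallel_perturbation_gradient(w, x, 0.8, rng)

# Analytic gradient of the toy neuron, used here only as a reference.
y = np.tanh(w @ x)
analytic = (y - 0.8) * (1.0 - y ** 2) * x
```

To first order, every single trial already points along a direction that does not increase the error; averaging several trials reduces the cross-talk between weights.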
Chain-rule perturbation [18] also addresses the complexity of standard weight perturbation and employs a chain rule approximation of the gradient that enables all weights going out of a neuron to be perturbed in parallel. It improves upon the Madaline-III rule and summed weight perturbation [13] because it does not make any assumptions about the multiplication, allowing non-linear synapses which are typical for many analog implementations.
Local learning algorithms
In [32] an anti-Hebbian local learning algorithm is described in which the weight update for a certain layer only depends on the input and output of that layer and a global error signal. This local learning rule circumvents the backpropagation of error signals that complicates the hardware implementation of the backpropagation algorithm. Although it is not a gradient descent rule, it is still guaranteed that the synaptic weights are updated in the error descent direction. Brandt and Lin [6] have also developed an algorithm that requires no explicit backpropagation of errors and uses information locally available at a neuron, most importantly the rates of change of the outgoing weights. One of their local algorithms is equivalent to the standard backpropagation algorithm. Their algorithm and especially the measuring of the rates of change of the weights might, however, still be hard to implement.
Training without derivatives
The necessity of the derivative of the activation function in the backpropagation algorithm can be circumvented by an approximation, which only needs the non-linearity itself in the backward pass. This is, for example, established by a well-chosen Taylor expansion that offers a close approximation to the original algorithm [17].
A completely different approach to exclude derivatives from learning algorithms has been taken by Battiti and Tecchiolli. Their reactive tabu search is a heuristic method that can solve combinatorial optimization problems [5]. It can be applied to the training of neural networks by transforming the continuous space of the weights into a discrete one by a Gray encoding of the weight values. The heuristic that is used to obtain a new set of weights is to choose the configuration with the smallest error value that differs in only one bit from the current configuration; because of the Gray encoding this method performs in fact a discretized form of steepest descent. In order to avoid cycles when changing the weight configuration and not to be confined to a limited part of the search space, some additional constraints are included in the heuristic. Another advantage of the reactive tabu search is the limited precision of the weights that is needed. These characteristics make it suitable for hardware implementation, as is illustrated by the TOTEM chip [4].
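The one-bit move on Gray-encoded weights can be sketched as below. This is a simplified illustration that assumes a single weight stored as an integer level; it deliberately omits the prohibition (tabu) constraints that prevent cycles, and the helper names are hypothetical rather than taken from [5].

```python
def to_gray(n: int) -> int:
    """Binary-reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)

def from_gray(g: int) -> int:
    """Inverse of to_gray."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

def best_one_bit_move(code: int, bits: int, error) -> int:
    """Return the Gray code word, one bit flip away, with the lowest error."""
    neighbours = [code ^ (1 << b) for b in range(bits)]
    return min(neighbours, key=lambda c: error(from_gray(c)))

# Toy error with its minimum at weight level 5 (an assumption for the example).
error = lambda level: (level - 5) ** 2
code = to_gray(0)
new_code = best_one_bit_move(code, bits=4, error=error)
```

Because adjacent integer levels differ in exactly one Gray code bit, the neighbourhood of one-bit flips always contains the two nearest weight values, which is what makes the move resemble a discretized descent step.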
Complex backpropagation
In some applications one would like a hardware implementation of an NN that accepts sinusoidal (complex-valued) signals. In [15] backpropagation is therefore extended to the complex domain allowing complex-valued inputs, weights, activation functions, and outputs. It carefully solves the problem of the design of a suitable complex activation function that is bounded, non-linear, differentiable, and easily implementable. These properties exclude for example the complex extension of the standard sigmoid function, which is unbounded.
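The unboundedness of the complex sigmoid can be checked numerically. The sketch below evaluates 1/(1 + exp(-z)) close to z = iπ, where exp(-z) = -1 and the denominator vanishes; the function name `complex_sigmoid` is chosen for this example only.

```python
import cmath

def complex_sigmoid(z: complex) -> complex:
    """Standard sigmoid extended directly to a complex argument."""
    return 1.0 / (1.0 + cmath.exp(-z))

at_origin = abs(complex_sigmoid(0j))              # well-behaved on the real axis
near_pole = complex(0.0, cmath.pi) + 1e-6         # slightly off the pole at i*pi
magnitude = abs(complex_sigmoid(near_pole))       # grows without bound near the pole
```

The same singularity recurs at every odd multiple of iπ, which is why a bounded complex activation function has to be designed differently.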
Threshold networks
The design of a compact digital neural network can be simplified considerably by using hard-limiting threshold gates as activation functions instead of a differentiable (sigmoidal) non-linearity. While training algorithms for two-layer threshold networks, that is perceptrons, abound, they are limited to solving linearly separable problems only. This constraint can be resolved by allowing more layers of neurons, but most algorithms for training these multilayer networks are based on gradient descent and require a differentiable activation function. The development of training algorithms for multilayer networks with a threshold as activation function is therefore an important issue for NN hardware implementation.
The Madaline-II rule [43] is closely related to the neuron perturbation of the Madaline-III rule. However, the discontinuities in a threshold network exclude the direct estimation of a gradient. Therefore, the error is minimized in the following way: a small neuron input in the second layer is perturbed to see whether an inversion of the activation value of this neuron reduces the Hamming error on the output neurons. If this is the case, the incoming weights of this neuron are adapted with a perceptron algorithm to reinforce this inversion. If not, the same procedure is repeated until the output layer is reached, the weights of which are directly updated by a perceptron algorithm.
Other approaches try to use standard backpropagation to obtain threshold networks. In [10] the steepness of the sigmoidal non-linearity is gradually increased during training of the network to obtain a final network with only thresholds, using the fact that the sigmoid function approaches a threshold when the steepness parameter approaches infinity. This algorithm can be useful when off-line training of the network is appropriate.
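The limiting behaviour that this approach relies on can be illustrated numerically. The sketch below only shows how a sigmoid with increasing steepness approaches a hard threshold; the steepness values and sample points are arbitrary assumptions, and the training schedule of [10] is not reproduced.

```python
import numpy as np

def sigmoid(x, steepness):
    """Logistic function with an explicit steepness (gain) parameter."""
    return 1.0 / (1.0 + np.exp(-steepness * x))

def threshold(x):
    """Hard-limiting activation: 1 for positive input, 0 otherwise."""
    return (x > 0).astype(float)

x = np.array([-1.0, -0.1, 0.1, 1.0])
# Largest deviation from the threshold for increasingly steep sigmoids.
gaps = [float(np.max(np.abs(sigmoid(x, b) - threshold(x)))) for b in (1.0, 10.0, 100.0)]
```

The deviation shrinks as the steepness grows, so a network trained with a slowly increasing steepness can end up with units that are effectively thresholds.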
There is a host of so-called constructive algorithms that gradually build a threshold network by adding neurons and weights [37]. One recent example is the use of a geometrical approach to construct a multilayer threshold network [23]. The goal is to find a set of separating hyperplanes (hidden layer neurons) in the input pattern space with the property that inputs located between two neighbouring hyperplanes have the same target output. Another advantage of these constructive algorithms is that the number of neurons in the hidden layer need not be specified a priori. Of course, constructive algorithms are not well-suited for implementation in hardware, but the resulting compact threshold networks are.
3.2: Kohonen's self-organizing maps
The basic elements of Kohonen's algorithm for self-organizing maps that reproduce the input probability distribution in a compact way are, at time t:
(i) The selection of the neuron with weight vector $w_i(t)$ closest to the input pattern $x(t)$ (winner neuron), that is, with minimal distance $d(i)$:
$$d(i) = \lVert x(t) - w_i(t) \rVert$$
(ii) The update of the weights according to
$$w_i(t+1) = w_i(t) + \alpha(t)\,\Lambda(t)\,\bigl(x(t) - w_i(t)\bigr)$$
where $\alpha(t)$ is the adaptation gain and $\Lambda(t)$ is the neighbourhood function, which depends on the winner neuron and the neuron under consideration.
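The winner selection and weight update described above can be sketched for a one-dimensional map as follows. The Gaussian neighbourhood function, the Euclidean distance, and the parameter names `gain` and `radius` are assumptions made for this illustration.

```python
import numpy as np

def som_step(weights, x, gain, radius):
    """One on-line update of a one-dimensional map; weights has shape (neurons, dim)."""
    distances = np.linalg.norm(weights - x, axis=1)
    winner = int(np.argmin(distances))                 # neuron closest to the input
    positions = np.arange(weights.shape[0])
    # Neighbourhood value decreases with map distance to the winner.
    neighbourhood = np.exp(-((positions - winner) ** 2) / (2.0 * radius ** 2))
    updated = weights + gain * neighbourhood[:, None] * (x - weights)
    return updated, winner

weights = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
x = np.array([0.9, 1.2])
updated, winner = som_step(weights, x, gain=0.5, radius=1.0)
```

Every neuron moves towards the input, with the winner moving most; the distance computation and the multiplications in the update are the operations that the hardware-oriented variants try to simplify.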
From the above description it is clear that the original algorithm is demanding in terms of computing resources like calculation of the Euclidean distance, multiplication, weight storage, and adaptation functions. Various adaptations to the algorithm have been proposed to assist its hardware implementation.
[Thiran-94]
A crucial issue in hardware implementations is the influence of the quantization of weights and inputs on the behaviour of the learning algorithm. This paper studies the quantization effects on a Kohonen network and demonstrates that its consequences can be greatly reduced by having a neighbourhood function that decreases with the distance between the winner neuron and its neighbours. Their experiments show that five bits can be sufficient to guarantee convergence to a solution close to the solution obtained without quantization [40]. In this way an implementation can be obtained that uses no multipliers, has a high degree of parallelism, and at the price of a slightly bigger map size (10% to 15%) shows results comparable to the original algorithm [34].
[Vassilas-95]
One of the demands of an NN implementation on systolic arrays is the effective use of the processor resources. In general, batch processing is an appropriate means to obtain better parallelisation. Kohonen's original algorithm, however, has both on-line winner selection and on-line weight update. Two possible variants are:
(i) batch winner selection and batch weight update;
(ii) batch winner selection and on-line weight update.
In Vassilas' paper it is shown that the convergence properties of these variants are comparable with the original on-line algorithm and that the second variant is almost identical in performance to the original algorithm [41].
3.3: Recurrent networks
The class of recurrent networks exhibits complex dynamical behaviour and needs a reliable method for training and recall. Two such widely used paradigms for training recurrent networks are the Boltzmann machine and mean field theory (MFT) learning.
Boltzmann machine
The Boltzmann machine is a stochastic learning rule which uses only locally available information and is for that reason well-suited for hardware implementation. The parallelism of a potential hardware implementation is, however, severely constrained by the required asynchronous update of neurons. Therefore, in a recent analog neurocomputer a synchronous version of the Boltzmann machine is used [33]. Another peculiarity of the Boltzmann machine is its use of simulated annealing by a gradual increase of the gain of the activation function. In Bellcore's implementation of a Boltzmann machine this annealing schedule is replaced by a gradual decrease of additive noise, which can be efficiently implemented in analog hardware [1].
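The asynchronous stochastic update and the gain-based annealing can be sketched as below. This illustrates only the neuron dynamics, not the weight learning rule; the binary 0/1 units, the logistic firing probability, and the names `async_sweep` and `anneal` are assumptions made for the example.

```python
import numpy as np

def async_sweep(state, weights, gain, rng):
    """Update the neurons one at a time in random order; state entries are 0 or 1."""
    state = state.copy()
    for i in rng.permutation(state.size):
        net = weights[i] @ state
        p_on = 1.0 / (1.0 + np.exp(-gain * net))   # firing probability
        state[i] = 1.0 if rng.random() < p_on else 0.0
    return state

def anneal(state, weights, gains, rng):
    """Run one sweep per gain value; a rising gain acts like a falling temperature."""
    for gain in gains:
        state = async_sweep(state, weights, gain, rng)
    return state

rng = np.random.default_rng(0)
weights = np.array([[0.0, 2.0], [2.0, 0.0]])   # symmetric, no self-connections
gains = np.linspace(0.1, 5.0, 20)              # gradually increasing gain
final = anneal(np.array([1.0, 0.0]), weights, gains, rng)
```

The loop in `async_sweep` is inherently sequential, because each neuron sees the already updated states of the others; this is the constraint on parallelism that the synchronous version avoids.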
Mean field theory
This method is based on the idea that the simulated annealing process in the stochastic Boltzmann machine is too time-consuming and can be replaced by a deterministic mean field approximation.
In an optical design [29] it is demonstrated that an MFT algorithm with synchronous updating of neurons leads to good results and is suitable for hardware implementation.
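A deterministic mean field update with synchronous updating can be sketched as follows. The bipolar units, the tanh form of the update, and the fixed temperature are common textbook assumptions used for illustration and are not taken from the optical design in [29].

```python
import numpy as np

def mean_field_settle(weights, temperature, steps=50, start=None):
    """Iterate the synchronous update v <- tanh(W v / T) for units in [-1, 1]."""
    v = np.full(weights.shape[0], 0.1) if start is None else start.copy()
    for _ in range(steps):
        v = np.tanh(weights @ v / temperature)   # all neurons updated at once
    return v

weights = np.array([[0.0, 1.0], [1.0, 0.0]])
cold = mean_field_settle(weights, temperature=0.5)   # settles to a committed state
hot = mean_field_settle(weights, temperature=5.0)    # decays towards zero activity
```

Each neuron holds a continuous mean activation instead of a sampled binary state, so no random number generation or long sampling run is needed.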
3.4: Other types of neural networks
RAM-based networks
This is a special class of neural networks based on random access memories that are fit for hardware implementation. This network model can be easily implemented in standard available components, but has the disadvantage of a limited learning capacity. Therefore, various generalizations of the original model have been developed with extended capabilities, but also a more complex realization in hardware. A recent overview of RAM-based networks and related implementation aspects can be found in [3].
Cellular neural networks
Cellular neural networks are of particular interest for VLSI implementation because of their sparse connectivity. Every unit of the network is a simple analog processor that interacts directly only with its neighbouring units within an often small range. Since the range of the network dynamics and the connection complexity are independent of the number of units, its implementation scales up well to bigger networks. An extensive overview by Chua and Roska of the cellular neural network paradigm can be found in [9].
4: Inaccuracy and robustness
A key issue in hardware implementations of all neural network models is the required precision of their parameters, since any hardware implementation is liable to imperfections such as limited numerical precision, component imprecision, noise, and stuck-at faults in weights and neurons. All of these inaccuracies have been the subject of investigation and most neural network models show in fact a remarkable degree of robustness when these inaccuracies are incorporated during the training of the network [14] [27] [25].
4.1: Limited precision
Two different types of limited precision can be discerned. Firstly, the limitation by the representation of values by a small number of bits; this plays a major role in digital neural network implementation. Secondly, limited precision can be caused by component imprecision, for example non-linear responses of multipliers and variations between components. This problem is paramount in both electronic and optical implementations. A large range of theoretical and experimental studies has been performed to investigate and confine the effects of limited precision computation.
Limited numerical precision
The accuracy that is needed for representing the weights of a neural network is area-consuming and is incompatible with the system noise in analog implementations. Hence, the number of different weight values of the network should be as small as possible in order to obtain an efficient and accurate implementation. For different network architectures and learning algorithms the effect of such a limited weight precision has been investigated. The common tenor of these investigations is that below a certain level these limitations have a large influence on the behaviour of the network. For example, the accuracy needed in the standard backpropagation training algorithm in order not to deviate too much from the ideal learning trajectory is generally found to be 14 to 16 bits [19]. Note that the accuracy needed in the forward pass lies around 8 bits [19] [30].
In order to reduce the chip area needed for weight storage and to overcome system noise, a further reduction of the number of allowed weight values is desirable.
Hence, weight discretization algorithms based on the backpropagation learning rule have been designed for training multilayer networks with a very limited number of weight values (2-4 bits) [12] [38] . The rationale behind these weight discretization algorithms is to keep and update the weights with a high resolution off-line and use the discrete weights in the forward pass; these methods are therefore best-suited for off-chip training.
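The rationale of keeping high-resolution weights while using discrete weights in the forward pass can be sketched as below. The uniform, symmetric quantization grid, the single tanh neuron, and the names `quantize` and `train_step` are illustrative assumptions; the specific algorithms of [12] and [38] are not reproduced.

```python
import numpy as np

def quantize(weights, bits, w_max=1.0):
    """Round weights to 2**bits - 1 evenly spaced levels in [-w_max, w_max]."""
    levels = 2 ** bits - 1                 # odd count, so zero is representable
    step = 2.0 * w_max / (levels - 1)
    return np.clip(np.round(weights / step) * step, -w_max, w_max)

def train_step(w_cont, x, target, bits, lr=0.1):
    """One update: discrete weights in the forward pass, continuous weights updated."""
    w_disc = quantize(w_cont, bits)
    y = np.tanh(w_disc @ x)                       # forward pass with discrete weights
    grad = (y - target) * (1.0 - y ** 2) * x      # gradient of the squared error
    return w_cont - lr * grad                     # update kept at high resolution

q = quantize(np.array([0.4, -0.7, 0.05, 1.3]), bits=2)
w_next = train_step(np.array([0.4, -0.7, 0.05]), np.array([1.0, 0.5, -1.0]), 0.8, bits=2)
```

Small updates accumulate in the continuous weights until they cross a quantization boundary, which is why the high-resolution copy is needed during training and can be discarded afterwards.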
Component imprecision
The state of the art in analog (optical and electronic) hardware has progressed considerably over the last decade. However, compared to digital technology it is not yet a mature discipline and the design of reliable and identical components gives rise to problems. In analog electronic implementations it is, for example, complex to efficiently construct a linear multiplier with a sufficient operating range, and simple non-linear multipliers are therefore often preferable or even inevitable. Examples of the use of non-linear multipliers can be found in both analog electronic [24] and optical implementations [25].
It is also shown how the backpropagation learning rule can compensate for the non-linearity of multiplications by incorporating this non-linear multiplier in its derivation [24].
Another problem of analog hardware is the construction of an activation function that is close to the widely-used standard sigmoid. However, the incorporation of an accurate model of the hardware activation function in the training algorithm can compensate for this inaccuracy [24]. Additional difficulties arise in an analog optical implementation of the sigmoidal function based on intensity encoding, namely the limitation to non-negative values, which means that the non-linearities are shifted into the non-negative domain, and a gain (steepness) that differs greatly from one [36].
4.2: Robustness
Until a few years ago robustness of neural networks was mainly a folk theorem, but it has been investigated quite thoroughly these last years. In the above, several examples have already been given of the robustness in neural networks to inaccuracies. Here, the influence of faulty weights or neurons and noise will be discussed in some more detail.
Faulty weights and neurons
The removal of the interconnection weights in a network and the occurrence of stuck-at faults in neurons are two types of faults that can serve as a test bed for the robustness of neural networks. The robustness of a backpropagation-trained multilayer network to removing weights to/from the hidden layer and the influence of redundancy in the form of excess hidden neurons have been investigated [11]. While graceful degradation of the network performance under weight removal was observed, the addition of more hidden neurons only deteriorated the results. Hence, it can be concluded that standard backpropagation training is not inherently fault-tolerant. An augmentation technique that tries to introduce linear dependencies in the already trained network by adding hidden neurons leads to better robustness.
The effect of "stuck-at-0" and "stuck-at-1" neurons on the solutions found in recurrent optimization networks is investigated in [31] . This type of network exhibits a high degree of fault-tolerance and an ability to find sub-optimal solutions to optimization problems in the presence of stuck-at faults in neurons.
It is widely believed that one should not verify the robustness of a neural network model a posteriori, but incorporate a robustness criterion in the training phase. This can be done by changing the objective function that has to be optimized during training or by injecting the expected faults during training. An illustration of this fact is an adaptation of the backpropagation learning rule that uses only a random subset of hidden neurons during each iteration. The resulting network is far more robust to the destruction of hidden neurons with only a small loss of accuracy in the noiseless case [22].
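Using only a random subset of hidden neurons in an iteration can be sketched as follows. The independent keep probability per neuron, the tanh hidden layer, and the function name are assumptions for this illustration; the exact selection scheme of the cited work is not reproduced.

```python
import numpy as np

def forward_with_random_hidden_subset(x, w_hidden, w_out, keep_prob, rng):
    """Forward pass in which each hidden neuron is active with probability keep_prob."""
    hidden = np.tanh(w_hidden @ x)
    mask = rng.random(hidden.shape) < keep_prob    # True marks an active neuron
    return w_out @ (hidden * mask), mask           # inactive neurons contribute zero

rng = np.random.default_rng(0)
x = np.array([0.5, -1.0])
w_hidden = np.array([[0.2, 0.4], [-0.3, 0.1], [0.7, -0.5]])
w_out = np.array([[0.6, -0.2, 0.3]])
output, mask = forward_with_random_hidden_subset(x, w_hidden, w_out, 0.5, rng)
```

In a training loop only the weights attached to active neurons would receive an update in that iteration, so no single hidden neuron becomes indispensable for the output.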
Noise and perturbation
The surprising effects of the injection of random noise on the weights of a multilayer neural network when training by the backpropagation algorithm have been discussed at length by Murray and Edwards [27]. Both analytically and experimentally it is demonstrated that synaptic noise improves the network's fault tolerance to weight damage, generalization on unseen patterns, and the training trajectory. Similar results have been obtained when injecting additive noise into the weights of recurrent neural networks [21].
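Noise injection on the weights during training can be sketched as below. Additive zero-mean Gaussian noise, a single tanh neuron, and the name `noisy_train_step` are assumptions chosen for this illustration; the noise model analysed in [27] may differ.

```python
import numpy as np

def noisy_train_step(w, x, target, noise_std, rng, lr=0.1):
    """One gradient step in which the forward pass sees noise-corrupted weights."""
    w_noisy = w + rng.normal(0.0, noise_std, size=w.shape)   # synaptic noise
    y = np.tanh(w_noisy @ x)
    grad = (y - target) * (1.0 - y ** 2) * x                 # gradient at the noisy point
    return w - lr * grad                                     # the clean weights are updated

rng = np.random.default_rng(0)
w = np.array([0.3, -0.2, 0.5])
x = np.array([1.0, 0.5, -1.0])
w_noisy_step = noisy_train_step(w, x, 0.8, noise_std=0.05, rng=rng)
w_clean_step = noisy_train_step(w, x, 0.8, noise_std=0.0, rng=rng)
```

With the noise level set to zero the step reduces to ordinary gradient descent, so the noise acts purely as a perturbation of the point at which the error is evaluated.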
A theoretical study of the effect of perturbations of the parameters in a general class of feedback neural networks is presented in [42]. In this type of network it is important to know whether the stable states of the perturbed networks are close to the original stable states. It is shown that under reasonable conditions a linear relationship holds between the perturbations of the network parameters and the resulting error in the stable states.
5: Summary and conclusions
In this paper an overview has been given of a variety of methods that have been developed to facilitate the hardware implementation of neural network models. Each of the well-known neural network models brings along its specific problems for hardware implementation. While, for example, for the standard backpropagation algorithm the use of an accurate derivative of the activation function complicates the implementation, the realization in hardware of a Boltzmann machine is hindered by the sequential update of neurons. Most of the hardware-friendly algorithms that have been described here are geared towards the implementation of on-chip learning. The advantages of on-chip learning are manifold and include, besides the gain in speed, an inherent compensation for component inaccuracies and the adaptation to new training patterns. However, most of the on-chip learning rules described in this paper have not been realized in hardware and their efficacy is difficult to judge. Some notable exceptions are Bellcore's implementation of a Boltzmann machine and the mean field theory algorithm [1], and Battiti's TOTEM chip based on the reactive tabu search [4].
Attention has also been given to learning algorithms that are not suited for hardware implementation themselves, but the resulting network of which can be efficiently implemented. An important example of this class are the threshold networks, the training of which is often based on constructive methods that evolve the network's topology during training.
A key problem for all realizations of neural networks in hardware is the inaccuracy and imperfection of the hardware components. This ranges from quantization of the weight values and component-to-component variations to stuck-at faults of weights and neurons. Most of these aspects have been discussed and it has been exemplified that neural network models are remarkably robust to this limited precision when the inaccuracies are incorporated during the training of the network.
