Previous work on analog VLSI implementation of multi-layer perceptrons with on-chip learning has mainly targeted the implementation of algorithms like back-propagation. Although back-propagation is efficient, its implementation in analog VLSI requires excessive computational hardware. In this paper we show that using gradient descent with direct approximation of the gradient instead of back-propagation is cheaper for parallel analog implementations. We also show that this technique (which we call "weight perturbation") is suitable for multi-layer recurrent networks as well. A discrete level analog implementation showing the training of an XOR network as an example is also presented.
I. Introduction
Many researchers have recently proposed architectures for VLSI implementation of multi-layer perceptrons. Most of the reported work has addressed digital implementation [2]. Furman and associates [1] have reported an analog implementation of back-propagation. In both the digital and the analog work reported, back-propagation was selected because of its efficiency and popularity. For analog hardware, the implementation of on-chip learning algorithms requires bi-directional synapses and the generation of the derivative of the neuron transfer functions with respect to their input. As the area and power of a synapse have an important effect on the total chip area and on the size of network that can be implemented, their minimisation is important. Furthermore, in an analog implementation, the generation of the derivative is a rather difficult task. These complications are amplified even further when recurrent networks are implemented, as the number of synapses may grow as the square of the number of neurons.
Recently, the Madaline Rule III (MR III) was suggested as an alternative to back-propagation for analog implementation [4]. This rule can be considered as implementing gradient evaluation using "node perturbation" according to:

$$\Delta w_{ij} = -\eta \frac{\Delta E}{\Delta net_i} x_j$$

where $net_i = \sum_j w_{ij} x_j$ and $x_j = f(net_j)$, with $f$ being the non-linear squashing function and $\eta$ the learning rate. Therefore, in addition to the actual hardware needed for the normal operation of the network, the implementation of the MR III learning rule for an $N$ neuron network in analog VLSI requires:
An addressing module and wires routed to select and deliver the perturbation to each neuron.
Either one or $N$ multipliers to compute the term $\frac{\Delta E}{\Delta net_i} x_j$, in addition to the multiplication by the learning rate. If a single multiplier is used, then additional multiplexing hardware is required.
An addressing module and wires routed to select and read the x j terms.
Note that if greater training flexibility is required, in the sense of off-chip access to the gradient values, then the states of the neurons ($x_j$) would need to be made available off-chip as well, which will require a multiplexing scheme and $N$ chip pads.
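The node perturbation rule described above can be illustrated with a small software sketch. It is only an illustration under stated assumptions: a 2-2-1 network, a logistic squashing function, a squared error for a single pattern, and the names `squash`, `forward`, `error` and `mr3_updates` are all hypothetical and do not come from [4]. The sketch makes the three hardware requirements visible: each neuron must be addressed to receive a perturbation, each $x_j$ must be read, and each update needs the product of the error ratio with $x_j$.

```python
import math


def squash(net):
    """Non-linear squashing function f (a logistic sigmoid is assumed)."""
    return 1.0 / (1.0 + math.exp(-net))


def forward(w_hid, w_out, inputs, net_pert=None):
    """Forward pass of a small 2-2-1 network.

    net_pert optionally maps a neuron label ("h0", "h1" or "o") to a
    perturbation added to that neuron's net input.
    Returns (output, inputs seen by the hidden neurons,
    inputs seen by the output neuron).
    """
    net_pert = net_pert or {}
    x_in = list(inputs) + [1.0]            # trailing 1.0 serves as a bias input
    hidden = []
    for i, row in enumerate(w_hid):
        net = sum(w * x for w, x in zip(row, x_in))
        hidden.append(squash(net + net_pert.get(f"h{i}", 0.0)))
    x_hid = hidden + [1.0]
    net_o = sum(w * x for w, x in zip(w_out, x_hid))
    return squash(net_o + net_pert.get("o", 0.0)), x_in, x_hid


def error(output, target):
    """Squared error for one pattern (assumed error measure)."""
    return 0.5 * (output - target) ** 2


def mr3_updates(w_hid, w_out, inputs, target, eta=0.5, d_net=1e-4):
    """Weight changes -eta * (dE / dnet_i) * x_j obtained by node perturbation."""
    y0, x_in, x_hid = forward(w_hid, w_out, inputs)
    e0 = error(y0, target)

    d_hid = []
    for i in range(len(w_hid)):
        # Address hidden neuron i and perturb its net input.
        y, _, _ = forward(w_hid, w_out, inputs, {f"h{i}": d_net})
        ratio = (error(y, target) - e0) / d_net
        # One multiplication per incoming weight, using the read-out x_j.
        d_hid.append([-eta * ratio * x for x in x_in])

    y, _, _ = forward(w_hid, w_out, inputs, {"o": d_net})
    ratio = (error(y, target) - e0) / d_net
    d_out = [-eta * ratio * x for x in x_hid]
    return d_hid, d_out
```

Calling `mr3_updates(w_hid, w_out, (1.0, 0.0), 1.0)` with `w_hid` a list of two 3-element rows and `w_out` a 3-element list returns the proposed changes for every weight. Note that one perturbed forward pass is needed per neuron, while the multiplication by $x_j$ is needed per weight.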
An alternative approach to node perturbation is "weight perturbation", where the gradient is approximated by a finite difference. We show in this paper that gradient evaluation using weight perturbation is a cheaper solution, in terms of both hardware and complexity, and can equally be used to train recurrent networks.
II. Gradient Evaluation using Weight Perturbation
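The gradient with respect to the weight can simply be evaluated by the approximation

$$\frac{\partial E}{\partial w_{ij}} \approx \frac{E(w_{ij} + pert_{ij}) - E(w_{ij})}{pert_{ij}}$$

if the perturbation $pert_{ij}$ (also written $\Delta_{pert} w_{ij}$) is small enough. This is the forward difference method and the weight update rule becomes:

$$\Delta w_{ij} = -\eta \frac{E(w_{ij} + pert_{ij}) - E(w_{ij})}{pert_{ij}}$$

where $\eta$ is the learning rate and $E()$ is the total mean square error produced at the output of the network for a given pair of input and training patterns and a given value of the weights.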
For the more accurate alternative, however, the number of forward relaxations of the network required is of the order $N^3$, rather than $N^2$ for the forward difference method. Thus either method can be selected on the basis of a speed/accuracy trade-off.
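The dependence of the forward difference estimate on the size of the perturbation can be checked numerically. The following is a minimal sketch and not part of the reported experiments: it assumes a single sigmoid neuron with one weight, a squared error, and arbitrary values for the input, target and weight. The function names are hypothetical.

```python
import math


def squash(net):
    """Logistic squashing function (assumed)."""
    return 1.0 / (1.0 + math.exp(-net))


def error(w, x=1.5, target=1.0):
    """Squared error of a one-weight neuron for one (assumed) pattern."""
    y = squash(w * x)
    return 0.5 * (y - target) ** 2


def exact_gradient(w, x=1.5, target=1.0):
    """Analytic dE/dw, used only as a reference for comparison."""
    y = squash(w * x)
    return (y - target) * y * (1.0 - y) * x


def forward_difference(w, pert):
    """Gradient estimate from one perturbed and one unperturbed evaluation."""
    return (error(w + pert) - error(w)) / pert


w = 0.2
reference = exact_gradient(w)
for pert in (1e-1, 1e-2, 1e-3):
    estimate = forward_difference(w, pert)
    print(f"pert={pert:g}  estimate={estimate:.6f}  "
          f"abs. error={abs(estimate - reference):.2e}")
```

In this sketch the absolute error of the estimate shrinks roughly in proportion to the perturbation, which is the behaviour expected of a forward difference. In hardware the perturbation cannot be made arbitrarily small, since the resulting change in error must remain measurable.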
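Note that, as $\eta$ and $pert_{ij}$ are both constants, the analog implementation version can simply be written as:

$$\Delta w_{ij} = G(pert_{ij}) \, \Delta E(w_{ij}, pert_{ij})$$

with $G(pert_{ij}) = -\eta / pert_{ij}$ and $\Delta E(w_{ij}, pert_{ij}) = E(w_{ij} + pert_{ij}) - E(w_{ij})$.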
The weight update hardware therefore only has to evaluate the error with the perturbed and the unperturbed weight and then multiply the difference by a constant.
This technique is ideal for analog VLSI implementation for the following reasons:
1. As the gradient $\frac{\partial E}{\partial w_{ij}}$ is approximated by $\frac{E_{pert} - E}{\Delta_{pert} w_{ij}}$ (where $\Delta_{pert} w_{ij}$ is the perturbation applied at weight $w_{ij}$), no back-propagation pass is needed and only the forward path is required. In terms of analog VLSI implementation, this means that no bidirectional circuits and no hardware for the back-propagation are needed. The hardware used for the operation of the network is also used for the training. Only simple circuits to implement the weight update are required. This considerably simplifies the implementation.
2. Compared to node perturbation, our technique does not require the two neuron addressing modules, routing and extra multiplication listed above.
Our technique does not require any overhead in routing and addressing connections to every neuron to deliver the perturbations, as the same wires used to access the weights are used to deliver the weight perturbations. Furthermore, node perturbation requires extra routing to access the output state of each neuron, and extra multiplication hardware is needed for the $\frac{\Delta E}{\Delta net_i} x_j$ terms, which is not the case with weight perturbation. Finally, with weight perturbation, the approximated gradient values can be made available, if needed, at a rather low cost.
III. Simulations
The "weight perturbation" technique was used on two test cases: XOR (feedforward and recurrent) and Intra-Cardiac Electro-Grams (ICEG). The learning procedure is implemented as shown in Figure 1.
The XOR network used has one hidden layer with two hidden units, two input units acting as pins and one output unit. The training was done in the on-line mode. The network parameters are shown in Table 1. The total mean squared error is shown in Figure 2 for both training with back-propagation and weight perturbation.

Figure 2: Mean square error for XOR training using back-propagation (xor-bp.err) and weight perturbation (xor-wp.err).
As Figure 2 shows, and as one may expect, the overall shapes of the error curves as a function of training iteration are very similar. All 4 XOR patterns are learned within 145 iterations with both techniques (the final average mean square errors are, however, not equal). A study of the weights produced by both techniques shows that they are extremely similar (the differences were not visible in weight density plots).
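The on-line weight perturbation training of the feedforward XOR network can be sketched in software as follows. This is a minimal illustration and not the simulator used for the reported results: the learning rate, perturbation size, weight initialisation, logistic squashing function and the choice of updating each weight immediately after its own perturbation are all assumptions, since the values of Table 1 and the details of Figure 1 are not reproduced here. All function names are hypothetical.

```python
import math
import random

# The four XOR patterns: ((input 1, input 2), target).
PATTERNS = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0),
            ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]


def squash(net):
    """Logistic squashing function (assumed)."""
    return 1.0 / (1.0 + math.exp(-net))


def output(w, x):
    """2-2-1 network; w holds 9 weights, every third one acting as a bias."""
    h0 = squash(w[0] * x[0] + w[1] * x[1] + w[2])
    h1 = squash(w[3] * x[0] + w[4] * x[1] + w[5])
    return squash(w[6] * h0 + w[7] * h1 + w[8])


def pattern_error(w, x, target):
    return 0.5 * (output(w, x) - target) ** 2


def total_error(w):
    return sum(pattern_error(w, x, t) for x, t in PATTERNS) / len(PATTERNS)


def perturb_and_update(w, x, target, eta, pert):
    """One on-line step: each weight is perturbed in turn for one pattern."""
    gain = -eta / pert                        # the constant G(pert)
    for k in range(len(w)):
        saved = w[k]
        e_ref = pattern_error(w, x, target)   # error with unperturbed weight
        w[k] = saved + pert
        delta_e = pattern_error(w, x, target) - e_ref
        w[k] = saved + gain * delta_e         # only forward passes were needed


def train(epochs=2000, eta=0.5, pert=1e-3, seed=0):
    """Train from random initial weights; returns weights and error history."""
    rng = random.Random(seed)
    w = [rng.uniform(-1.0, 1.0) for _ in range(9)]
    history = [total_error(w)]
    for _ in range(epochs):
        for x, t in PATTERNS:
            perturb_and_update(w, x, t, eta, pert)
        history.append(total_error(w))
    return w, history
```

Calling `train()` returns the final weights together with the total error after each epoch, which can be plotted against the iteration number in the same way as the error curves discussed above. Whether and how fast a given run converges depends on the assumed parameters and on the initial weights.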
ii. Recurrent XOR
A multi-layer recurrent network was trained using weight perturbation. The architecture of the network is shown in Figure 3 and the training parameters are shown in Table 2.
The same architecture was trained using Recurrent Back-Propagation based on the algorithm of Pineda [3]. The training error curves are shown in Figure 4. Although the two training techniques started from identical initial conditions, the convergence speeds were different and the two techniques reached different final weight solutions.

Another of our tests is the training of a three-layer perceptron to classify Intra-Cardiac Electro-Grams (ICEG). The training set contains 120 patterns, and the network has 21 input units, 10 hidden units and 5 output units. Figure 5 shows the mean square error for weight perturbation and back-propagation training.
Following the training, we tested the trained networks on a set of 2600 patterns. Training with back-propagation and training with weight perturbation led to identical performance: 91% correct classification.

To show the feasibility of learning with an analog implementation of weight perturbation, we have constructed a discrete component implementation of an XOR network. Figure 7 shows a block diagram of the network used and Figure 6 shows a picture of the hardware implementation (synapse and neuron boards). In addition, a PC was provided as a controller to orchestrate the presentation of training vectors and weight updates. The weights were stored as voltages on capacitors, which are periodically updated.
The voltage range for a signal is $\pm 10.0$ V and the weight values also have a range of $\pm 10.0$ V. This means that a mean square error of $10.0\ \mathrm{V}^2$ indicates that the network output signal and the training signal differ by 32%. Figure 8 shows a training session of the XOR network reaching a mean square error of $8.0\ \mathrm{V}^2$ using the weight perturbation algorithm. The apparent noise is due to A/D sampling errors and to noise on the network caused by weight and training vector refresh. We note that convergence occurs in spite of this noise level, demonstrating the robustness of the technique.
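The relation between a mean square error in volts squared and a percentage difference can be checked with a short calculation. The sketch below assumes that the quoted percentage is the root mean square error expressed as a fraction of the 10.0 V signal amplitude; this reading is an assumption, although it does reproduce the 32% figure for a mean square error of 10.0 V squared. The function name is hypothetical.

```python
import math

SIGNAL_AMPLITUDE_V = 10.0   # signals span -10.0 V to +10.0 V


def relative_rms_error(mse_v2, amplitude_v=SIGNAL_AMPLITUDE_V):
    """RMS error as a fraction of the signal amplitude (assumed definition)."""
    return math.sqrt(mse_v2) / amplitude_v


for mse in (10.0, 8.0):
    percent = 100.0 * relative_rms_error(mse)
    print(f"MSE {mse:.1f} V^2 -> RMS error is {percent:.0f}% of the amplitude")
```

Under the same assumption, the mean square error of 8.0 V squared reached in the training session corresponds to an RMS difference of about 28% of the amplitude.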
V. Conclusions
In this paper we show that weight perturbation is a very cheap and flexible learning technique for analog implementations of neural networks. We also show that it is more flexible than back-propagation and node perturbation (MR III). We demonstrate using simulations that weight perturbation produces the same performance as back-propagation and recurrent back-propagation. A discrete analog implementation was used to demonstrate the feasibility of multi-layer feedforward training using weight perturbation. The same technique can be used to train simple recurrent networks (like Elman networks) and continuously running recurrent networks for temporal sequence recognition (like Williams and Zipser networks). For all these networks it is easy to see that, as far as training is concerned, the hardware implementation using a weight perturbation architecture is very similar to that required for the normal operation of the networks.
