INTRODUCTION
In this paper we describe an analogue VLSI circuit which implements the Radial Basis Function (RBF) algorithm, based around a compact Euclidean Distance calculator. A prototype chip has been fabricated and tested. First, we describe the RBF architecture and present results from the test chip. We then discuss the impact of device variations, which traditionally cause problems for analogue systems. Finally, we discuss the scaling of our results from the small test chip to a full-sized RBF system.
RBF ARCHITECTURE
In the Radial Basis Function (RBF) algorithm the outputs, $O_j$, are generated as linear combinations of non-linear functions, $\phi$, usually referred to as centres or kernels. The argument of each centre $i$ depends upon the distance between the input vector $x$ and a reference vector $y_i$,

$$O_j = \sum_i w_{ij}\, \phi(\|x - y_i\|; \sigma_i) + b_j \qquad (1)$$

where $w_{ij}$ is the weight between centre $i$ and output $j$, $\sigma_i$ is the width of the $i$th centre, and $b_j$ is a bias term.
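As a software reference point for Equation (1), the following minimal sketch evaluates the RBF outputs for a single input vector. The text does not fix the kernel shape, so the Gaussian used here is an assumption, as are the function names `gaussian_kernel` and `rbf_forward` and the example centres, widths and weights.

```python
import math

def gaussian_kernel(distance, width):
    """Assumed kernel shape (not specified in the text): a Gaussian bump."""
    return math.exp(-(distance / width) ** 2)

def rbf_forward(x, centres, widths, weights, biases, kernel=gaussian_kernel):
    """Evaluate O_j = sum_i w_ij * phi(||x - y_i||; sigma_i) + b_j."""
    # One kernel activation per centre i, driven by the Euclidean distance.
    activations = []
    for y_i, sigma_i in zip(centres, widths):
        distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y_i)))
        activations.append(kernel(distance, sigma_i))
    # One weighted sum per output j; weights[i][j] links centre i to output j.
    outputs = []
    for j, b_j in enumerate(biases):
        total = b_j
        for i, phi_i in enumerate(activations):
            total += weights[i][j] * phi_i
        outputs.append(total)
    return outputs

# Illustrative values only: two centres, two outputs, unit widths.
centres = [(2.5, 2.5), (4.0, 3.0)]
widths = [1.0, 1.0]
weights = [[1.0, 0.0], [0.0, 1.0]]
biases = [0.0, 0.0]
result = rbf_forward((2.5, 2.5), centres, widths, weights, biases)
```

With the input placed exactly on the first centre, the first output is dominated by that centre's kernel and the second output is the much smaller response of the more distant centre.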
The RBF algorithm has two features which make it particularly suitable for an analogue implementation. Firstly, the algorithm is adaptive, and this fact can be used in conjunction with "chip-in-the-loop" training and analogue floating-gate devices to compensate for device variations. Secondly, the independent nature of the calculations implies that a highly parallel, high speed implementation can be achieved. Figure 1 shows the floorplan of an analogue RBF chip. The input vector $x$ is presented in parallel to a set of Euclidean distance cells, each of which calculates the distance between $x$ and a stored vector $y$. These distances are then passed through a non-linearity, and into a matrix of multipliers. The output vector is then formed 
Euclidean Distance Calculator
The basis for our analogue distance circuit is the similarity between the square of the Euclidean distance between two vectors, $x$ and $y$,

For convenience we chose to design our prototype chip using volatile floating-gate devices, where the reference data point is stored on a capacitor, as shown in figure 2, and refreshed periodically. In order to store a point in the circuit, the desired data point ($V_{data} + V_{min}$) is applied to the control gate marked in, and its complement ($V_{max} - V_{data}$) is applied to the other con- rent is a measure of the distance between $V_{data}$ and $V_{in}$.
The output currents from multiple circuits can be summed using Kirchhoff's law to give the square of the Euclidean distance between two multi-dimensional vectors [1]. The p-channel load transistor, also operating in saturation, acts as a current-voltage converter, taking the square root of the distance current. Hence the output voltage is proportional to the Euclidean distance between the input and reference vectors, with a maximum voltage output at the minimum distance between vectors. Figure 3 shows the measured output of a one-dimensional Euclidean distance circuit, demonstrating the programmability of $V_{data}$.

Although only one load transistor per vector is required, its input current, and therefore its optimal size, will be dependent on the number of dimensions in the vector. We therefore use a parallel distributed load, consisting of one diode-connected transistor per dimension, thus giving a modular architecture. The input vector, which is represented by voltages, is easily distributed to many reference vectors in parallel. This makes the circuit a suitable basis for a highly parallel architecture.

To obtain the desired nonlinearity, the output voltage of the distance circuit is fed into one input of a long-tail pair which implements the kernel or "bump" function. The other input is set to a programmable "width" voltage, which is itself generated by a single Euclidean distance cell. When the "distance" voltage is greater than the "width" voltage, i.e. the input is closer to the associated centre than a preset width, the tail current is steered into the output branch of the circuit. Conversely, when the input is much further away from the stored centre, the "width" voltage is higher than the "distance" voltage, and all the tail current is steered away from the output branch. The resulting output of the circuit is a bump current, centred on the stored vector, with a programmable width. Figure 5 shows some typical measured results.
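The signal flow described above (per-dimension squared-difference currents, Kirchhoff summation, square-root load, then current steering in a long-tail pair) can be sketched as an idealised behavioural model. This is not a transistor-level description: the constants `k`, `v_top`, `i_tail` and `v_scale` are hypothetical, and the smooth `tanh` steering characteristic is an assumed stand-in for the real long-tail pair.

```python
import math

def distance_voltage(v_in, v_data, k=1.0, v_top=5.0):
    """Idealised distance cell: highest output voltage at zero distance."""
    # Each dimension contributes a current proportional to the squared
    # difference; the currents simply add on a shared node.
    i_sum = sum(k * (a - b) ** 2 for a, b in zip(v_in, v_data))
    # The load takes the square root of the summed current (assumed ideal).
    return v_top - math.sqrt(i_sum / k)

def bump_current(v_dist, v_width, i_tail=1e-6, v_scale=0.1):
    """Assumed long-tail pair model: tail current steered by v_dist - v_width."""
    return i_tail * 0.5 * (1.0 + math.tanh((v_dist - v_width) / (2.0 * v_scale)))

# The width voltage comes from a single one-dimensional distance cell,
# here programmed for a width of 1 V (illustrative value).
v_width = distance_voltage((1.0,), (0.0,))
centre = (2.5, 2.5)

at_centre = bump_current(distance_voltage(centre, centre), v_width)
on_edge_x = bump_current(distance_voltage((3.5, 2.5), centre), v_width)
on_edge_y = bump_current(distance_voltage((2.5, 3.5), centre), v_width)
far_away = bump_current(distance_voltage((5.0, 5.0), centre), v_width)
```

In this model the output is close to the full tail current at the centre, falls to half the tail current at a distance equal to the programmed width in any direction, and is negligible far from the centre.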
For large widths, the output saturates, and the resulting bump has a flat top. In contrast, for very narrow widths the output will not reach the maximum desired current. This can be compensated for by the subsequent multiplier stage and therefore is not a problem. Figure 6 shows a contour plot of the measured output of the bump circuit in two dimensions. This data was taken from a circuit with two Euclidean distance cells, centred at (2.5, 2.5), and a bump circuit with its width set at 1 V. The contour plot confirms that the output is indeed radially symmetric. This is in contrast to the kernel function of [4], which is star-shaped and does not have a fully-programmable width. The more conventional shape of our kernel is closer to that used in software implementations of the RBF algorithm.
Multiplier
An analogue multiplier, or "synapse", is required to implement the multiplication by the weight coefficients $w_{ij}$ in Equation (1). In gen- The bump circuit produces a unidirectional output, and hence a two quadrant multiplier (positive bump times positive or negative weight) is needed. The key requirement for this application is that the multiplication should be linearly dependent on the output of the bump circuit. Any nonlinearity in the multiplication by the weight value is less important, as it can be compensated for during chip-in-the-loop training.

The multiplier of figure 7 is based on a modified Gilbert multiplier, and is similar to the synapse circuit used in the ETANN chip [5]. The basic structure is that of a transconductance amplifier composed of devices M1–M5. This multiplies the tail current, $I_{bump}$, by a voltage difference, $V_b - V_a$, where $V_a$ and $V_b$ are voltages stored on the floating gates of devices M1 and M2. However, since the tail current of M3 can vary from zero to several microamps, depending on how close the input data point is to the centre of the associated bump, the simple transamp circuit operates in both the subthreshold and the saturated regions, with a noticeable transition point where the gain changes. This reduces the linearity of the multiplication by the bump current to below an acceptable level.

The circuit can be modified by adding a bias current of a few microamps, set by the voltage $V_{mult\,bias}$, which ensures that the transamp formed by devices M1–M6 is always biased in saturation. A second long-tail pair, M7–M9, where M7 and M8 share the floating gates of M1 and M2 respectively, subtracts off this bias current to implement the overall function

The linearity of the multiplication by the bump current is greatly improved by the inclusion of this bias current.
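The bias-and-subtract idea can be illustrated with a small behavioural sketch. It assumes, purely for illustration, that each long-tail pair delivers a differential output equal to its tail current times a `tanh` function of $V_b - V_a$; the function names, the `v_scale` constant, the 2 µA bias value and the `mismatch` parameter are all hypothetical and are not taken from the chip.

```python
import math

def steering(delta_v, v_scale=0.5):
    """Assumed differential-pair characteristic, in the range [-1, 1]."""
    return math.tanh(delta_v / v_scale)

def multiplier_output(i_bump, v_a, v_b, i_bias=2e-6, mismatch=0.0):
    """Two-quadrant multiply of i_bump by a function of the weight v_b - v_a."""
    delta_v = v_b - v_a
    # Main transamp: its tail current is the bump current plus a fixed bias,
    # so it never drops towards zero tail current.
    i_main = (i_bump + i_bias) * steering(delta_v)
    # Second pair shares the same stored weight but carries only the bias;
    # `mismatch` models an imperfect match between the two pairs.
    i_cancel = i_bias * (1.0 + mismatch) * steering(delta_v)
    return i_main - i_cancel

ideal = multiplier_output(1e-6, 0.0, 0.2)
half = multiplier_output(0.5e-6, 0.0, 0.2)
negative = multiplier_output(1e-6, 0.2, 0.0)
offset = multiplier_output(0.0, 0.0, 0.2, mismatch=0.05)
with_offset = multiplier_output(1e-6, 0.0, 0.2, mismatch=0.05)
```

With perfect matching the bias term cancels and the output is linear in the bump current, with a sign set by the stored weight. With a non-zero `mismatch` the model leaves a residual current that depends on the weight but not on the bump current.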
Operation as a Classifier
Our test chip was designed and built using the 2 µm Orbit CMOS process. The small size and pin-count of the TINYCHIP (2.2 mm by 2.2 mm, and 40 pins) limited the size of the RBF network which could be implemented to three input dimensions, three kernels, each with programmable width, and two output classes. However, this was large enough to demonstrate the use of the analogue RBF circuit as a simple two-class classifier. Figure 9 shows the measured class output contours for a simple classification problem. Class 1 consists of the weighted sum of two kernels, centred at (2.5, 2.5) and (4, 3). Class 2 consists of the remaining kernel, centred at (4,
An edge effect can be seen in figure 9, where contours are distorted at the edge of the grid. This is believed to be due to the fact that the opamps used to generate the complementary input $V_{max} - V_{in}$ have a restricted input range and cease to work within 0.5 V of $V_{DD}$.
IMPACT OF DEVICE VARIATIONS
As with all analogue circuits, unavoidable inter-device parameter variations will occur, leading to mismatches between transistors. This has traditionally limited the dynamic range of analogue circuits. However, floating-gate devices are an intrinsic part of the Euclidean distance and multiplier subcircuits, and as Carley has pointed out [6], floating-gate devices can be used to trim analogue circuits to compensate for variations between devices. The RBF is an adaptive algorithm, and this too can be exploited to compensate for device variations during a simple "chip-in-the-loop" training procedure. The Euclidean distance cells can each be programmed individually in a feedback loop until the current flowing when $V_{in} = V_{data}$ is correct.
Centre positions accurate to better than 1 mV have been achieved. To allow each row to be programmed individually, each kernel and multiplier can be isolated. The tail and bias currents in the parallel circuits can be turned off, so that currents from no other rows appear in the output.

Device mismatches within the bump circuit will lead to the width and height of the bump being subject to slight errors. The variations in height can be compensated for automatically in the following multiplier circuits, and cause no particular problems. Errors in the bump width can either be tolerated, or compensated for if necessary by using chip-in-the-loop training to change the value stored in the Euclidean distance cell which sets the width. Hence floating-gate devices are not required in the bump circuit.

By programming each of the multiplier circuits individually in a feedback loop, until the current supplied by each subcircuit is correct, we can combine the programming and trimming functions together. However, random mismatches and non-idealities in the non-floating-gate devices in the multiplier circuit still lead to the bias current not being subtracted perfectly. The effect of this is to produce a small constant offset current which is dependent on the stored weight but is independent of the bump current. This can be seen in figure 8. This is not a serious problem, since the RBF definition, Equation (1), contains a linear bias term which can be adjusted to compensate for the particular offset of each chip.

Figures 10 and 11 show the output current of one class line before and after trimming. The voltages stored on the floating gates of the multiplier circuits have been adjusted to reduce the height of the main bump slightly, to correct an error in the weight, and the unwanted bumps have had the outputs of the associated multipliers reduced from 230 nA to 40 nA, making them much less prominent.
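The feedback programming described in this section can be sketched as a simple measure-and-adjust loop. The text does not give the update rule used on the chip, so the proportional correction below is an assumption, and `trim_in_loop`, `simulated_cell` and all numerical values are hypothetical stand-ins for the real measurement set-up.

```python
def trim_in_loop(measure, target, initial=0.0, gain=0.5,
                 tolerance=1e-3, max_steps=100):
    """Adjust a stored value until measure(stored) is within tolerance of target."""
    stored = initial
    for step in range(max_steps):
        error = target - measure(stored)
        if abs(error) <= tolerance:
            return stored, step
        # Assumed proportional correction; the chip's actual rule is not stated.
        stored += gain * error
    raise RuntimeError("trimming did not converge")

def simulated_cell(stored, offset=0.12, gain_error=0.9):
    """Stand-in for one cell with an unknown offset and gain error."""
    return gain_error * stored + offset

stored, steps = trim_in_loop(simulated_cell, target=2.5)
```

Because the loop only looks at the measured output, the device's offset and gain error are absorbed into the stored value. This mirrors the way programming and trimming are combined in the text.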
SCALING TO A LARGE SYSTEM
The results from the small test chip can be used to estimate the performance of a full-sized system. Conservative device sizes were chosen for the test chip, with the floating-gate devices being based on 8 µm × 8 µm MOSFETs. The minimum gate length used elsewhere was 5 µm, to minimise short-channel effects. The target process was the MOSIS-compatible 2 µm double-poly double-metal Orbit N-well CMOS process. We estimate that a 1 cm square chip could contain an RBF network with 150 kernels, 32 inputs and 16 outputs. More kernels could be accommodated by a corresponding decrease in the number of input dimensions and/or outputs. The volatile storage mechanism used on the prototype would be unsuitable for the full-sized circuit, since the need to refresh the capacitors would reduce the time available for performing computations. Ideally, therefore, the data should be stored on the floating gates in a non-volatile manner, for example using Fowler-Nordheim tunnelling [7]. Simulations suggest that such a chip would be capable of operating at up to $10^9$ multiply-accumulates per second, whilst consuming less than half a watt of power. This is significantly better than a TMS320C40, which has a maximum speed of approximately $10^7$ multiply-accumulates per second, and consumes approximately 2 W.
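The comparison above can be made concrete with a short back-of-envelope calculation. The figures for network size, throughput and power are those quoted in the text. Treating each distance term and each weight multiplication as one multiply-accumulate, and taking 0.5 W as the analogue power (the text gives it only as an upper bound), are assumptions made for this estimate.

```python
# Network size quoted for a 1 cm square chip.
kernels, inputs, outputs = 150, 32, 16

# Assumed operation count per input pattern: one operation per distance
# term (kernels x inputs) plus one per weight (kernels x outputs).
macs_per_pattern = kernels * inputs + kernels * outputs

analogue_rate = 1e9   # multiply-accumulates per second (simulated estimate)
dsp_rate = 1e7        # multiply-accumulates per second (TMS320C40)
analogue_power = 0.5  # watts, upper bound quoted in the text
dsp_power = 2.0       # watts

patterns_per_second = analogue_rate / macs_per_pattern
speedup = analogue_rate / dsp_rate
energy_per_mac_analogue = analogue_power / analogue_rate  # joules
energy_per_mac_dsp = dsp_power / dsp_rate                 # joules
energy_ratio = energy_per_mac_dsp / energy_per_mac_analogue

print(f"{macs_per_pattern} operations per pattern")
print(f"{patterns_per_second:.0f} patterns per second")
print(f"speed ratio {speedup:.0f}, energy-per-operation ratio {energy_ratio:.0f}")
```

Under these assumptions the quoted throughputs correspond to a speed ratio of two orders of magnitude, and the energy per operation differs by a larger factor because the analogue chip is also estimated to draw less power.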
CONCLUSIONS
We have presented results from an analogue Radial Basis Function chip which is based around a compact Euclidean distance calculating circuit. Floating-gate devices are used to program the circuit, and to compensate for inter-device variations. Chip measurements confirmed the functionality of the circuits. Simulations suggest that a large-scale implementation of an RBF system using this architecture would be two orders of magnitude faster than a TMS320C40, whilst consuming only a fraction of the power.
