Introduction
The core of the chip is a neurally inspired scalable (reconfigurable) array network for compatibility with VLSI.
The chip is endowed with tested auto-learning capability, realized in hardware, that achieves global task auto-learning execution times in the microsecond-to-millisecond range.
The architectural forward network (and learning modules) process in analog continuous-time mode, while the (converged, steady-state) weights/parameters can be stored on chip in digital form. The overall architectural design adopts engineering methods from adaptive networks and optimization principles [1, 2].
The designed chip can handle 16 inputs and 16 outputs. In addition, there are inputs for the control interface, synchronization, and stand-alone programmability of the chip, resulting in an approximate die area of 4000 × 6000 µm² in a QFP-208L package.
0-7803-7044-9/01/$10.00 ©2001 IEEE
The 6-layer copper (Cu) interconnect, single-poly, 0.18 micron process enables dense connectivity and a dense die area for this highly interconnected network, resulting in a compact, powerful engine. Moreover, the low resistance and low capacitance of copper permit the design to achieve high connectivity while still managing precise distributions of resistive and capacitive loads. These properties make it possible to predict performance and to limit signal time delays along the interconnect. The small feature size and the electrical properties of the copper interconnect are enablers for realizing such a powerful chip with dense interconnectivity.
Hierarchical Structure of the Chip

Figure 1: Architectural Overview of the Chip
The chip operates in four different modes: (i) learn, (ii) (on-chip) store, (iii) program read/write, and (iv) process, which selectively combine its intrinsic analog and digital building blocks in a novel manner (see fig. 1).
Initially, the system-level chip design was simulated and verified using SIMULINK/MATLAB. All the building blocks were custom designed and extensively simulated using HSPICE (incorporating the UMC level 49 models).
The design was implemented in the 6-level copper (Cu) interconnect, single-poly, 0.18 micron process, which is an enabler for the dense connectivity and dense die area of this highly interconnected network. The design was laid out and verified using Cadence tools. More details of the high-level design, the circuit design of the major blocks, and the chip layout are provided in the subsequent sections [8, 9].
The resulting chip design requires no traditional programming or coding [2, 3]. In addition to the novel architectural design, the hardware also carries the heavy computational burden by selectively realizing programmability as on-chip auto-learning modules. The resulting System-on-Chip operates from a 1.5 V power source and consumes approximately 1 mW of power.
Architectural Design
The design process comprised consecutive stages, based on a top-down definition of the chip. A general definition of the functionality and intended applications was created, and the chip design was then developed at three different levels.
The first is a high-level design, specifying the characteristics of the neural network to be implemented and defining its basic building blocks. The second is a circuit-level design, describing each of these blocks in the copper technology together with their corresponding simulations. The third is a layout-level design, where the actual chip layout is created and verified.
We present our stepwise approach for tailoring the backpropagation (BP) algorithm so that it becomes suitable for VLSI implementation. For illustration, in each case we present the simulated results for the XOR problem, highlighting the effect of each modification on the performance of the algorithm.
High Level
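$$\dot{W}^{(2)} = \eta \cdot NET'^{(2)} \cdot \big( (D - y) \cdot W^{(3)} \cdot X^{(2)} \big) - \alpha \cdot W^{(2)}$$

$$\dot{W}^{(1)} = \eta \cdot NET'^{(1)} \cdot NET'^{(2)} \cdot \big( (D - y) \cdot W^{(3)} \big) \cdot W^{(2)} \cdot X^{(1)} - \alpha \cdot W^{(1)}$$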
Modified BP algorithm with nonlinear multiplier.
This update law and the subsequent one account for the resistive elements present in hardware circuits and for multiplication performed with a Gilbert multiplier. Each case uses an update law expressed through the error gradient $\partial E / \partial W$, where

$$NET^{(n)} = \tanh(W^{(n)}) \cdot \tanh(X^{(n)})$$

BP Algorithm with non-linear multiplier
The update laws are

$$\ldots \tanh\Big( \tanh\big( \tanh(y'(NET^{(2)})) \cdot \tanh( \tanh(D - Y) \cdot \tanh(W^{(3)}) ) \big) \cdot \tanh(W^{(2)}) \Big) \cdot \tanh(X^{(1)}) - \alpha \cdot W^{(1)}$$

The Modified BP algorithm with nonlinear multipliers and the removal of the derivative function in all the hidden layers.
The update laws are

$$\dot{W}^{(1)} = \ldots \cdot \tanh(X^{(1)}) - \alpha \cdot W^{(1)}$$

2) When the derivative function of the second hidden layer is removed, the neural network can still converge. In fact, this can be easily verified mathematically.
3) When the derivative functions in all hidden layers are removed, the neural network can still converge, but the training error does not reach zero; instead it settles at a small constant mean value. The update law derived in this case is still a gradient-type law [7].
For all presented simulation results, we used the same set of initial conditions for all the models, i.e., all initial conditions set to zero. In general, the same initial conditions may yield different training results for different nonlinear systems. An initial weight set that gives a good training result for one nonlinear system will not necessarily give a good training result for another.
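The variants discussed above (ideal versus saturating multiplier, with or without the hidden-layer derivative, always with a decay term) can be compared numerically. The following is only a minimal sketch under stated assumptions, not the authors' implementation: it uses a toy network with two weight layers and a linear output, replaces the continuous-time dynamics with a forward-Euler step of size `dt`, and models each hardware multiplication a·b as tanh(a)·tanh(b). The names `mult`, `update_step`, `eta`, `alpha`, `dt`, `nonlinear`, and `use_derivative` are hypothetical.

```python
import numpy as np

def mult(a, b, nonlinear):
    """Elementwise product; the nonlinear form mimics a saturating multiplier."""
    return np.tanh(a) * np.tanh(b) if nonlinear else a * b

def update_step(W1, W2, x, d, eta=0.5, alpha=0.01, dt=0.1,
                nonlinear=False, use_derivative=True):
    """One forward-Euler step of a gradient-type law with weight decay.

    W1: (h, n) hidden weights, W2: (m, h) output weights,
    x: (n,) input vector, d: (m,) desired output.
    Returns the updated weights and the output computed before the update.
    """
    net1 = mult(W1, x[None, :], nonlinear).sum(axis=1)   # hidden pre-activation
    h = np.tanh(net1)                                    # hidden activation
    y = mult(W2, h[None, :], nonlinear).sum(axis=1)      # linear output (assumed)
    err = d - y
    # Error propagated back through the output weights.
    back = mult(W2, err[:, None], nonlinear).sum(axis=0)
    if use_derivative:
        delta1 = mult(1.0 - h**2, back, nonlinear)       # tanh'(net1) = 1 - h^2
    else:
        delta1 = back                                    # derivative removed
    dW2 = eta * mult(err[:, None], h[None, :], nonlinear) - alpha * W2
    dW1 = eta * mult(delta1[:, None], x[None, :], nonlinear) - alpha * W1
    return W1 + dt * dW1, W2 + dt * dW2, y
```

Calling `update_step` repeatedly over the four XOR patterns with different flag settings gives a rough way to see how each modification changes the training behavior; the sketch makes no claim about reproducing the reported results.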
Conceptual Chip Design
Presented below is a higher block-level design of the chip. The initial and highest-level design of each neural cell shows the central idea of the processing network and the learning network (Fig. 7).
The processing stage is composed of 16 neurons built using vector multipliers and a sigmoid function. The multipliers use as operands an input vector and a weight vector. The input is common to all processing units, and the weights belong to each neuron. The scalar product is then applied to the non-linear function, resulting in the output of a neuron.
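The processing stage described above amounts to one scalar product per neuron followed by a squashing function. The sketch below illustrates this under assumptions that are not stated in the text: the sigmoid is modeled as tanh, and each weight vector has 17 entries, the 17th multiplying a constant input of 1 (a bias). The names `process`, `N_NEURONS`, and `N_INPUTS` are hypothetical.

```python
import numpy as np

N_NEURONS = 16   # neurons in the processing stage (from the text)
N_INPUTS = 16    # chip inputs (from the text)

def process(weights, x):
    """Forward pass of the processing stage.

    weights: (N_NEURONS, N_INPUTS + 1) array, one weight vector per neuron.
    x: (N_INPUTS,) input vector shared by all neurons.
    """
    x_aug = np.append(x, 1.0)   # assumed constant 17th operand (bias)
    net = weights @ x_aug       # one scalar product per neuron
    return np.tanh(net)         # sigmoid nonlinearity, modeled as tanh
```

Because the input vector is shared, the whole stage reduces to a single matrix-vector product, which is what the array of vector multipliers computes in parallel.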
The learning stage works in a manner similar to the processing stage, but uses different sources for the product. It receives the signals from the next stage, creates a new signal to be sent to any previous cell, and updates the weights according to the update law. The learning and processing networks are merged locally. To accomplish this, it was necessary to decompose the 17-D multipliers that constituted each network node into a set of 1-D multipliers.
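The decomposition mentioned above can be pictured as follows. This is an idealized sketch, not a circuit model: each 1-D multiplier is treated as an exact product, and the summation of output currents on a shared wire is modeled as plain addition. The function names are hypothetical.

```python
import numpy as np

def multiplier_1d(w, x):
    """Idealized 1-D multiplier: one weight times one input."""
    return w * x

def multiplier_17d(w_vec, x_vec):
    """17-D multiplier built from 1-D cells whose outputs are summed."""
    return sum(multiplier_1d(w, x) for w, x in zip(w_vec, x_vec))
```

With ideal cells, `multiplier_17d(w, x)` equals the scalar product `np.dot(w, x)`; keeping the cells separate is what lets the processing and learning paths share the same local weight.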
On-chip memory is designed as local digital memory. It is therefore necessary to add a stage where the current analog value of the weight is converted into a digital value using an ADC, and then converted back using a DAC. The memory is built using 5 data flip-flops. The update law, however, uses a capacitor (see Fig. 7) and one-dimensional (1-D) multipliers. These multipliers are also used in each neuron to form the 17-dimensional (17-D) multipliers.
To reduce the number of ADCs required for converting the weights while still achieving good performance, an array of ADCs was placed away from the neural network. With this configuration, one ADC can be shared by a whole row of weights, reducing the number of ADCs. This design uses multiplexers, decoders, and control logic for the store mode, and requires a clocked input to drive this logic. The clock also regulates the ADC operation, since the ADC is of the successive-approximation type. Note that having a clock in this section does not imply that the neural network stops being asynchronous. For a more detailed review of the chip architecture, refer to [1, 8, 9].
Design and Layout of Components
There are a number of custom-designed components in this chip. All the component circuits were designed in Star-HSPICE using BSIM Level-49 models supplied by SRC/UMC. Avant! software was used for schematic entry and waveform viewing. Initial layout of the subcircuits was carried out in Tanner Tools, but verification and LVS were performed using Cadence tools. In this paper we restrict ourselves to presenting results on the more vital components of the chip: a Gilbert multiplier, a wide transconductance amplifier, and a comparator. In addition, demonstrative simulations for the ADC and the vector multiplier are provided.
Gilbert Multiplier
To implement the multiplication in the analog domain, a Gilbert multiplier cell has been employed. The circuit diagram of the modified Gilbert multiplier is shown in fig. 8. Assume that all transistors in fig. 8 are in the saturation region and are matched, so that the transconductance parameters satisfy $\beta_N = \beta_{M1} = \beta_{M2}$ and $\beta_P = \beta_8 = \beta_9 = \beta_{10} = \beta_{11}$. The modified Gilbert multiplier takes the difference between two voltages (V3 - V4) and multiplies it by the difference of two other voltages (V1 - V2).
In the small-signal range, the characteristic curve is approximately linear, with all four inputs carrying multiplication information. In the large-signal range, the multiplier is non-linear but does not cause any instability. The multiplier layout is shown in fig. 9, while the HSPICE-simulated DC characteristics, shown in fig. 10, exhibit four-quadrant multiplication.
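The qualitative behavior described above (linear for small signals, saturating for large ones, valid in all four quadrants) can be reproduced with a simple behavioral model. The tanh-product form below is an assumption chosen for illustration, as are the parameters `i_bias` and `gain`; it is not derived from the transistor-level circuit of fig. 8.

```python
import math

def gilbert_multiplier(v1, v2, v3, v4, i_bias=1.0, gain=1.0):
    """Behavioral model of the output current for inputs (v1 - v2) and (v3 - v4).

    The tanh-product form and both parameters are illustrative assumptions.
    """
    return i_bias * math.tanh(gain * (v1 - v2)) * math.tanh(gain * (v3 - v4))
```

For small differential inputs the output is close to `i_bias * gain**2 * (v1 - v2) * (v3 - v4)`, and for large inputs its magnitude is bounded by `i_bias`, so the model stays well behaved outside the linear range.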
Wide Transconductance Amplifier
The wide transconductance amplifier is used in multiple roles in this chip. It serves as the sigmoid non-linearity at the end of each row, and as a buffer for on-chip signals.
For the transconductance amplifier, the differential-in, differential-out transconductance is given by the equation
The wide transconductance amplifier was preferred over the simple transconductance amplifier for its better characteristics with respect to: input common-mode voltage range; input/output voltage swing; and transistor size/current mismatch, and hence common-mode gain.
The designed amplifier was obtained by adding two extra current mirrors to the simple transconductance amplifier. By reflecting the currents of M1 and M2 to the upper current mirrors, the output current is just the difference between I1 and I2, with the advantage that both input and output voltages can run almost up to V_DD and almost down to V_SS without affecting the operation of the circuit.
The output current, I_out, in the schematic (fig. 11) is converted to a voltage using a 2-CMOS active load (not shown in the schematic). The complete layout is shown in fig. 12, with the simulated transfer characteristics shown in fig. 13.
ADC Operation
The simulation results below illustrate the conversion of an analog voltage of 0.85 V to its digital equivalent by the successive-approximation ADC internal to the chip. In this simulation the conversion takes approximately 3 µs, implying that all the converged analog weights will be converted to their digital equivalents in approximately 3 µs × 17 = 51 µs.
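The conversion and the timing estimate above can be illustrated with a short successive-approximation sketch. Two values here are assumptions rather than figures from the text: the resolution `n_bits=5` (taken to match the 5 flip-flops of the weight memory) and the reference `v_ref=1.5` (taken equal to the supply voltage). The function names are hypothetical.

```python
def sar_adc(v_in, v_ref=1.5, n_bits=5):
    """Successive approximation: try bits from MSB to LSB and keep each bit
    whose trial DAC voltage does not exceed the input."""
    code = 0
    for bit in reversed(range(n_bits)):
        trial = code | (1 << bit)
        if trial * v_ref / (1 << n_bits) <= v_in:
            code = trial
    return code

def dac(code, v_ref=1.5, n_bits=5):
    """Ideal DAC: map a digital code back to a voltage."""
    return code * v_ref / (1 << n_bits)

T_CONV_US = 3          # conversion time per weight in microseconds (from the text)
CONVERSIONS = 17       # sequential conversions per shared ADC (from the text)
total_us = T_CONV_US * CONVERSIONS   # 3 * 17 = 51 microseconds
```

Under these assumptions, `sar_adc(0.85)` returns code 18, which `dac` maps back to 0.84375 V, within one least-significant bit (1.5 V / 32) of the input.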
Vector Multiplications
Presented below are the transient characteristics of the multiplication of two time-domain sinusoids. The third waveform presents the result, which is the combined current output of seventeen multipliers collected on current busbars and converted to the voltage domain at the end using a 4-CMOS active load.
Overall Chip Layout & Interconnects
The figure below shows the array structure of the implemented chip. The best possible fit of the building sub-cells was sought in order to achieve a highly dense building-block structure. A hierarchical routing assignment was made for the available 6 metal layers to achieve the required dense connectivity. For more details see [1].
