Introduction
The single Neural System-on-Chip (SoC) design impacts several domains of critical applications, including nanoscale biotechnology, automotive sensing, control and actuation, wireless communications, and feature extraction and pattern matching. The core of the chip is a neurally inspired, scalable (re-configurable) array network compatible with VLSI. The chip is endowed with tested auto-learning capability, realized in hardware. It handles 16 inputs and 16 outputs, with a separate set of 16 inputs for applying training patterns. Additional inputs provide the control interface, synchronization, and stand-alone programmability of the chip, resulting in an approximate die area of 4000 x 6000 µm². The resulting SoC consumes approximately 1.5 mW of power at a supply voltage of 1.5 V.
The forward network (and the learning modules) of the architecture operate in analog continuous-time mode, whereas the (converged, steady-state, and/or programmed) weights can be stored on chip in digital representation. The local static memory solution was adopted to counter the problem of weight decay in generic analog implementations of neural networks. Further, it ensures autonomous high-speed operation of the designed SoC. The overall architectural design adopts engineering methods from adaptive networks and optimization principles [3, 4].
The 6-layer copper (Cu) interconnect, 0.18 micron single-poly process enables small feature size and high connectivity, so that this highly interconnected network fits in a dense die area and forms a compact, powerful engine. Moreover, the low resistance and low capacitance of copper permit the design to achieve high connectivity while still managing precise distributions of the resistive and capacitive loads. These properties enable better prediction of performance and also limit signal time delays along the interconnect.
Architectural Design
The design process comprised consecutive stages, based on a top-down definition of the chip. A general definition of the functionality and the intended applications was created.
The development of the final chip design was achieved by meticulously working through three different design stages.
High Level Design: Modified BP Algorithm
For implementation in VLSI, we tailored and tested the BP algorithm so that it became convenient to realize in hardware. For more details, see [1, 5].
Using simulations, we verified that the network still converges and operates after incorporating the following changes in the modified BP algorithm:
Replacement of the ideal linear multiplier model by the realistic nonlinear VLSI multiplier model.
Removal of the derivative function in all hidden layers.
Note that as a result of the above modifications, the training error does not reach zero but attains a small constant mean value. However, the update law derived in this case is still a gradient-type law [3].
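The two modifications listed above can be illustrated with a minimal sketch. The saturating `multiplier` model (a product of hyperbolic tangents), the tanh activation, and all function names are assumptions made for illustration only; the source does not specify the multiplier characteristic. The sketch shows that dropping the activation derivative in a hidden layer rescales the error term by a positive factor, so the sign of the update is preserved.

```python
import math

def multiplier(a, b):
    """Hypothetical saturating multiplier: close to a*b for small inputs,
    bounded for large ones. A stand-in for a realistic VLSI multiplier."""
    return math.tanh(a) * math.tanh(b)

def activation_slope(net):
    """Derivative of a tanh activation (assumed); never negative."""
    return 1.0 - math.tanh(net) ** 2

def hidden_delta(back_error, net, drop_derivative=True):
    """Error term of a hidden neuron.

    back_error is the error propagated back from the next layer.
    With drop_derivative=True the activation derivative is omitted,
    which is the second modification listed in the text.
    """
    if drop_derivative:
        return back_error
    return back_error * activation_slope(net)

def weight_step(rate, delta, x):
    """Weight increment formed with the nonlinear multiplier
    instead of an ideal product (first modification)."""
    return rate * multiplier(delta, x)

modified = hidden_delta(0.3, 0.5)
standard = hidden_delta(0.3, 0.5, drop_derivative=False)
print(modified, standard)
```

Because the omitted derivative is a positive factor wherever the activation has not saturated, the modified error term keeps the direction of the standard one, only with a different magnitude.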
Chip Floorplan
The chip is designed to be highly modular. The concept of modularity was built into the design right from the conception phase and was realized as a granular structure of the chip as the design progressed.
The whole chip is composed of four cascaded building blocks and their interconnections. In addition, there are some global digital logic control elements (see Fig. 1). The first and the last of these blocks are routed to the padframe of the chip, as the input and the output layers of the neural engine.
Structure of the Building Block
The main building block of the chip comprises a 16x18 array of building cells. The first 16x1 cells are the digital cells, while the remaining 16x17 array is formed of synaptic cells. On the output side, this array of synaptic cells is padded by another column of buffers for signals to be connected to other building blocks or the padframe. This stage also includes difference amplifiers used to determine the error, i.e. the difference between target inputs and block outputs, for tuning the local weights. An overview of this structure is provided in Fig. 2. The digital cell array receives the digital supervisory signals directly from external pins, plus some global synchronization signals generated within the chip. These signals are interpreted, and appropriate logic signals for the control of the synaptic cells are generated. For the purpose of control, the synaptic cells are address-coded in rows and columns. This allows for parallel management of building block resources as well as chip-level resources. Further details are discussed in subsequent sections; see also [1, 2, 6].
The synaptic array can be decomposed into cascaded processing stages. Each processing stage is composed of 16 neurons, each built using 17 synaptic cells and a sigmoid function. Current bus bars are used to collect the output currents from each cell in a processing stage. These bus bars run horizontally and vertically for common row/column outputs. Separately designed sigmoid functions and CMOS linear resistors are used to convert these currents to voltages.
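The behavior of one processing stage can be sketched as follows: each synaptic cell contributes a current proportional to weight times input, the bus bar sums these currents, and a resistor followed by a sigmoid converts the sum to an output voltage. This is a behavioral sketch only. The tanh sigmoid, the unit load resistance `r_load`, and the use of the 17th synapse as a constant bias input are assumptions, since the source does not state what the 17th cell is connected to.

```python
import math

N_NEURONS = 16    # neurons per processing stage (from the text)
N_SYNAPSES = 17   # synaptic cells per neuron (from the text)

def neuron_output(weights, inputs, r_load=1.0):
    """Sum the synaptic currents on the bus bar, convert the total to a
    voltage through a linear resistor, then apply the sigmoid."""
    bus_current = sum(w * x for w, x in zip(weights, inputs))
    return math.tanh(r_load * bus_current)

def processing_stage(weight_rows, signals, bias_input=1.0):
    """Evaluate all 16 neurons of a stage for 16 input signals.

    The 17th synapse input is assumed here to be a constant bias.
    """
    inputs = list(signals) + [bias_input]
    if len(inputs) != N_SYNAPSES or len(weight_rows) != N_NEURONS:
        raise ValueError("expected 16 signals and 16 rows of 17 weights")
    return [neuron_output(row, inputs) for row in weight_rows]

# Example: uniform weights of 0.1 and all inputs at 1.0
rows = [[0.1] * N_SYNAPSES for _ in range(N_NEURONS)]
print(processing_stage(rows, [1.0] * 16)[0])
```

Cascading stages then amounts to feeding the 16 outputs of one call into the next as its `signals`.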
The Synaptic Cell:
Each synaptic cell is composed of three analog Gilbert multipliers, a storage capacitor, a linear resistor, a set of transmission gates, and 5 data flip-flops. [...] and error signal e_i to determine the updated weight. The current-to-voltage conversion for this multiplier is done locally using a MOS resistor. In the store mode, the locally stored weight value, held on the storage capacitor, is converted to its digital representation. This conversion takes place in the corresponding digital cell. The converted weights are stored locally in the 5-bit memory. In the processing stage, the Multiplying DAC (MDAC) converts the stored digital bits to the equivalent weight representation. At this stage the polarity of signal p is adjusted so that the output of the MDAC is used as the local weight instead of the charge stored on the capacitor. The R signal is connected to the reset of all local data flip-flop memory, enabling the local memory to be reset independently of digital cell operation. The signal Ck is the clock to the data flip-flops for storage of the converted weights.
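The interplay of the analog weight, the 5-bit local memory, and the select signal can be summarized in a small behavioral model. Everything below is a sketch under stated assumptions: the class and method names are hypothetical, the weight is normalized to the range 0 to 1, and quantization is done by simple truncation, none of which is specified in the source.

```python
class SynapticCell:
    """Behavioral model of one synaptic cell's weight storage (hypothetical)."""

    BITS = 5  # width of the local flip-flop memory (from the text)

    def __init__(self):
        self.analog_weight = 0.0  # value held on the storage capacitor
        self.code = 0             # contents of the 5 data flip-flops
        self.use_mdac = False     # plays the role of select signal p

    def learn(self, delta):
        """Learning mode: adjust the capacitor value, clamped to [0, 1]."""
        self.analog_weight = min(max(self.analog_weight + delta, 0.0), 1.0)

    def store(self):
        """Store mode: digitize the analog weight into the local memory
        and switch the cell over to the MDAC output."""
        levels = 2 ** self.BITS
        self.code = min(levels - 1, int(self.analog_weight * levels))
        self.use_mdac = True

    def reset(self):
        """R signal: clear the local memory."""
        self.code = 0

    def weight(self):
        """Weight seen by the forward multiplier."""
        if self.use_mdac:
            return self.code / 2 ** self.BITS
        return self.analog_weight

cell = SynapticCell()
cell.learn(0.5)
cell.store()
print(cell.code, cell.weight())
```

Once `store` has run, later drift of the capacitor value no longer affects the weight used in the forward path, which is the point of the local digital memory.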
Multiplying DAC (MDAC)
[...] with the EN-C1Z signal connected to the transmission gate, provide this cell with the capability to program the weights. The signal EN-C1Z is generated by another set of chip-level decoders, which are supplied with the row/column selection during programming.
Design and Layout of sub-cells
All the components for this chip were custom-designed. All the component circuits were designed in Star-Hspice using BSIM Level-49 models supplied by SRC/UMC. Avant!'s software SUE and MetaWaves were used for the schematic [...]
The M-DAC is used for digital-to-analog conversion in the synaptic and digital cells of the designed system-on-chip. The reference current provided to the M-DAC, set by a bias voltage, is 0.8 µA. Each successive current multiplying stage i is designed so that the current through it is 2^i times the reference current. The input to the M-DAC is a 5-bit digital value d4-d0. These bits enable or inhibit the corresponding multiplying stage.
The total current through the multiplying stages of the M-DAC, reflected through a current mirror, is used as the output. Upon conversion to a voltage using a MOS linear resistor, the digital value is mapped to an analog voltage between 0.63 V and 1.1 V at the output. The resolution of the M-DAC is therefore approximately 14 mV.
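The binary-weighted current summation described above can be written out as a short numerical sketch. The 0.8 µA reference, the 5-bit input, and the 0.63 V to 1.1 V output range come from the text; the function names, the LSB-first bit ordering, and the assumption that the output voltage is linear in the code with a step of the range divided by 2^5 are illustrative choices, not taken from the source.

```python
I_REF = 0.8e-6            # reference current in amperes (from the text)
N_BITS = 5                # input width d4..d0 (from the text)
V_MIN, V_MAX = 0.63, 1.1  # output voltage range in volts (from the text)

def code_to_bits(code):
    """Split an integer code into bits [d0, d1, d2, d3, d4] (LSB first)."""
    return [(code >> i) & 1 for i in range(N_BITS)]

def mdac_current(bits):
    """Total current: stage i carries 2**i times I_REF when bit d_i enables it."""
    if len(bits) != N_BITS:
        raise ValueError("expected 5 bits")
    return sum(d * (2 ** i) * I_REF for i, d in enumerate(bits))

def mdac_voltage(bits):
    """Assumed linear current-to-voltage mapping with a step of
    (V_MAX - V_MIN) / 2**N_BITS per code."""
    lsb = (V_MAX - V_MIN) / 2 ** N_BITS
    code = round(mdac_current(bits) / I_REF)
    return V_MIN + code * lsb

for code in (0, 1, 16, 31):
    bits = code_to_bits(code)
    print(code, mdac_current(bits), mdac_voltage(bits))
```

Under this assumed mapping the step size is 0.47 V / 32, a little under 15 mV, which is in line with the resolution quoted in the text.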
Data Flip-flops
Data flip-flops are used as registers in the ADCs and also as local digital memory in each synaptic cell. Each flip-flop is made up of four NAND gates and one NOT gate. This two-stage structure was adopted so that transitions in the D signal do not affect the flip-flop output directly. The configuration used (see Fig. 7) needs fewer transistors than a typical master-slave setup, yet provides adequate performance and isolation. The maximum delay for the output of the flip-flop to go from high to low is 130 ps, and the maximum delay from low to high is 70 ps. There are some minor artifacts in the output as the input changes along with the applied clock. However, the maximum glitch is less than ±0.1 V.
The signal inputs, network outputs, and training inputs of the neural network can be either analog or digital. The biases and the reference signals are used to tune the performance characteristics of the learning elements. These signals will be generated using a card from IC Tech Inc., which is primarily capable of a high number of digital I/O; in addition, it can supply a number of accurate static voltages generated using digital potentiometers and isolated lines. The digital/logic control signals B0-B4, C0-C4, S0-S3, and IN0-IN3 will be generated either by the board above or by National Instruments' LabVIEW-based digital I/O cards. By design, the neural system-on-chip (nicknamed Microlearner) can accept and process signals with bandwidths greater than 100 MHz; this limit cannot be tested by the current setup. Therefore, after some initial validation of the chip operation, the testing platform will be changed. The testing results for the chip will be reported [...]
