Abstract
Introduction
Although neural networks (NNs) compute in an exceptionally parallel manner, this valuable characteristic has not been exploited as successfully as their learning capability. In fully parallel hardware, the processing time is independent of the amount of data to be processed by the network. Furthermore, only a few computing steps have to be performed serially, so computation time can be extremely short. This work concentrates on the benefits of this parallel processing. One of the most challenging tasks in the hardware realisation of neural nets is the inner product operation. Since it consumes too much chip area in digital circuitry, fully parallel digital architectures do not exist for large NNs. If high precision is not required, the compact and high speed analog approach has a great advantage. With analog techniques, low cost, low power dissipation, single chip architectures of complex neural networks are possible. Although such systems are commercially available [4], offering processing times as low as several microseconds for input vectors of up to 128 dimensions, it is almost impossible to find a solution for application domains demanding processing delays of tens of nanoseconds for similarly large input vectors. The integrated circuit presented here is intended to provide the high computing performance needed for such applications.
System description
The implemented NN architecture is shown in figure 1. It is a fully interconnected feedforward structure with 70 analog inputs, 6 hidden layer neurons and one output neuron. The neurons are of the inner product type and have a sigmoid-like activation function. The neural signal processing is fully analog, yielding high speed operation and compact circuitry for the inner product operation. The synaptic weights are stored digitally in static RAM (SRAM) cells, to enable simple programming even from a personal computer. Digital weight storage also eliminates weight decay and increases reproducibility. The SRAM cells are located next to each synapse circuit to minimise wiring for communication.
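The signal flow of the 70-6-1 structure described above can be sketched in software as follows. This is only an illustrative model, not the chip's transfer function: the use of tanh as the sigmoid-like activation, the `gain` parameter and the function names are assumptions introduced for the example.

```python
import math

N_INPUTS, N_HIDDEN = 70, 6  # layer sizes taken from the text


def inner_product(weights, inputs):
    """Weighted sum computed by one inner product type neuron."""
    return sum(w * x for w, x in zip(weights, inputs))


def forward(inputs, hidden_weights, output_weights, gain=1.0):
    """Forward pass of a 70-6-1 feedforward network.

    tanh is an assumed stand-in for the sigmoid-like activation;
    `gain` is a hypothetical knob mimicking the programmable gain.
    """
    if len(inputs) != N_INPUTS or len(hidden_weights) != N_HIDDEN:
        raise ValueError("expected 70 inputs and 6 hidden weight rows")
    hidden = [math.tanh(gain * inner_product(row, inputs))
              for row in hidden_weights]
    return math.tanh(gain * inner_product(output_weights, hidden))
```

In hardware all 70x6 products are formed at the same time, whereas this sketch evaluates them one after another; only the arithmetic is the same.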
Downloading the approximately 3.5 Kbit of synaptic weight and configuration information is relatively slow compared to the normal operating speed of the NN circuit, and takes a few milliseconds. The chip block diagram is shown in figure 2.
The largest area is occupied by the 70x6 synapse array, including the 70x6 differential voltage to current converters as synapses and the 70x6x5 SRAM cells for weight storage. Each 5 bit synaptic weight can be selected, read and written by the row and column decoders and the read/write circuitry. A programmable voltage source array is located next to the synapse array, which enables programmable biasing and control of the gain of the neural activation functions. Figure 3 shows the analog circuitry along the signed signal path of figure 1. The processing delay of the NN pattern classifier is merely the delay introduced by this circuitry, since the rest of the signal paths are parallel. The synapse circuit is a differential pair formed by T1 and T2, with a single ended voltage input and a reference voltage, which is equal for all the synapses in the NN. The outputs of the synapses are differential currents, which are summed on the (differential) summing node of the corresponding neuron. The variable synaptic weight is achieved by a programmable current source for the differential pair. The current source transistors T7, T8, T9, T10 are sized to deliver currents that are ascending powers of 2 times the smallest, or unity, current. Any combination of these currents can be obtained by using the switch transistors T3, T4, T5, T6. The sign of the synapse can be varied by interchanging Vin and Vref, using an 8 transistor switch, which is not shown in figure 3. The simulated and measured synapse characteristics are shown in figure 4a and figure 4b. The sum of the synaptic currents is transformed to a voltage by the load transistors T12 and T13. T11 controls the differential load. The saturating, sigmoid-like activation function, shown in figure 5, is obtained from the saturating characteristic of the second layer synapse, rather than from a separate nonlinear circuit.
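The weight programming scheme can be illustrated with a small behavioural model. The sketch below assumes that the 5 bit weight consists of one sign bit plus four magnitude bits driving the switches of the binary-weighted current sources; this split is an interpretation of the text. The tanh transfer curve, `UNIT_CURRENT`, `v_scale` and the rounding rule in `quantize_weight` are likewise assumptions for illustration only.

```python
import math

UNIT_CURRENT = 1.0  # smallest ("unity") source current, arbitrary units (assumed)


def tail_current(magnitude_bits):
    """Total current of the sources enabled by the four switch bits.

    magnitude_bits[k] enables a source of UNIT_CURRENT * 2**k.
    """
    if len(magnitude_bits) != 4:
        raise ValueError("expected four magnitude bits")
    return UNIT_CURRENT * sum(bit << k for k, bit in enumerate(magnitude_bits))


def quantize_weight(w, w_max=1.0):
    """Round a real weight in [-w_max, w_max] to (sign, four magnitude bits)."""
    code = min(15, round(abs(w) / w_max * 15))
    sign = -1 if w < 0 else 1
    return sign, [(code >> k) & 1 for k in range(4)]


def synapse_output(v_in, v_ref, sign, magnitude_bits, v_scale=1.0):
    """Differential output current of one synapse.

    Interchanging Vin and Vref negates the input difference; since tanh is
    odd, this equals multiplying the output by sign = -1.
    """
    return sign * tail_current(magnitude_bits) * math.tanh((v_in - v_ref) / v_scale)
```

For example, `quantize_weight(-1.0)` selects all four current sources with negative sign, so the synapse saturates at minus fifteen unit currents for large positive input differences.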
Circuit operation

This method simplifies the circuitry and increases speed. The second layer synaptic weight is set by the number of active synapse stages connected in parallel, each formed by T20, T21, T22, T23. A stage is activated by its switch transistor T20. Every switch transistor of the circuit is wired to a separate SRAM cell. There are altogether 3750 SRAM cells on the chip. The current summing node has an intrinsically large parasitic capacitance, since all the synapse outputs and the common load are connected to this node. To increase the speed of the circuit, the node impedance has to be kept low. The consequence of a low node impedance is a small voltage swing on the summing node. A voltage amplifier stage scales this voltage properly for the second layer synapse stage.
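The trade-off between speed and voltage swing on the summing node can be made concrete with a first-order RC model. This model is an assumption made for illustration and is not taken from the measurements; the function names and the component values in the usage note are hypothetical.

```python
def summing_node(load_resistance, node_capacitance, signal_current):
    """Return (time_constant, voltage_swing) of a first-order RC node model.

    time_constant = R * C  : a lower node impedance gives a faster node
    voltage_swing = I * R  : but also a smaller signal voltage
    """
    tau = load_resistance * node_capacitance
    swing = signal_current * load_resistance
    return tau, swing


def required_gain(swing, target_swing):
    """Voltage gain the following amplifier stage needs to restore the swing."""
    return target_swing / swing
```

With hypothetical values of 1 kOhm, 1 pF and 100 uA, the model gives a time constant of about 1 ns and a swing of about 0.1 V. Halving the resistance halves both, which is why an amplifier stage is needed after the node.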
Application in high-energy physics research
The feasibility of a single chip NN photon trigger for used a database containing "TGT calorimeter preshower" [3] for more detailed discussion.
Figure 6. Histogram of analog classifier responses (test set, optimal). Horizontal axis: analog network output (arbitrary unit).
Figure 6 shows the histogram of analog classifier outputs over the test set. Although there is an overlap between the two classes, photon and pion data are clearly separated. Data above a certain network output are classified as "photon", data below it as "pion". We call this value of the network output the "decision threshold". Figure 7 shows the classification efficiency as a function of the decision threshold. For example, if we choose the decision threshold where the percentage of misclassified photon patterns equals the percentage of misclassified pion patterns, 96% of the data is correctly classified for both classes. When the decision threshold is decreased so that 99% of photon data is correctly classified, 15% of pion data is misclassified. Figure 6 and figure 7 show the performance for ideal hardware. Simulations have been made to examine the non-ideal effects introduced by our analog NN hardware. Noise, synapse non-linearity, weight discretization and the effect of the sigmoid-like shape of the activation function have been taken into account. Figure 8 and figure 9 show the results. The overlap between the two classes increases compared to the ideal case. In contrast to the 96% correctly classified data at the crossing of the curves in figure 7, we get 93% with our hardware. The 75% increase in misclassified patterns (from 4% to 7%) is mainly due to the simple discretization technique applied. We expect even better performance with careful weight discretization. The measured transient response is plotted in figure 10. Both the stimulus and the NN response are shown in the figure.
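The threshold analysis can be reproduced on any set of classifier outputs with a few lines of code. The following is a minimal sketch, assuming that outputs above the threshold are labelled "photon"; the function names and the candidate threshold grid are hypothetical, and no data from the test set is included.

```python
def efficiencies(photon_outputs, pion_outputs, threshold):
    """Fraction of correctly classified patterns for each class.

    An output above the threshold is labelled "photon", otherwise "pion".
    """
    photon_ok = sum(1 for y in photon_outputs if y > threshold) / len(photon_outputs)
    pion_ok = sum(1 for y in pion_outputs if y <= threshold) / len(pion_outputs)
    return photon_ok, pion_ok


def equal_error_threshold(photon_outputs, pion_outputs, candidates):
    """Candidate threshold at which the two efficiencies are closest."""
    def gap(t):
        photon_ok, pion_ok = efficiencies(photon_outputs, pion_outputs, t)
        return abs(photon_ok - pion_ok)
    return min(candidates, key=gap)


def relative_increase(old_error, new_error):
    """Relative growth of the misclassification rate."""
    return (new_error - old_error) / old_error
```

Applied to the figures quoted in the text, `relative_increase(4.0, 7.0)` gives 0.75: going from 96% to 93% correct raises the misclassified fraction from 4% to 7%, which is the 75% increase mentioned above.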
Table 2 compares the parameters of available high speed integrated circuits for pattern classification. The comparison is not comprehensive, because only those parameters are shown which were crucial for the presented high-energy physics application. Although the two Intel chips implement more neurons [4], [7], it was shown that more hidden layer neurons do not result in better classification performance for the presented application and for the similar (HEP) application described in [1].
Conclusions
Although the chip does not take advantage of a state-of-the-art technology [5], [6]
