Abstract: The architecture, implementation, and applications of a special purpose neural network processor are described. The chip performs over 2000 multiplications and additions simultaneously. Its datapath is particularly suitable for the convolutional topologies that are typical in classification networks, but can also be configured for fully connected or feedback topologies. Resources can be multiplexed to permit implementation of networks with several hundreds of thousands of connections on a single chip. Computations are performed with 6 bits of accuracy for the weights and 3 bits for the neuron states. Analog processing is used internally for reduced power dissipation and higher density, but all input/output is digital to simplify system integration. The practicality of the chip is demonstrated with an implementation of a neural network for optical character recognition. This network contains over 130,000 connections and is evaluated in 1 ms.
I. Introduction
Neural networks have demonstrated their capabilities in numerous applications, including pattern classification, speech recognition, and control [1], [2], [3], [4], [5], [6], [7], [8]. However, the computational requirements, data rates, and size of neural network classifiers severely limit the throughput that can be obtained with networks implemented on sequential general purpose computers. Better performance is achieved with special purpose VLSI processors that employ parallel processing to increase the throughput.
Speed and data rates are not the only challenges faced by specialized hardware designs for neural networks. Because of the rapid progress of neural network algorithms, processors must be flexible enough to accommodate a wide variety of neural network topologies. Moreover, the size of networks under investigation is increasing as progressively more difficult tasks are solved with neural networks. Networks with several tens or hundreds of thousands of connections are typical for high-accuracy pattern classifiers, and this number is expected to grow further. To be economical, such networks must be implemented on a small number of chips. Moreover, the high-performance parallel-computing unit must be matched with an equally powerful interface to avoid bottlenecks.
Meeting all these requirements calls for a tradeoff. On one side, the hardware should be as general as possible to support a wide range of applications. At the same time, efficiency dictates a design that carefully matches neural network characteristics. In Section II, hardware requirements of neural networks are examined, with special emphasis on classification applications. Issues considered include the regularity of neural network architectures, the feasibility of sharing common hardware for multiple functions, and the effect of limited accuracy computations. The architecture of the neural network chip is described in Section III, and implementation aspects of the functional blocks are presented. A mixed analog and digital design has been chosen to exploit the low computational accuracy typically required by neural network algorithms, and at the same time address system integration issues, which call for a digital interface.
Manuscript received May 5, 1991; revised August 16, 1991. This work was supported in part by the U.S. Army SDC. The authors are with AT&T Bell Laboratories, Holmdel, NJ 07733. IEEE Log Number 9103700.
Experimental results and performance figures are summarized in Section IV. The flexibility and speed are documented with a variety of sample neural network architectures and the corresponding update rate that is achieved with the chip.
The practicality of the design is demonstrated in Section V with results from an implementation of a neural network for handwritten digit recognition with over 133,000 connections. The network fits on a single chip and is evaluated at a rate in excess of 1000 characters per second, which constitutes a speedup of two orders of magnitude over a DSP-based implementation. Despite the low resolution of the chip, the error rates of the neural network processor and DSP implementation are very similar at 5.3% and 4.9%, respectively. For comparison, the measured human performance on the same database is 2.5% errors.
II. Neural Network Hardware Requirements
Artificial neural networks that solve difficult problems in areas such as speech recognition and synthesis, or pattern classification, consist of thousands of neurons with tens or hundreds of inputs each. Every neuron computes a weighted sum of its inputs and applies a nonlinear function to the result. Architectural parameters, such as the number of inputs per neuron and each neuron's connectivity, vary considerably within a network and from application to application. A special purpose neural network processor must be flexible and powerful enough to accommodate a wide range of applications. At the same time, the requirements must be carefully balanced and the special nature of the task exploited to bring an efficient implementation within reach of today's technology. In this section, the characteristics and requirements of neural networks are examined and possible VLSI implementations investigated.
Two phases of operation are distinguished in many neural network applications. During the learning phase, the topology and weights of the network are determined from a labeled set of examples using a rule such as backpropagation [1], or a network growing algorithm [9]. In the subsequent retrieval or classification phase, the network parameters are fixed. Patterns are now recognized by the network based on information stored in the architecture and weights during training. Since the computational and infrastructure requirements (e.g., the training database) during the learning phase are considerably more complex than those for classification, efficiency considerations call for separate hardware for learning and retrieval. Network parameters determined during learning are downloaded into processors that are specialized for the classification task. This paper focuses on the latter task. This approach contrasts with implementations of neural network processors with on-chip learning [10], [11]. Those circuits are not suitable for the pattern recognition problems investigated here, either because of limitations of the training algorithms implemented on these chips or because of the limited size of the network that can be trained.
The basic operation performed by a neuron during classification is a weighted sum, followed by a nonlinear squashing function $f$, typically a hyperbolic tangent or an approximation thereof:

$$y = f\left(\sum_i w_i x_i + b\right)$$

The inputs $x_i$ of the neuron are usually referred to as connections, and the $w_i$ as weights. Each input is either tied to the output $y$ of another neuron, or to an external input. Optionally, a bias $b$ is added to the weighted sum.
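As a minimal sketch of the neuron operation described above, the following Python function evaluates the weighted sum with an optional bias and applies a hyperbolic tangent. The function name `neuron_output` and the use of floating point arithmetic are illustrative assumptions, not part of the design described in this paper.

```python
import math


def neuron_output(inputs, weights, bias=0.0, squash=math.tanh):
    """Weighted sum of the inputs plus an optional bias, passed through a squashing function.

    `inputs` and `weights` are equal-length sequences of numbers.
    `squash` defaults to tanh; any approximation of it can be substituted.
    """
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return squash(weighted_sum)
```

Each term of the sum corresponds to one connection, so a neuron with N inputs costs N multiplications and N additions per evaluation.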
The total number of connections in neural networks for applications such as handwritten character recognition is several tens or hundreds of thousands [12]. Networks that solve more general problems, for example recognition of entire words instead of isolated characters, require even larger numbers of connections. The speed requirements of typical applications call for a few tens to several thousands of classifications per second. For each classification, one multiplication and one addition must be evaluated for every connection. This translates into up to a few billion multiply-add operations per second. Such computational power can be achieved only with a parallel implementation, where several connections are evaluated concurrently.
The most general network topology permits connections between any two neurons. Such a high degree of (possible) connectivity, combined with the need for parallel processing, results in enormous hardware requirements and therefore calls for a compromise. Usually, the neurons in a network are arranged in layers. Each layer receives inputs only from neurons in the previous layer. Layers may be fully connected, that is, each neuron may be connected to every neuron in the preceding layer. Often, however, local connectivity is used in order to express knowledge about the problem (e.g., geometric relations, such as neighborhood of pixels in an image) in the network architecture and thus improve the recognition performance [6]. For example, the fact that some pixels in an image are adjacent to each other can be built into the network architecture by constraining neurons to receive inputs only from neighboring pixels. In a fully connected topology, such information must be derived from the training set during the learning phase, usually with only partial success.
A neural network processor could be designed to implement only networks with fully connected topology. Local connectivity would then be realized by simply setting the weights of unused connections to zero. Since, in typical neural networks, the ratio of such unused connections to actual connections is easily one hundred, such an implementation is unacceptably inefficient. Support for local connectivity adds circuit complexity, but overall a tremendous saving in chip area is realized because of the reduced number of connections, more than compensating for the added complexity.
Another challenge for a compact hardware implementation of a classifier is the amount of memory that is needed for storing several tens or hundreds of thousands of weights. Fortunately, the weights of many neurons in important connection topologies, including time-delay or feature extraction neural networks [4], [6], [7], are identical. In those architectures, the connection topology corresponds to a one or higher dimensional convolution, followed by the nonlinear squashing function. Such a structure can be realized with a single neuron that is time-multiplexed, with a corresponding saving of storage and computing devices.
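The weight sharing described above can be sketched as a one-dimensional convolution in which a single set of weights is reused at every input position. The function name `shared_weight_layer`, the `stride` parameter, and the choice of tanh as squashing function are assumptions made for illustration only.

```python
import math


def shared_weight_layer(signal, kernel, bias=0.0, stride=1):
    """Evaluate one time-multiplexed neuron at successive positions of `signal`.

    The same `kernel` weights are reused at every position, so only
    len(kernel) weights are stored regardless of the number of outputs.
    """
    width = len(kernel)
    outputs = []
    for start in range(0, len(signal) - width + 1, stride):
        window = signal[start:start + width]
        weighted_sum = sum(w * x for w, x in zip(kernel, window)) + bias
        outputs.append(math.tanh(weighted_sum))
    return outputs
```

For a 10-sample input and a 3-weight kernel, this produces 8 outputs, that is 24 connections, while storing only 3 weights.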
Further optimization of the hardware complexity is possible by matching the computational accuracy of the processor to the requirements of typical neural networks. Both experience and theory [13] indicate that neural network classifiers can be designed to be insensitive to low resolution arithmetic. Experiments with character recognizers show that the recognition performance remains virtually unchanged when the inputs and outputs of the neurons are quantized to 3 bits, and the weights to approximately 5 bits. Higher resolution is required in the last layer for the rejection of ambiguous or unclassifiable patterns. Since, in typical neural networks, the output layer contains only a small fraction of the total number of connections, system complexity can be reduced by evaluating those connections on a different processor with higher accuracy.
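To illustrate what low resolution arithmetic means here, the sketch below rounds a real value to a signed magnitude code with one sign bit and the remaining bits for the magnitude. The uniform level spacing, the saturation at full scale, and the names `quantize_sign_magnitude` and `dequantize` are assumptions for illustration; the text does not specify the exact mapping used in the experiments.

```python
def quantize_sign_magnitude(value, bits, full_scale=1.0):
    """Round `value` to the nearest level of a signed magnitude code.

    One bit holds the sign, so the magnitude ranges over 0 .. 2**(bits-1) - 1
    (0..3 for 3 bits, 0..31 for 6 bits). Values beyond `full_scale` saturate.
    Returns the signed integer code.
    """
    max_code = (1 << (bits - 1)) - 1
    code = round(abs(value) / full_scale * max_code)
    code = min(code, max_code)  # saturate at the largest magnitude
    return -code if value < 0 else code


def dequantize(code, bits, full_scale=1.0):
    """Map a signed integer code back to the real value it represents."""
    max_code = (1 << (bits - 1)) - 1
    return code * full_scale / max_code
```

With these assumptions, a 3-bit neuron state takes one of the seven values -1, -2/3, ..., 2/3, 1 when `full_scale` is 1.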
III. Architecture and Implementation
A. Overview
Figure 1 shows the block diagram of the neural network processor. Data enters the chip through a shifter [14]. This interface is well suited to the convolutional type network topologies discussed in Section II. The matrix multiplier computes the dot products of input and weight vectors. The weights are programmable and stored locally in the multiplier. The nonlinear squashing function is evaluated by the neuron bodies.
The chip uses analog computation internally to capitalize on the low accuracy requirements of typical neural network algorithms. All input and output is digital, however, to simplify system integration. The overhead for D/A and A/D conversion is negligible since both functions are combined with the neural computation. The D/A converters serve simultaneously as multipliers, and the A/D converters in the neuron bodies evaluate the nonlinear squashing function. Neuron inputs and outputs are quantized to 3 bits, and weights are represented as 6-bit quantities; both use signed magnitude format. All circuits are designed such that this accuracy is maintained across different chips and fabrication runs. This is particularly important in this design since learning is performed off-line and trimming individual chips is therefore not practical.
The shifter is 64 words (3 bits each) wide and reads up to four inputs in each cycle. The use of a shifter limits the number of pins required, and provides direct support for convolutional type network topologies and multiplexing of neurons with identical weights. This is achieved by scanning the input serially past the matrix multiplier. Data loaded into the chip can be buffered temporarily in a file of 16 vector registers to reduce the required input bandwidth, and to evaluate neurons with more than 64 inputs.
The matrix multiplier consists of 4096 cells that perform a four-quadrant multiplication of a digital neuron input and a locally stored analog weight using a multiplying D/A converter. The contributions from individual multipliers are accumulated in a current summing wire. The aspect ratio of the matrix multiplier is programmable to support neurons with 16 to 256 inputs. This configuration can be changed without overhead to permit efficient evaluation of multiple network layers with different topologies on a single chip.
The neuron bodies first scale the outputs from the matrix multiplier and then convert them to 3-bit digital values. The implementations of the matrix multiplier and neuron bodies are discussed in more detail below.
B. Matrix Multiplier
The matrix multiplier consists of 4096 individual multiplier cells that are arranged in eight blocks of eight vectors, each vector containing 64 multipliers and a variable bias (Figure 2). The input to each block is held in a latch that is loaded from the shifter. The multiplexer combines the outputs of one to four vector multipliers and routes the sums to the neuron bodies. This arrangement supports neurons with 64, 128, or 256 inputs.
Figure 3 shows the diagram of a multiplier cell with the local weight storage. Transistors M1 to M3 and switches S1 to S4 constitute a multiplying D/A converter (MDAC) that computes the product of the magnitude of the digital input (X0, X1) and the weight. The magnitude of the weight is stored as a charge packet on capacitor C and the gates of the MDAC current sources [15], [16]. Proper operation requires that the gate capacitances of M1 to M3 be constant. This is ensured by dumping the current into VSS when it is not needed at the output. The sign of the product is computed digitally by an XOR gate that controls switches S5 and S6. These connect the MDAC output either directly to YSUM, or through the mirror M4, M5. The summing wire YSUM accumulates the contributions from all neuron inputs. It must be held at a constant potential of approximately 1 V by the neuron bodies to ensure proper biasing of the MDAC and current mirror outputs.
Despite increased area, a local current mirror in each multiplier cell has been preferred over a solution where positive and negative contributions from individual cells are summed on separate wires and the difference taken in the neuron body (cf. [17]). This latter solution is not viable for neurons with a large number of inputs because it is difficult to suppress quickly and accurately the large common mode current in the two summing wires.
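The arithmetic performed by a multiplier cell and the summing wire, as described above, can be modeled behaviorally: the magnitudes are multiplied, the sign is the XOR of the two sign bits, and the signed contributions are added. This is a sketch in integer units, not a circuit model; the function names, the encoding of the sign bit as 0 for positive and 1 for negative, and the 5-bit weight magnitude (inferred from the 6-bit signed magnitude weight format) are assumptions.

```python
def multiplier_cell(x_sign, x_mag, w_sign, w_mag):
    """Four-quadrant product of a signed magnitude input and weight.

    x_sign, w_sign: 0 for positive, 1 for negative (assumed encoding).
    x_mag: input magnitude, 0..3 (two bits, X0 and X1).
    w_mag: weight magnitude, 0..31 (assumed five bits).
    """
    product_magnitude = x_mag * w_mag
    negative = x_sign ^ w_sign  # XOR of the sign bits selects the polarity
    return -product_magnitude if negative else product_magnitude


def summing_wire(inputs, weights, bias=0):
    """Accumulate the signed contributions of all cells in one vector.

    `inputs` and `weights` are sequences of (sign, magnitude) pairs.
    """
    return bias + sum(
        multiplier_cell(xs, xm, ws, wm)
        for (xs, xm), (ws, wm) in zip(inputs, weights)
    )
```

Because each cell delivers an already signed contribution, a single accumulator suffices, which mirrors the choice of one summing wire per vector rather than separate positive and negative wires.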
The capacitive storage of the magnitude of the weight must be refreshed periodically to compensate for charge leakage. This is accomplished by the programming circuit shown in Figure 4. The desired current is generated by a D/A converter and forced into the load transistor M_ref to produce the correct programming voltage, which is then copied onto the storage capacitor of the cell. This arrangement ensures a wide programming range from 0.8 V to 4 V and consequently good accuracy. Simulations and measurements show that the error resulting from mismatch between M_ref and the current sources in the MDAC is negligible compared to the quantization error of the neural network chip. Two refresh D/A converters are integrated on the chip to achieve a 110 µs update cycle for all weights.
The refresh D/A converters and the A/D converters in the neuron bodies share a common on-chip reference current source. Since chip inputs and outputs are digital, the arithmetic results are independent of the reference, which is designed to maximize the dynamic range of internal analog signals independent of process parameter variations. In particular, the voltage range of the weight storage capacitors is chosen to be as large as possible to minimize errors due to charge leakage. The nominal full-scale current per MDAC is approximately 50 µA.
C. Neuron Bodies
The neuron body circuit, illustrated in Figure 5, scales the current output from the matrix multiplier and converts it to a 3-bit digital representation. The nonlinear squashing function is a side effect of the overload characteristics of the converter.
A successive approximation converter has been chosen over a flash converter to avoid the need to create several copies of the signed output current from the vector multiplier. A programmable scaler in the reference current generator permits the slope of the nonlinear function to be adjusted in eight steps from 1/16 to 1/2. Depending on the setting, the full-scale output of the neuron is generated when at least two to sixteen multiplier cells produce their maximum output current.
The comparator at the input of the A/D converter senses the sign of the sum of the current from the vector multiplier and the D/A reference current. The design of the multiplier cells mandates that the output is terminated into a low impedance voltage source of about 1 V. Conversely, a high impedance load is preferred in order to reduce the sensitivity requirements on the comparator. These conflicting requirements are reconciled in the circuit shown in Figure 6, where the requirements are met individually in two separate clock phases. The vector multiplier is represented as a variable current source with a parasitic capacitance and resistance. Simulations indicate that these parasitics vary between 3 pF and 40 pF, with a resistive component that can be as low as 1.8 kΩ, depending on the number of multiplier cells connected to the vector.
The current comparator operates as follows. During clock phase 1, the vector multiplier output is connected to a voltage source V_Bias of about 1 V to minimize the load impedance. The load then consists only of the sum of the impedance of the voltage source and the switch resistance. During phase 2, the comparison phase, a high impedance load is desired. This is achieved by disconnecting the comparator input from V_Bias. Now, the parasitic impedance acts as a load. Since its time constant is usually larger than the duration of the clock phase (25 ns), this impedance is mostly capacitive. The current charges the parasitic capacitance and causes a voltage difference to develop at the input of the comparator circuit, which is detected with a high gain amplifier followed by a latch.
The complete schematic of the comparator is shown in Figure 7. It consists of a differential amplifier with symmetric load, followed by two inverters that amplify the signal to logic levels. In the first clock phase, all switches are closed and the input is connected to V_Bias. In this phase, the current flowing from the differential stage into inverter I1 is close to zero, since M1 and M2 have equal gate voltages, and the drain-to-source voltages are forced to be equal by the cascode transistors M3 and M4. These transistors together with the cascode current mirror ensure a low comparator offset. In the second clock phase, all switches are opened and the comparator detects the sign of the voltage difference that builds up at its input. The diode at the input clamps the voltage to a maximum of about 1.7 V to speed up discharge during the first phase.
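The way a successive approximation conversion with overload produces a squashing function, as described at the start of this subsection, can be sketched behaviorally. The sign is decided first, then the two magnitude bits are resolved from the most significant bit down; any input beyond full scale simply yields the largest code. The function name, the normalization to the full-scale current of one multiplier cell, and the placement of the decision thresholds halfway between codes are assumptions, not details taken from the circuit.

```python
def sar_neuron_body(current, gain, cell_full_scale=1.0, magnitude_bits=2):
    """Convert a signed summing-wire current to a signed magnitude code.

    `gain` is the programmable slope (1/16 to 1/2 in the text): with gain g,
    1/g cells at full output drive the converter to full scale.
    Returns an integer in -3..3 for the default two magnitude bits.
    """
    max_code = (1 << magnitude_bits) - 1
    lsb = 1.0 / max_code
    magnitude = abs(current) * gain / cell_full_scale  # 1.0 means full scale

    code = 0
    for bit in reversed(range(magnitude_bits)):
        trial = code | (1 << bit)
        # Keep the bit if the input reaches the (assumed) mid-step threshold.
        if magnitude >= (trial - 0.5) * lsb:
            code = trial
    # Inputs above full scale leave all bits set: this overload is the squashing.
    return -code if current < 0 else code
```

For example, with a gain of 1/2, two cells at full output (a current of 2.0 in these units) already produce the maximum code 3, and larger currents produce the same code.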
IV. Experimental Results
The neural network chip has been fabricated in a single poly, double metal 0.9 µm CMOS technology with 5 V power supply. The die measures 4.5 × 7 mm² (Figure 8) and is packaged in a 96 pin grid array. The matrix multiplier (center) and the shifter and register file are pitch-matched in order to avoid the need for extra routing channels. The typical operating current is less than 100 mA, but reaches 250 mA when all weights are programmed to their maximum value. The chip contains over 180,000 transistors. Despite the lack of redundancy, the yield is high since many devices are larger than minimal size, and since the circuits are insensitive to parameter variations.
Testing is performed in several steps. The overflow of the shifter is connected to pins and used to check the shifter and vector register file for correct operation. A test pin provides access to the summing wire of one of the vector multipliers. This feature has been used to functionally test the multiplier cells, neuron bodies, and refresh D/A converters. Finally, statistical techniques are used to determine the overall computation error of the chip, which comprises the nonidealities in the multiplier cells, the neuron bodies, and the weight storage and refresh circuits. This error cannot be measured by simply comparing the chip output with a perfectly accurate simulation, since the quantization errors from the ADCs in the actual and simulated neuron body are correlated. Instead, the quantized output of the chip is compared to the simulation result before quantization. Ideally, this signal contains only the quantization error with approximately uniform distribution between plus and minus half the quantizer step size, as is indicated by the simulation result shown in Figure 9. The measured distribution is wider due to inaccuracies in [...]
The chip executes three instructions, CALC, SHIFT, and STORE, to perform computations, load data from an external data source, and transfer data between the shifter, register file, and vector multiplier banks. The CALC instruction takes four cycles of 50 ns; the other two operations execute in a single clock cycle concurrent with an ongoing CALC instruction. In 200 ns the chip can, for example, load eight states and store them in a register and two latches, and evaluate the dot product and nonlinear function of eight vectors with 256 components each. Programming is supported by an assembler and a code generator. Table I summarizes the features of the chip.
Programmability is one of the key features of the neural network chip. Table II lists a selection of network topologies that can be implemented and the achieved performance in each case. The chip processes networks with full or sparse connection patterns of selectable size, as well as networks with feedback, at a sustained rate of over 10^9 connection updates per second. An external feedback path must be provided for multi-layer and Hopfield neural networks.
V. High-Speed Character Recognition Application
Speed, capacity, and programmability are important aspects of neural network hardware. Their practical relevance, however, must be proven on a real world application. In this section, the implementation of an optical digit recognizer [6], [18] on the neural network chip is described. This network has been trained with the backpropagation algorithm [1] on a workstation with floating point arithmetic to recognize handwritten digits from a 20 by 20 pixel image. The classification error rate on a test set of 2000 handwritten digits is 4.9%, compared to a human error rate of 2.5% on the same data.
Figure 10 illustrates the architecture of the network. The more than 3500 neurons with a total of over 133,000 connections are arranged in five layers. The first four layers employ a two-dimensional convolutional type topology with various kernel sizes and subsampling factors. The last layer is fully connected. This topology has been chosen to maximize recognition performance and classification speed of an implementation on a floating point digital signal processor (DSP-32C [19]).
Special steps are necessary to adapt the network to the low resolution of the chip. Simple quantization of all weight values from floating point accuracy to the 6-bit signed magnitude representation of the chip results in an unacceptable loss of accuracy. However, experiments reveal that the computational accuracy provided by the chip is adequate for all but the 3000 weights in the last layer of the network. This last layer is retrained with quantized data obtained from the chip to eliminate performance degradation. After retraining, the classification error rate on the test set is 5.3%, compared to the original 4.9%. This result is obtained consistently with different chips for which the last layer has not been retrained individually. Figure 11 shows the input, output, and internal states of the neural network for a sample input that has been processed by the neural network chip.
The first four layers of the network, with 97% of the connections, fit on a single neural network chip. The remaining 3000 connections of the last layer are evaluated on a DSP32C digital signal processor. The throughput of the chip is more than 1000 characters per second or 130 MC/sec. This figure is considerably lower than the peak performance of the chip (5 GC/sec.), a consequence of the small number of inputs of most neurons in the network, for which the chip cannot fully exploit its parallelism. Nevertheless, the chip's performance compares favorably to the 20 characters per second that are achieved when the entire network is evaluated on the DSP32C. The recognition rate of the chip is far higher than the throughput of the preprocessor, which relies on conventional hardware. Improvements of both the recognition rate and accuracy can be expected when the network architecture is tuned to take full advantage of the power of the neural network chip.
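The quoted throughput of about 130 MC/sec can be checked with a short calculation from the figures given in this section. The sketch below uses the stated lower bounds (133,000 connections, 97% of them on the chip, 1000 characters per second), so the result is itself a lower bound; the function name is an assumption for illustration.

```python
def on_chip_throughput(total_connections, on_chip_fraction, chars_per_second):
    """Connection updates per second evaluated on the chip."""
    on_chip_connections = total_connections * on_chip_fraction
    return on_chip_connections * chars_per_second


# Figures taken from the text; all three are stated as approximate or lower bounds.
rate = on_chip_throughput(133_000, 0.97, 1000)
print(f"{rate / 1e6:.0f} million connection updates per second")
```

About 129,000 connections per character at 1000 characters per second gives roughly 129 million connection updates per second, consistent with the rounded figure of 130 MC/sec.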
VI. Conclusions
Neural networks are attractive for pattern classification applications but suffer in practice from the limited speed that can be achieved with implementations based on classical sequential processors. This problem can be overcome with highly parallel special purpose VLSI circuits. While a fully parallel implementation of sufficiently large networks is currently not feasible, adequately high performance can be achieved with an architecture that exploits the limited connectivity and weight sharing that are typical for pattern classifiers. This has been demonstrated with a neural network classifier with over 133,000 connections that has been implemented on a single neural network chip performing more than 1000 classifications per second. The availability of fast special purpose hardware for large applications permits exploration of new neural network algorithms and problems of a scale that would not be feasible with conventional processors.
