Abstract: An analog versatile neuroimage processor (VNIP) architecture is proposed. The VNIP can process various types of neural network and image processing structures without any hardware modification. The structure allows unlimited expansion of network size and the compensation of process variation. A proof-of-concept chip is implemented using a combination of continuous-time multiplier and switched-capacitor techniques. The throughput is $12 \times 10^{6}$ synapses/s·mm$^2$ and the energy consumption is $10^{-9}$ J/synapse. A test chip was fabricated using a 1.2-μm double-poly CMOS process and tested, verifying the flexibility and expandability of the architecture.
I. INTRODUCTION
NEURAL networks are structural approaches to solving complex problems that are difficult to solve analytically. The paradigm of a neural network is often inspired by a biological system. Neural networks and image processing have one aspect in common: both require massively parallel computation of matrix multiplications. Although recent high-performance digital signal processors have sped up the computation and have achieved, in practice, almost real-time processing, their high cost, large physical size, and power consumption often limit the use of such systems in many everyday applications and, in particular, in portable applications.
Several analog implementations of parallel systems have been reported [1]. An analog implementation has an advantage over a digital system in physical size and power consumption. Unfortunately, the application-specific structure of many analog implementations limits the application range of the developed analog system. Several analog general-purpose processors are reported in [2]-[7]. However, they still have a limited application range and network size. In practice, most applications require a network so large that it cannot be implemented as a fully parallel structure on a single chip, even with an analog implementation. The modular approach [8]-[12], which implements a large system as an array of small subsystems, is limited by the high cost of reliable analog interchip communication. Without a significant breakthrough in fabrication technology, building a practical fully parallel analog system is very difficult and costly. Therefore, an analog versatile neuroimage processor (VNIP) is proposed. It should be designed to be:
• expandable without requiring high-cost or complex interchip connections;
• capable of process variation compensation;
• flexible enough to implement a number of paradigm architectures.
It is an analog pseudo-parallel system with maximum flexibility, expandability, and computation efficiency. The proposed VNIP allows correction of process variation as well.
The structures of neuroimage processing networks are categorized in Section II. In Section III, the VNIP architecture is proposed. Section IV presents several application examples. The circuits and chip-test results are provided in Sections V and VI, respectively. Section VII provides the conclusions.
II. NEUROIMAGE PROCESSING NETWORKS
The major computing element of a neural network is called a neuron [1]. It performs a sum-of-product computation and a nonlinear mapping expressed as
$$y = f\left(\sum_{j} w_j x_j\right) \qquad (1)$$
where $y$ is the output of a neuron, $x_j$ is the input, $w_j$ is the weight associated with $x_j$, and $f$ is an application-dependent nonlinear function. The structure of a network is determined by the nature of the problem to be solved.
The networks can be categorized by the dimension of the data. The one-dimensional (1-D) network is expressed as
$$y_i = f\left(\sum_{j} w_{ij} x_j\right) \qquad (2)$$
where $y_i$ is the output of the $i$th neuron and $w_{ij}$ is the weight from the input $x_j$ to the $i$th neuron, as shown in Fig. 1. The two-dimensional (2-D) network is expressed as
$$y_{ij} = f\left(\sum_{m}\sum_{n} w_{ij,mn}\, x_{mn}\right) \qquad (3)$$
where $y_{ij}$ is the output of the neuron $(i, j)$, located at the $i$th row and the $j$th column, $x_{mn}$ is the input located at the $m$th row and the $n$th column, and $w_{ij,mn}$ is the weight from the input $x_{mn}$ to the neuron $(i, j)$. In many applications, the weight is shift invariant and (3) can be expressed as
$$y_{ij} = f\left(\sum_{k=-K}^{K}\sum_{l=-L}^{L} w_{kl}\, x_{i+k,\,j+l}\right) \qquad (4)$$
where $k$ and $l$ represent the vertical and horizontal distances from the neuron to an input node, respectively, and $K$ and $L$ are the vertical and horizontal neighborhood sizes, respectively. A pictorial description is given in Fig. 2.
1057-7122/99$10.00 © 1999 IEEE
The neuroimage processing structure can also be grouped into a feed-forward multilayer network or a recurrent network, as shown in Fig. 3. From the computational point of view, the discrete-time recursive network description is equivalent to a multilayer structure: the $n$th iteration corresponds to the $n$th layer.
III. PROPOSED PROCESSOR ARCHITECTURE
The basic architecture is a matrix-vector multiplication processor. Additional signal routing provides flexibility and expandability. The architecture consists of an array of $M$ rows, as shown in Fig. 4. Each row consists of main memory, a multiplier, an accumulator, an output buffer memory (OBM), and a multiplexer. At row $i$, the multiplier receives one input from a main-memory cell. In the main memory, the cells in the same vertical position, i.e., the $j$th column on all rows, are selected at the same time. All multipliers share the other input, the $x$ terminal. The sequence of inputs $x_j$ is applied at this node while the corresponding $j$th column is simultaneously selected in the main memory. The output of the multiplier is accumulated in the accumulator. At the end of the sequence, $j = N$, the accumulator on the $i$th row contains the result of
$$\sum_{j=1}^{N} w_{ij} x_j \qquad (5)$$
where $w_{ij}$ is the content of the main-memory cell on the $i$th row and the $j$th column. A similar computing structure for the optimization problem is reported in [12]. This result can be sent in one of three directions: to the OBM, to one of the main-memory cells on the same row, or down to the next row, with all rows shifting simultaneously. The multiplexer selects one element of the OBM at a time to read out its contents sequentially. The output of the multiplexer is applied to the nonlinear block $f$, which performs additional operations such as addition, subtraction, multiplication, division, nonlinear mapping, and/or logical operations. This block can have an external input. The output of the nonlinear block can be fed back to the $x$ terminal. Since only one nonlinear block is required, a high-performance programmable nonlinear block can be implemented for a wide range of applications. Such a block is usually not affordable in conventional fully parallel implementations.
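The row-parallel, column-sequential accumulation described above can be illustrated in software. The following is a minimal sketch, not the chip's implementation: the function name `vnip_matvec` and the list-of-lists memory layout are assumptions made for illustration, and the inner loop over rows stands in for what the hardware does in parallel.

```python
def vnip_matvec(main_memory, x_sequence):
    """Sketch of the sum in (5): one main-memory column is selected per time step.

    main_memory[i][j] is the cell on row i, column j (hypothetical layout).
    x_sequence[j] is the value applied to the shared input terminal at step j.
    """
    n_rows = len(main_memory)
    accumulators = [0.0] * n_rows              # one accumulator per row
    for j, x_j in enumerate(x_sequence):       # inputs arrive sequentially
        for i in range(n_rows):                # all rows work in parallel on chip
            accumulators[i] += main_memory[i][j] * x_j
    return accumulators


weights = [[1.0, 2.0, 3.0],
           [0.5, -1.0, 0.0]]
result = vnip_matvec(weights, [1.0, 0.0, 2.0])
print(result)
```

Adding rows only lengthens the inner (parallel) dimension, which is why the number of time steps depends on the input length and not on the number of neurons.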
The content of the OBM is accessible independently of the accumulator operation. This feature allows pipelining: the computation is performed by the multipliers and accumulators while data I/O is performed through the OBM and the multiplexer.
Since only one interrow connection is required, the vertical expansion of the network is unlimited. The horizontal expansion is achieved by increasing the horizontal memory size. However, if the main memory is implemented on chip, then the horizontal network size is limited by the silicon area. If it is implemented off chip, then the limited number of pins limits the vertical network size.
IV. APPLICATIONS OF VNIP

A. 1-D Networks
In the case of the 1-D networks, the weights are stored in the main memory, as shown in Fig. 5. Here $y_i^{(n)}$ represents the output of the $i$th neuron on the $n$th layer. To process the first layer, the input is applied as a sequence to the $x$ terminal while the corresponding column is selected in the main memory. At the end of the sequence, the accumulators contain the nets of the first layer. The contents of the accumulators are transferred to the OBM and then the accumulators are reset. The contents of the OBM are read out one by one through the nonlinear block, using the multiplexer. The outputs of the nonlinear block form the output sequence of the first layer. The second layer is processed by applying this output to the $x$ terminal, while the corresponding weights for the second layer are selected in the main memory. Once all the output buffer memories are read out, the accumulators hold the nets of the second layer. This data is then sent to the OBM and read out through the nonlinear block. The output is the sequence of the second layer. A multilayer network can be processed by repeating the above procedure.
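The layer-by-layer procedure can be sketched as a loop in which each layer's output sequence becomes the next layer's input sequence. This is only an illustration: the function names, the hard-limiter nonlinearity, the bipolar (±1) signal coding, the constant bias input, and the XOR weights are all assumptions chosen for the example and are not taken from the chip or its test setup.

```python
def hard_limiter(v):
    # Assumed nonlinear block: a simple sign function.
    return 1.0 if v > 0 else -1.0


def process_layer(weights, inputs, nonlinear):
    """One layer: accumulate the nets, copy them to the OBM, read out through f."""
    accumulators = [0.0] * len(weights)
    for j, x_j in enumerate(inputs):           # input sequence on the shared terminal
        for i, row in enumerate(weights):      # rows operate in parallel on chip
            accumulators[i] += row[j] * x_j
    obm = list(accumulators)                   # transfer; accumulators are then free
    return [nonlinear(net) for net in obm]     # multiplexer reads the OBM one by one


def run_network(layers, inputs, nonlinear):
    signal = list(inputs)
    for weights in layers:                     # output sequence feeds the next layer
        signal = process_layer(weights, signal, nonlinear)
    return signal


# Hypothetical XOR weights; the last input and the third hidden neuron carry a bias of 1.
layer1 = [[1.0, 1.0, 1.0],     # OR-like neuron
          [1.0, 1.0, -1.0],    # AND-like neuron
          [0.0, 0.0, 1.0]]     # passes the bias to the next layer
layer2 = [[1.0, -1.0, -1.0]]   # OR and not AND
xor_out = {(a, b): run_network([layer1, layer2], [a, b, 1.0], hard_limiter)[0]
           for a in (-1.0, 1.0) for b in (-1.0, 1.0)}
```

A discrete-time recurrent network would use the same loop with the same weight matrix repeated for every iteration.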
A discrete-time recurrent network is processed as a multilayer structure whose weights are identical for all layers.
A feed-forward network can have crossover connections from a nonadjacent layer, as shown in Fig. 6(a). The weight $w_{ij}^{(n,m)}$ represents the weight from the $j$th neuron on the $m$th layer to the $i$th neuron on the $n$th layer. This network can be processed using two processors and a two-input multiplexer, as shown in Fig. 6(b). The two processors generate the nets for the odd and even layers, alternately. Similarly, when there are crossover connections from the $n$th lower layer, $n$ processors and an $n$-input multiplexer are required.
B. 2-D Networks
Based on (4), define the partial net accumulated over the weight rows $-K$ through $k$ as
$$net_{ij}^{(k)} = \sum_{m=-K}^{k}\sum_{l=-L}^{L} w_{ml}\, x_{i+m,\,j+l}. \qquad (6)$$
Then the following recursive equation can be formulated:
$$net_{ij}^{(k)} = net_{ij}^{(k-1)} + \sum_{l=-L}^{L} w_{kl}\, x_{i+k,\,j+l} \qquad (7)$$
with $net_{ij}^{(-K-1)} = 0$, where $w_{kl}$ is the $l$th element of $\mathbf{w}_k$, the $k$th row vector in the weight matrix $W$. Fig. 7 shows these indexing terms in 2-D networks. The $j$ and $i$ in (7) are the column and the row numbers on the 2-D data, respectively, and $l$ is the column index within the neighborhood and the weight matrix. Then $y_{ij}$ is given by
$$y_{ij} = f\left(net_{ij}^{(K)}\right). \qquad (8)$$
In (7), the summation is equivalent to that of (5) from the computational point of view. The only difference is that in (5) the weight is a 2-D array and the signal is a 1-D vector, while in (7) the signal is a 2-D array and the weight is a 1-D vector. This means that a fixed hardware structure can process both cases.
The 2-D propagating data, $x_{ij}$, are stored in the main memory and the weights, $w_{kl}$, are applied to the $x$ terminal as a sequence, as shown in Fig. 8. Since the proposed structure is a linear array of processing elements, a 2-D signal is processed column by column.
For a given column $j$, the first stage uses $k = -K$. The weights $w_{-K,l}$ are applied to the $x$ terminal, while the corresponding columns $j + l$ are selected from the main memory for $l = -L, \ldots, L$. Once this procedure is finished, the accumulators contain $net_{ij}^{(-K)}$. The contents of the accumulators are shifted down and the above procedure is repeated with the weights $w_{-K+1,l}$. After $2K + 1$ iterations, the accumulators contain $net_{ij}^{(K)}$, which completes the net computations for the given $j$. Then the contents of the accumulators are transferred to the OBM and read out through the nonlinear block. The output sequence corresponds to the result for the $j$th column. The above procedure is repeated for the next column.
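The column-by-column procedure with shift-down can be checked against a direct evaluation of the shift-invariant sum in (4). The sketch below is an assumption-laden software model, not the chip's timing: it assumes a weight array with an odd number of rows and columns centered on the neuron, zero padding outside the image, a zero entering the top row on each shift, and the sign convention in which the weight at offset (k, l) multiplies the input at row i + k, column j + l. With these assumptions, the finished net of neuron row i ends up in the accumulator of row i + K.

```python
def process_column(image, weights, j):
    """Accumulate one weight row per stage, shifting the accumulators down between stages."""
    rows, cols = len(image), len(image[0])
    K, L = len(weights) // 2, len(weights[0]) // 2
    acc = [0.0] * rows
    for k in range(-K, K + 1):
        if k > -K:
            acc = [0.0] + acc[:-1]                 # shift down; zero enters the top row
        for l in range(-L, L + 1):                 # weight sequence on the shared terminal
            c = j + l
            if 0 <= c < cols:                      # zero padding outside the image
                w = weights[k + K][l + L]
                for r in range(rows):              # all rows in parallel on chip
                    acc[r] += w * image[r][c]
    return acc                                     # acc[i + K] holds the net of neuron (i, j)


def direct_net(image, weights, i, j):
    """Reference: the double sum of (4) evaluated directly, before the nonlinearity."""
    rows, cols = len(image), len(image[0])
    K, L = len(weights) // 2, len(weights[0]) // 2
    total = 0.0
    for k in range(-K, K + 1):
        for l in range(-L, L + 1):
            r, c = i + k, j + l
            if 0 <= r < rows and 0 <= c < cols:
                total += weights[k + K][l + L] * image[r][c]
    return total


image = [[1.0, 2.0, 3.0, 4.0],
         [0.0, -1.0, 2.0, 1.0],
         [3.0, 1.0, 0.0, -2.0],
         [2.0, 2.0, 1.0, 0.0]]
weights = [[1.0, 0.0, -1.0],
           [2.0, 1.0, 0.0],
           [0.0, -3.0, 1.0]]
K = len(weights) // 2
matches = all(
    abs(process_column(image, weights, j)[i + K] - direct_net(image, weights, i, j)) < 1e-9
    for j in range(4) for i in range(4 - K)
)
```

In this model the nets of the last K neuron rows would leave the bottom of the array, which corresponds to the shift into the next chip when the array is expanded vertically.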
A 2-D multilayer network can be realized in a way similar to the 1-D case. The output sequence from the nonlinear block can be sent back to the main memory as a sequence, or the nets can be directly sent back to the main memory in parallel.
V. CIRCUIT DESIGN
The major building block in the VNIP is the weighted integrator, which consists of a multiplier and an integrator. If the multiplier offset is not canceled, this offset is undesirably accumulated in the integrator. A switched-capacitor weighted integrator with offset cancellation is proposed to tackle this problem. Fig. 9 shows its schematic. The multiplier, including the offsets, can be modeled as
$$z(x, y) = k\,(x + x_{os})(y + y_{os}) + z_{os} \qquad (9)$$
where $k$ is a multiplication constant and $x_{os}$, $y_{os}$, and $z_{os}$ are the offsets. These offsets can be canceled using four combinations of input signal polarity, as follows:
$$z(x, y),\quad z(-x, y),\quad z(x, -y),\quad z(-x, -y). \qquad (10)$$
Then
$$z(x, y) - z(-x, y) - z(x, -y) + z(-x, -y) = 4kxy. \qquad (11)$$
In Fig. 9, the integrator is driven by two nonoverlapping clocks, and the charge injected into the integrator corresponds to the multiplier output $z$. Using two further clocks, the switches at the multiplier input interchange the differential input lines, so that the actual inputs to the multiplier become $\pm x$ and $\pm y$. At the end of the fourth phase, the integrator contains the offset-canceled multiplication, as in (11). A folded CMOS Gilbert multiplier [13] is used.
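The four-phase cancellation can be verified numerically. The sketch below assumes the common multiplier model with two input offsets and one output offset, and it assumes that each phase's result is integrated with the sign given by the product of the two input polarities; the phase order, parameter names, and numeric values are illustrative assumptions and do not describe the actual clocking of the chip.

```python
def multiplier(x, y, k=1.0, x_os=0.0, y_os=0.0, z_os=0.0):
    """Assumed multiplier model with input offsets x_os, y_os and output offset z_os."""
    return k * (x + x_os) * (y + y_os) + z_os


def weighted_integrate(x, y, **params):
    """Integrate four phases with swapped input polarities (order is an assumption)."""
    phases = [(+1, +1), (-1, +1), (-1, -1), (+1, -1)]
    integrator = 0.0
    for sx, sy in phases:
        # Swapping a differential pair negates that input; the integration sign
        # sx * sy makes the wanted product add up while the offset terms cancel.
        integrator += sx * sy * multiplier(sx * x, sy * y, **params)
    return integrator


ideal = 4 * 0.5 * 2.0 * 3.0    # 4 * k * x * y
measured = weighted_integrate(2.0, 3.0, k=0.5, x_os=0.25, y_os=-0.125, z_os=0.5)
```

A single uncorrected phase of the same model, `multiplier(2.0, 3.0, k=0.5, x_os=0.25, y_os=-0.125, z_os=0.5)`, differs from `0.5 * 2.0 * 3.0`, which is the error that would otherwise build up in the integrator over a long input sequence.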
The other key analog block is the output multiplexer. This block reads out the content of the OBM without destroying it. The simple buffer shown in Fig. 10 is used. A pair of these circuits is used for the differential structure. A two-stage buffer is used to minimize the clock feedthrough from the selection switch to the OBM capacitor. The sources of the source followers are connected together and form a data bus line that is biased by a current source. Since only one of the selection switches in the OBM is turned on at a time, only one source follower drives the data bus line.
Fig. 11 shows the block diagram of the VNIP. The gray rectangular boxes represent the pads. All the signals and building blocks are fully differential. The inputs come from the memory and are held in sample-and-hold (S/H) circuits. One S/H circuit is used for the input terminal. The four integrator clock signals are generated by an on-chip clock generator. The signal in the integrator can be copied to the OBM by turning on the transfer switch. The content of the OBM can be shifted down to the integrator one row below, using the down switch. The bottom row has an output buffer to drive the to-lower pad. This pad is connected to the from-upper pad on the next chip for expansion. The expansion is insensitive to the large capacitance of the expansion pin because the pin is driven by a relatively large buffer. This structure allows unlimited network size expansion. The down operation in 2-D networks is achieved by a transfer-reset-down clock sequence. The signal on the data bus line can be sent to the pad through the output buffer. It can also be sent to the multiplier's $x$ terminal by turning on the loop switch. The contents of the accumulators can be accessed in parallel through the same pads as the input signals, which are multiplexed by a switch. The main memory can be a capacitor memory array used as a short-term memory.
In the case of the 1-D network, the main memory should be a permanent memory, which can be implemented using a floating-gate array. In the case of an image processor, the main memory is replaced with a photosensor array as an input device.
One effect of process variation is gain mismatch among the multipliers. The gain of each row can be obtained by measuring the output of the row with a test input signal. For the 1-D network, the weights on a row should be scaled according to the measured gain of that row. This procedure can be incorporated in the training. In the case of the 2-D network, the gain of each row is stored in an external permanent memory and used for compensation at the output.
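The two compensation routes can be sketched as follows. This is a simplified model under stated assumptions: each row is treated as an ideal sum of products multiplied by a single gain factor, gains are expressed relative to a nominal gain of 1, and the function names and numeric values are hypothetical.

```python
def row_output(gain, stored_values, inputs):
    """Assumed row model: ideal sum of products scaled by the row's multiplier gain."""
    return gain * sum(v * x for v, x in zip(stored_values, inputs))


def measure_gain(gain, test_value=1.0, test_input=1.0):
    """Estimate a row gain from its response to a known test input."""
    return row_output(gain, [test_value], [test_input]) / (test_value * test_input)


def prescale_weights(weights, measured_gain):
    """1-D case: scale the stored weights so that the row gain cancels."""
    return [w / measured_gain for w in weights]


def correct_output(raw_output, measured_gain):
    """2-D case: keep the stored data unchanged and divide the output by the gain."""
    return raw_output / measured_gain


true_gains = [0.9, 1.1]                      # hypothetical mismatch between two rows
weights = [0.5, -0.25, 1.0]
inputs = [1.0, 2.0, -1.0]
ideal = sum(w * x for w, x in zip(weights, inputs))

compensated_1d = [row_output(g, prescale_weights(weights, measure_gain(g)), inputs)
                  for g in true_gains]
compensated_2d = [correct_output(row_output(g, weights, inputs), measure_gain(g))
                  for g in true_gains]
```

Prescaling suits the 1-D case because the weights are stored on the row anyway, whereas in the 2-D case the row holds image data, so the correction is applied after readout.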
The proof-of-concept chip is fabricated using a double-poly 1.2-μm CMOS process to demonstrate the flexibility of the proposed architecture. Fig. 12 shows a microphotograph of the fabricated chip. This chip contains eight rows of processing elements. The maximum clock frequency of the test chip is 4 MHz. The silicon area is /row. The throughput is $12 \times 10^{6}$ synapses/s·mm$^2$ and the energy consumption is $10^{-9}$ J/synapse, where one synapse corresponds to one multiplication and addition. Note that these data do not include the main memory, whose implementation is application dependent. The accuracy of the computation depends on the choice of circuit, power supply voltage, application, and network size.
VI. EXPERIMENTAL TEST-CHIP RESULTS
The main memory is emulated using a PC and data-acquisition boards. A software program is written for the user interface. For illustration purposes, three different types of neural networks (a two-layer feed-forward network, a fully connected recurrent network, and a 2-D network) are implemented without any hardware modification. Though the network size expansion is not limited, small-sized examples are presented in this paper.
The XOR problem, depicted in Fig. 13(a), is implemented to demonstrate the application to the two-layer feed-forward network. Fig. 13(b) shows its truth table. The weights are properly chosen and then a test input sequence is applied to the input of the network. Fig. 13(c) shows the states of the accumulators measured from the chip, using an oscilloscope. The expected signs of the neurons' states for the above test input sequence are obtained.
The winner-takes-all (WTA) problem is implemented to demonstrate the application to a 1-D recurrent network, as shown in Fig. 14(a). The initial conditions are set to 1 and 0.8 V, respectively. The weights between neurons are set to 0.2 V and the self-loop weights are set to 1 V. Fig. 14(b) shows the output of each neuron with respect to the iteration number. The graph is taken from the user interface of the software. The winner, which was set to 1 V, stays at 1 V, but the loser, which was set to 0.8 V, goes to 0 V.
A 3 × 3 neighborhood is implemented to demonstrate the application to a 2-D network. The white pixel in Fig. 15 represents 4 V, the black represents −4 V, and the gray represents zero. A propagation test is performed as the functionality test [14]. Fig. 15 shows one example of a propagation test. Since only the lower left weight is white, the image should be shifted toward the upper right, without changing the polarity. The output image shows the expected result.
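The direction of the shift in the propagation test follows from the shift-invariant neighborhood sum: if only the lower left weight is nonzero, each output pixel copies the input pixel to its lower left, so image content moves up and to the right. The sketch below illustrates this under assumptions: a unit weight stands in for the "white" weight, rows are numbered from the top, pixels outside the image are treated as zero, and the nonlinear block is omitted.

```python
def apply_neighborhood(image, weights):
    """Shift-invariant neighborhood sum: weight (k, l) multiplies input at (i + k, j + l)."""
    rows, cols = len(image), len(image[0])
    K, L = len(weights) // 2, len(weights[0]) // 2
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for k in range(-K, K + 1):
                for l in range(-L, L + 1):
                    r, c = i + k, j + l
                    if 0 <= r < rows and 0 <= c < cols:   # zero outside the image
                        out[i][j] += weights[k + K][l + L] * image[r][c]
    return out


# Only the lower left weight is nonzero (last row, first column of the 3 x 3 array).
weights = [[0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0],
           [1.0, 0.0, 0.0]]
image = [[0.0] * 4 for _ in range(4)]
image[2][1] = 4.0                      # one white pixel at row 2, column 1
shifted = apply_neighborhood(image, weights)
```

Here the white pixel moves from row 2, column 1 to row 1, column 2, that is, one step up and one step to the right with unchanged polarity.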
VII. CONCLUSION
A tradeoff of versatility versus circuit complexity has been implemented in the proposed VNIP. The proposed neuroimage processor provides a flexible and expandable architecture that is capable of processing a number of neural network or image processing structures without any hardware modification, and a wide range of applications can be expected from this processor. The structure allows unlimited expansion of network size and the compensation of process variation.
