Abstract-We present a mixed-signal distributed VLSI architecture for massively parallel array processing, with fine-grain embedded memory. The three-transistor processing element in the array combines a charge injection device (CID) binary multiplier and analog accumulator with embedded dynamic random-access memory (DRAM). A prototype 512 x 128 vector-matrix multiplier on a single 3 mm x 3 mm chip fabricated in standard 0.5 µm CMOS technology achieves 8-bit effective resolution and dissipates 0.5 pJ per multiply-accumulate.
I. INTRODUCTION
One of the greatest challenges in the performance of computer systems today is limited memory bandwidth. Conventional solutions to the speed mismatch between microprocessors and memory devote a large fraction of the transistors and area of the chips to static memory caches, leading to sub-optimal computational efficiency and silicon area. Embedded designs with memory and logic integrated together are clearly more desirable for memory intensive tasks.
We propose a massively parallel fine-grain array processor architecture with each cell containing a computing device and a storage element. We employ a multiply-accumulate processing element as the computing device to perform a very computationally intensive operation, vector-matrix multiplication (VMM) in large dimensions. VMM in large dimensions is one of the most common, but computationally most expensive, operations in algorithms for machine vision, image classification and pattern recognition. Most parallel systems either require centralized memory resources, i.e., RAM shared on a bus, which limits the available throughput, or integrate memories with digital processing elements at a large cost in silicon area, which significantly limits the dimensions of the matrices operated on. A fine-grain, fully parallel architecture that integrates memory and processing elements yields high computational throughput and high density of integration. The ideal scenario for array processing (in the case of vector-matrix multiplication) is where each processor performs one multiply and locally stores one coefficient. The advantage of this is a throughput that scales linearly with the dimensions of the implemented array.
The recurring problem with digital implementation is the latency in accumulating the result over a large number of cells. Also, the extensive silicon area and power dissipation of a digital multiply-and-accumulate implementation make this approach prohibitive for very large (100-10,000) matrix dimensions. Analog VLSI provides a natural medium to implement fully parallel computational arrays with high integration density and energy efficiency [5]. By summing charge or current on a single wire across cells in the array, low latency is intrinsic. Analog multiply-and-accumulate circuits are so small that one can be provided for each matrix element, making massively parallel implementations with large matrix dimensions feasible. Fully parallel implementation of (1) requires an M x N array of cells, each cell containing a product computing device and a storage element. Each cell (m, n) computes the product of input component X(n) and matrix element W(m,n), and dumps the resulting current or charge on a horizontal output summing line. The device storing W(m,n) is usually incorporated into the computational cell to avoid performance limitations due to low external memory access bandwidth. The main problem with purely analog implementation is the effect of noise and component mismatch on precision. To this end, we propose the use of hybrid analog-digital technology to simultaneously add a large number of digital values in parallel, with careful consideration of sources of imprecision in the implementation and their overall effect on the system performance. Our approach combines the computational efficiency of analog array processing with the precision of digital processing and the convenience of a programmable and reconfigurable digital interface.
A mixed-signal array architecture with binary decomposed matrix and vector elements is described in Section II. VLSI implementation is presented in Section III. Section IV quantifies the improvements in system precision obtained by postprocessing the quantized outputs of the array in the digital domain, and by compensating for analog computation offset errors.
II. MIXED-SIGNAL ARCHITECTURE

A. Internally Analog, Externally Digital Computation
The system presented is internally implemented in analog VLSI technology, but interfaces externally with the digital world. This paradigm combines the best of both worlds: it uses the efficiency of massively parallel analog computing (in particular, adding numbers in parallel on a single wire), but allows for a modular, configurable interface with other digital pre-processing and post-processing systems. This is necessary to make the processor a general-purpose device that can tailor the vector-matrix multiplication task to the particular application where it is being used.
The digital representation is embedded, in both bit-serial and bit-parallel fashion, in the analog array architecture (Fig. 1). Inputs are presented in bit-serial fashion, and matrix elements are stored locally in bit-parallel form. Digital-to-analog (D/A) conversion at the input interface is inherent in the bit-serial implementation, and row-parallel analog-to-digital (A/D) converters are used at the output interface.
For simplicity, an unsigned binary encoding of inputs and matrix elements is assumed here, for one-quadrant multiplication. This assumption is not essential: it has no binding effect on the architecture and can be easily extended to a standard one's complement for four-quadrant multiplication, in which the most significant bits (MSB) of both arguments have a negative rather than positive weight. Assume further I-bit encoding of matrix elements, and J-bit encoding of inputs.
The proposed mixed-signal approach is to compute and accumulate the binary-binary partial products (5) using an analog VMM array, and to combine the quantized results in the digital domain according to (4).
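The relations referred to in the text as (1), (4), and (5) can be sketched as follows. This is a reconstruction from the surrounding description, not a quotation: the fractional weights 2^{-i-1} and 2^{-j-1}, with index 0 denoting the MSB, are an assumed convention chosen to agree with "LSB first (i.e., descending index j)", and the index ranges are likewise assumed.

```latex
\begin{align}
Y(m) &= \sum_{n=0}^{N-1} W(m,n)\, X(n) \tag{1} \\
W(m,n) &= \sum_{i=0}^{I-1} 2^{-i-1}\, w_i(m,n), \qquad
X(n) = \sum_{j=0}^{J-1} 2^{-j-1}\, x_j(n) \notag \\
Y(m) &= \sum_{i=0}^{I-1} \sum_{j=0}^{J-1} 2^{-i-j-2}\, Y_{i,j}(m) \tag{4} \\
Y_{i,j}(m) &= \sum_{n=0}^{N-1} w_i(m,n)\, x_j(n) \tag{5}
\end{align}
```

Substituting the two binary expansions into (1) and exchanging the order of summation gives (4), with each binary-binary partial product (5) being a count of cells in which both the stored bit and the input bit are 1.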
B. Array Architecture and Data Flow
To conveniently implement the partial products (5), the binary encoded matrix elements w_i(m,n) are stored in bit-parallel form, and the binary encoded inputs are presented in bit-serial fashion. The bit-serial format was first proposed and demonstrated in [8], with binary-analog partial products using analog matrix elements for higher density of integration. The use of binary encoded matrix elements relaxes precision requirements and simplifies storage [9]. One row of I-bit encoded matrix elements uses I rows of binary cells. Therefore, to store an M x N digital matrix W(m,n), an array of MI x N binary cells is needed. One bit of an input vector is presented each clock cycle, taking J clock cycles of partial products (5) to complete a full computational cycle (1). The input binary components x_j(n) are presented least significant bit (LSB) first, to facilitate the digital postprocessing to obtain (4) from (5) (as elaborated in Section IV).
Figure 1 depicts one row of matrix elements W(m,n) in the binary encoded architecture, comprising I rows of binary cells where I = 4 in the example shown. The data flow is illustrated for a digital input series x_j(n) of J = 4 bits, LSB first (i.e., descending index j). The corresponding analog series of outputs Y_{i,j}(m) in (5) obtained at the horizontal summing nodes of the analog array is quantized by a bank of analog-to-digital converters (ADC), and digital postprocessing (4) of the quantized series of output vectors yields the final digital result (1).
The quantization scheme used is critical to system performance. As shown in Section IV, appropriate postprocessing in the digital domain to obtain (4) from the quantized partial products Y_{i,j}(m) can lead to a significant enhancement in system resolution, well beyond that of intrinsic ADC resolution. This relaxes precision requirements on the analog implementation of the partial products (5). A dense and efficient charge-mode VLSI implementation is described next.
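The data flow described above can be illustrated with a small behavioural model. This is only a sketch under stated assumptions: the function names `to_bits` and `bit_serial_vmm` are hypothetical, the analog summation is modelled as an ideal integer count, bit index 0 is taken to be the MSB, and integer weights are used in place of fractional ones (which only rescales the result by a constant).

```python
def to_bits(value, nbits):
    """Unsigned integer -> list of bits, index 0 = MSB (assumed convention)."""
    return [(value >> (nbits - 1 - k)) & 1 for k in range(nbits)]


def bit_serial_vmm(W, X, I, J):
    """Compute Y(m) = sum_n W(m,n) X(n) from binary-binary partial products.

    W is an M x N matrix of I-bit unsigned integers and X is an N-vector of
    J-bit unsigned integers. Returns the exact integer outputs.
    """
    M, N = len(W), len(X)

    # Storage: M*I rows of N binary cells; row m*I + i holds bit i of row m.
    cells = []
    for m in range(M):
        planes = [to_bits(W[m][n], I) for n in range(N)]
        for i in range(I):
            cells.append([planes[n][i] for n in range(N)])

    x_bits = [to_bits(x, J) for x in X]
    Y = [0] * M

    # One input bit plane per clock cycle, LSB first (descending index j).
    for j in range(J - 1, -1, -1):
        for m in range(M):
            for i in range(I):
                # Partial product: number of cells whose stored bit and
                # input bit are both 1 (ideal model of the summing line).
                y_ij = sum(cells[m * I + i][n] & x_bits[n][j] for n in range(N))
                # Digital accumulation with the binary weight of this bit pair.
                Y[m] += y_ij << ((I - 1 - i) + (J - 1 - j))
    return Y
```

For example, `bit_serial_vmm([[3, 5, 0], [15, 1, 7]], [2, 9, 15], 4, 4)` stores the 2 x 3 matrix in 8 rows of 3 binary cells and takes 4 input cycles, reproducing the ordinary integer product.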
III. CHARGE-MODE VLSI IMPLEMENTATION
A. CID/DRAM Cell and Array
The elementary cell combines a CID computational unit with a DRAM storage element. ... where C_M3 is the total capacitance on the output line across cells. The total response is thus proportional to the number of actively transferring cells. After deactivating the input x_j(n), the transferred charge returns to the storage node M2. The CID computation is non-destructive and intrinsically reversible [8], and DRAM refresh is only required to counteract junction and subthreshold leakage.
The bottom diagram in Figure 2 ... Transistor-level simulation of a 512-element row indicates a dynamic range of 43 dB, and a computational cycle of 10 µs with power consumption of 50 nW per cell. Experimental results from a fabricated prototype are presented next.
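As a quick consistency check, the simulated per-cell figures can be converted into energy per operation. The sketch below assumes a cycle time of 10 µs, one binary multiply-accumulate per cell per computational cycle, and no contribution from peripheral circuits; the variable names are illustrative.

```python
# Simulated per-cell figures quoted in the text.
power_per_cell_w = 50e-9   # 50 nW per cell
cycle_time_s = 10e-6       # 10 microsecond computational cycle

# Assumption: each cell completes one binary multiply-accumulate per cycle.
energy_per_mac_j = power_per_cell_w * cycle_time_s

# Since 1 W = 1 J/s, MACs per second per watt equals MACs per joule.
macs_per_joule = 1.0 / energy_per_mac_j

print(f"energy per binary MAC: {energy_per_mac_j * 1e12:.2f} pJ")
print(f"binary MACs per second per watt: {macs_per_joule:.1e}")
```

Under these assumptions the result is 0.5 pJ per binary multiply-accumulate, or about 2 x 10^12 binary multiply-accumulates per second per watt.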
B. Experimental Results
We designed, fabricated and tested a VLSI prototype of the vector-matrix multiplier, integrated on a 3 mm x 3 mm die in 0.5 µm CMOS technology. The chip contains an array of 512 x 128 CID/DRAM cells, and a row-parallel bank of 128 gray-code flash ADCs. Figure 3 depicts the micrograph and system floorplan of the chip. The layout size of the CID/DRAM cell is 8λ x 45λ with λ = 0.3 µm. The mixed-signal VMM processor interfaces externally in digital format. Two separate shift registers load the matrix elements along odd and even columns of the DRAM array. Integrated refresh circuitry periodically updates the charge stored in the array to compensate for leakage. Vertical bit lines extend across the array, with two rows of sense amplifiers at the top and bottom of the array. The refresh alternates between even and odd columns, with separate select lines. Stored charge corresponding to matrix element values can also be read and shifted out from the chip for test purposes. All of the supporting digital clocks and control signals are generated on-chip.
Fig. 3. Micrograph of the mixed-signal VMM prototype, containing an array of 512 x 128 CID/DRAM cells, and a row-parallel bank of 128 flash ADCs. Die size is 3 mm x 3 mm in 0.5 µm CMOS technology.
A. Accumulation and Quantization
Significant improvements in precision can be obtained by exploiting the binary representation of matrix elements and vector inputs, and performing the computation (4) in the digital domain, from quantized estimates of the partial outputs (5).
We quantize all I x J values of Y_{i,j}(m) using row-parallel flash A/D converters. Figure 5 presents the corresponding architecture, shown for a single output vector component m. The summation of the partial products is then performed in the digital domain. We obtain an improvement in signal-to-quantization-noise ratio by a factor of 3 and a median resolution gain of approximately 2 bits over the resolution of each ADC.
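The quantize-then-combine scheme can be sketched for one output row as follows. This is an idealized model, not the chip's circuit: `flash_adc` is a hypothetical uniform quantizer over the range of possible cell counts, bit index 0 is assumed to be the MSB, and integer binary weights are used. The sketch shows only the structure of the computation and makes no attempt to reproduce the reported resolution gain.

```python
def to_bits(value, nbits):
    """Unsigned integer -> bit list, index 0 = MSB (assumed convention)."""
    return [(value >> (nbits - 1 - k)) & 1 for k in range(nbits)]


def flash_adc(count, n_cells, adc_bits):
    """Idealized uniform ADC: map a count in [0, n_cells] to an integer code."""
    levels = (1 << adc_bits) - 1
    # Round to the nearest code using integer arithmetic.
    return (2 * count * levels + n_cells) // (2 * n_cells)


def quantized_row_output(w_row, x, I, J, adc_bits):
    """Return (digital estimate, exact value) of sum_n w_row[n] * x[n]."""
    n_cells = len(x)
    levels = (1 << adc_bits) - 1
    w_bits = [to_bits(w, I) for w in w_row]
    x_bits = [to_bits(v, J) for v in x]
    estimate, exact = 0.0, 0
    for i in range(I):
        for j in range(J):
            # Ideal analog partial product: cells with both bits set.
            count = sum(w_bits[n][i] & x_bits[n][j] for n in range(n_cells))
            # One of the I x J conversions for this output component.
            code = flash_adc(count, n_cells, adc_bits)
            weight = 1 << ((I - 1 - i) + (J - 1 - j))
            # Rescale the code to a count estimate and accumulate digitally.
            estimate += weight * code * n_cells / levels
            exact += weight * count
    return estimate, exact
```

With enough ADC levels to resolve every possible count the estimate is exact; with a coarser ADC, each of the I x J conversions contributes at most half a quantization step, scaled by its binary weight.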
B. Multi-Chip System with Offset Compensation
Other significant sources of error in analog array-based computation are input-dependent feedthrough, and input- and time-dependent charge leakage in DRAM storage cells, both of which introduce offsets. Both of these errors are compensated for in a multi-chip VMM architecture by using one reference chip supplied with identical inputs, a synchronous refresh clock, and all logic "0" matrix elements. Subtraction of the outputs of equivalent rows in the digital domain eliminates both input-dependent and temporal errors, as shown in detail in [17].
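The reference-subtraction idea can be illustrated with a toy model. Everything about the error term here is an assumption made for illustration: `offset` is a hypothetical additive error that depends only on the input bits and on time, is identical on both chips, and is expressed in arbitrary integer units. Real feedthrough and leakage need not behave this simply.

```python
def offset(x_bits, t):
    """Hypothetical additive error: input feedthrough plus time-dependent leakage."""
    return 3 * sum(x_bits) + t


def row_output(w_bits, x_bits, t):
    """Toy row output: ideal binary partial product plus the additive error."""
    ideal = sum(w & x for w, x in zip(w_bits, x_bits))
    return ideal + offset(x_bits, t)


def compensated_output(w_bits, x_bits, t):
    """Subtract the output of a reference row storing all logic 0 elements."""
    active = row_output(w_bits, x_bits, t)
    # Reference chip: identical inputs and timing, all-zero matrix elements.
    reference = row_output([0] * len(w_bits), x_bits, t)
    return active - reference


w_bits = [1, 0, 1, 1]
x_bits = [1, 1, 0, 1]
print(row_output(w_bits, x_bits, t=7))          # raw output including the error
print(compensated_output(w_bits, x_bits, t=7))  # error removed by subtraction
```

In this model the subtraction recovers the ideal partial product exactly because the error term is common to both rows; any error component that depends on the stored matrix values would not cancel.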
V. CONCLUSIONS
A charge-mode VLSI array processor for matrix operations in large dimensions (N, M = 100-10,000) has been presented. The architecture embeds storage and multiplication in distributed fashion, down to the cellular level. With only three transistors, the cell for multiplication and storage contains little more than either a DRAM or a CID cell. This makes the analog cell very compact and low power, and the regular array of cells provides for a scalable architecture that can easily be extended. Fine-grain massive parallelism and distributed memory provide computational efficiency (bandwidth to power consumption ratio) exceeding that of digital multiprocessors and DSPs by several orders of magnitude. A 512 x 128 VMM prototype fabricated in 0.5 µm CMOS offers 2 x 10^12 binary MACS (multiply-accumulates per second) per Watt of power.
