Abstract-The mixed-signal processor performs digital vector-matrix multiplication using internally analog fine-grain parallel computing. The three-transistor CID/DRAM unit cell combines single-bit dynamic storage, binary multiplication, and zero-latency analog accumulation. Matrix coefficients are stored in a bit-parallel form. Delta-sigma analog-to-digital conversion of the analog array outputs is combined with oversampled unary coding of the digital inputs. Sorting of unary inputs results in at most a single input line transition for arbitrary multi-bit inputs. This amounts to a linear gain in energy efficiency of the computational array in the number of bits of the input vector. The 256 × 128 CID/DRAM processor with 128 integrated delta-sigma ADCs measures 3 mm × 3 mm in 0.5 µm CMOS and delivers 6.5 GMACS dissipating 5.9 mW of power. CID/DRAM array dynamic power dissipation is reduced by a factor of four through sorting 8-bit inputs.
I. INTRODUCTION
Real-time computing of linear transforms on a battery-powered mobile platform imposes great demands on computational throughput and power consumption. The computational core of linear transforms in image and video processing applications, such as artificial vision and human-computer interfaces, is vector-matrix multiplication (VMM) in high dimensions:
$$Y_m = \sum_{n=0}^{N-1} W_{mn} X_n \qquad (1)$$
with $N$-dimensional input vector $X_n$, $M$-dimensional output vector $Y_m$, and $M \times N$ matrix elements $W_{mn}$ (templates). The presented mixed-signal VMM processor contains a fine-grain parallel computational array, achieving a computational throughput of 1.1 GMACS for every mW of power. In what follows we concentrate on massively parallel VMM computation on a mixed-signal VLSI architecture with minimal activity inputs.
II. MIXED-SIGNAL COMPUTATION

A. Internally Analog, Externally Digital Computation
The approach combines the computational efficiency of analog array processing with the precision of digital processing and the convenience of a programmable and reconfigurable digital interface.
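The digital representation is embedded in the analog array architecture, with matrix elements stored locally in bit-parallel form
$$W_{mn} = \sum_{i=0}^{I-1} 2^{-i-1} w_{mn}^{(i)} \qquad (2)$$
and inputs presented in bit-serial fashion
$$X_n = \sum_{j=0}^{J-1} \gamma_j x_n^{(j)} \qquad (3)$$
where the coefficients $\gamma_j$ are assumed in radix two, depending on the form of input encoding used. The VMM task (1) then decomposes into
$$Y_m = \sum_{i=0}^{I-1} 2^{-i-1} Y_m^{(i)} \qquad (4)$$
with VMM partials
$$Y_m^{(i)} = \sum_{j=0}^{J-1} \gamma_j Y_m^{(i,j)} \qquad (5)$$
$$Y_m^{(i,j)} = \sum_{n=0}^{N-1} w_{mn}^{(i)} x_n^{(j)} \qquad (6)$$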
The binary-binary partial products (6) are conveniently computed and accumulated, with zero latency, using an analog VMM array [1], [2]. In principle, the VMM partials (6) can be quantized by a bank of flash analog-to-digital converters (ADCs), and the results accumulated in the digital domain according to (5) and (4) to yield a digital output resolution exceeding the analog precision of the array and the quantizers [3]. In the present work, an oversampling ADC accumulates the sum (5) in the analog domain, with inputs encoded in unary format ($\gamma_j = 1$). This avoids the need for high-resolution flash ADCs, which are replaced with single-bit quantizers in the delta-sigma loop.
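As an illustrative check, the decomposition of the VMM into binary partial products can be verified numerically. The sketch below is not the authors' implementation: it assumes unsigned operands, MSB-first fractional bit weights $2^{-(i+1)}$ for the stored matrix bits, and unary inputs with $\gamma_j = 1$. All function names are hypothetical.

```python
def to_bits_msb_first(value, n_bits):
    """Binary digits of a non-negative integer, most significant bit first."""
    return [(value >> (n_bits - 1 - i)) & 1 for i in range(n_bits)]


def to_unary(value, length):
    """Unary code: `value` ones among `length` equally weighted bits."""
    return [1] * value + [0] * (length - value)


def vmm_via_partials(W_int, X_int, I, J):
    """VMM assembled from binary-unary partial products.

    W_int[m][n] is an I-bit unsigned integer, interpreted as W = W_int / 2**I
    (assumed normalization). X_int[n] is an integer in 0..J, presented as J
    unary bits of equal weight.
    """
    M, N = len(W_int), len(X_int)
    w = [[to_bits_msb_first(W_int[m][n], I) for n in range(N)] for m in range(M)]
    x = [to_unary(X_int[n], J) for n in range(N)]
    Y = []
    for m in range(M):
        y_m = 0.0
        for i in range(I):          # bit planes, computed in parallel on the array
            y_mi = 0
            for j in range(J):      # unary input bits, presented in sequence
                # one array partial: sum over n of w_mn^(i) AND x_n^(j)
                y_mi += sum(w[m][n][i] & x[n][j] for n in range(N))
            y_m += 2.0 ** (-(i + 1)) * y_mi   # bit-plane weighting
        Y.append(y_m)
    return Y
```

For integer test data, the result can be compared against the direct sum `sum(W_int[m][n] * X_int[n]) / 2**I`, which it should reproduce exactly under the stated assumptions.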
B. CID/DRAM Cell and Array
The unit cell in the analog array combines a CID (charge injection device [4]) computational element [2] with a DRAM storage element. The cell stores one bit of a matrix element $w_{mn}^{(i)}$, performs a one-quadrant binary-unary (or binary-binary) multiplication of $w_{mn}^{(i)}$ and $x_n^{(j)}$ in (6), and accumulates the result across cells with common $m$ and $i$ indices. The circuit diagram and operation of the cell are given in Fig. 1. It performs a non-destructive computation since the transferred charge, $Q$, is sensed capacitively at the output. An array of cells thus performs (unsigned) binary-unary multiplication (6) of matrix $w_{mn}^{(i)}$ and vector $x_n^{(j)}$ yielding $Y_m^{(i,j)}$, for values of $i$ in parallel across the array, and values of $j$ in sequence over time.
C. Oversampling Mixed-Signal Array Processing
The conventional delta-sigma (∆Σ) ADC design paradigm relaxes the precision requirements on analog circuits needed to attain high conversion resolution, at the expense of bandwidth. In the presented architecture a high conversion rate is maintained by combining delta-sigma analog-to-digital conversion with oversampled encoding of the digital inputs, where the delta-sigma modulator integrates the partial multiply-and-accumulate outputs (6) from the analog array according to (5). Fig. 2 depicts one row of matrix elements $W_{mn}$ in the ∆Σ oversampling architecture, encoded in $I = 4$ bit-parallel rows of CID/DRAM cells. One bit of a unary-coded input vector is presented each clock cycle, taking $J$ clock cycles to complete a full computational cycle (1). The data flow is illustrated for a digital input series $x_n^{(j)}$ of $J = 16$ unary bits. Over $J$ clock cycles, the oversampling ADC integrates the partial products (6), producing a decimated output
$$Q_m^{(i)} \approx \sum_{j=0}^{J-1} \gamma_j Y_m^{(i,j)} \qquad (7)$$
where $\gamma_j = 1$ for unary coding of inputs. Decimation for a first-order delta-sigma modulator is achieved using a binary counter. Higher precision can be obtained in the same number of cycles $J$ by using a higher-order delta-sigma modulator topology. However, this drastically increases the implementation complexity. Instead, we use a modified topology that resamples the residue of the integrator after initial conversion [6]. A sample-and-hold resamples the residue voltage of the integrator and presents it to the modulator input for continued conversion at a finer scale. With a single resampling of the residue, the ∆Σ modulator obtains 8-bit effective resolution in 32 cycles.
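The counter-based decimation and the residue resampling step can be illustrated with an idealized behavioral model. The sketch below is a simplification under stated assumptions, not the circuit of [6]: it assumes a unipolar input normalized to a unit reference, an ideal integrator, and a single-bit quantizer with threshold at the reference. The function names are hypothetical.

```python
def first_order_delta_sigma(samples, reference=1.0):
    """Idealized first-order modulator with counter decimation.

    Each sample (assumed in [0, reference)) is added to the integrator; when
    the integrator reaches the reference, the single-bit quantizer fires, the
    reference is subtracted, and the binary counter increments.
    Returns (count, residue left on the integrator).
    """
    integrator = 0.0
    count = 0
    for s in samples:
        integrator += s
        if integrator >= reference:   # single-bit quantizer decision
            integrator -= reference   # feedback subtraction
            count += 1                # decimation by counting
    return count, integrator


def convert_with_residue_resampling(samples, cycles_per_pass=16):
    """Two-pass conversion: coarse count, then conversion of the held residue."""
    coarse, residue = first_order_delta_sigma(samples)
    # The sample-and-hold presents the residue to the modulator input
    # for a second pass at a finer scale.
    fine, _ = first_order_delta_sigma([residue] * cycles_per_pass)
    return coarse * cycles_per_pass + fine
```

In this model, two passes of 16 cycles each resolve 4 bits per pass, giving an 8-bit code in 32 cycles. A constant input of 100/256 applied for 16 cycles, for example, yields a coarse count of 6, a residue of 0.25, a fine count of 4, and a combined code of 100.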
D. VLSI Implementation
A mixed-signal VMM processor prototype integrated on a 3 mm × 3 mm die was fabricated in 0.5 µm CMOS technology. The chip contains an array of 256 × 128 CID/DRAM cells, and a row-parallel bank of 128 ∆Σ algorithmic ADCs. Fig. 3 depicts the micrograph and system floorplan of the chip. The layout size of the CID/DRAM cell is 18λ × 45λ with λ = 0.3 µm.
The processor interfaces externally in digital format. Two separate shift registers load the templates along odd and even columns of the DRAM array. Integrated refresh circuitry periodically updates the charge stored in the array to compensate for leakage. Vertical bit lines extend across the array, with two rows of sense amplifiers at the top and bottom of the array. The refresh alternates between even and odd columns, with separate select lines.
Fig. 4 shows the measured linearity of the computational array. For every shift in the input register, a computation is performed and the result is observed on the output sense line.
Fig. 3. Micrograph of the mixed-signal pattern recognition processor prototype, containing an array of 256 × 128 CID/DRAM cells and a row-parallel bank of 128 ∆Σ algorithmic ADCs. Die size is 3 mm × 3 mm in 0.5 µm CMOS technology.
The chip contains 128 row-parallel ∆Σ algorithmic ADCs, i.e., one dedicated ADC for each $m$ and $i$. In the present implementation, $Y_m$ is obtained off-chip by combining the ADC quantized outputs $Y_m^{(i)}$ over $i$ (rows) according to (4). The ∆Σ ADC yields 8-bit resolution over two subranging cycles of 4 bits each, for a total of 32 clock cycles [6]. Table I summarizes the measured performance. The CID/DRAM array dissipates 3.3 mW for a 10 µs computational cycle, and the bank of ∆Σ ADCs dissipates 2.6 mW, yielding a combined conversion rate of 12.8 Msamples/s at 8-bit resolution.
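The quoted power and rate figures are mutually consistent, as the following arithmetic shows. It uses only numbers stated in the text and the abstract, and assumes that each of the 128 ADCs completes one 8-bit conversion per 10 µs computational cycle.

```latex
\begin{align*}
P_{\text{total}} &= 3.3\,\text{mW} + 2.6\,\text{mW} = 5.9\,\text{mW} \\
f_{\text{conv}} &= \frac{128\ \text{conversions}}{10\,\mu\text{s}} = 12.8\ \text{Msamples/s} \\
\frac{6.5\ \text{GMACS}}{5.9\,\text{mW}} &\approx 1.1\ \text{GMACS/mW}
\end{align*}
```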
III. MINIMAL ACTIVITY VMM ARCHITECTURE
The oversampling architecture described in Section II-C maintains high throughput by combining all of the array computational cycles for a single bit-serial input within one delta-sigma modulated analog-to-digital conversion. In order to do so, the input is digitally oversampled by a binary-to-unary converter. The simplest $K$-bit binary-to-unary converter is a bank of $K$ latches (per input vector component) where the binary value stored in the $k$-th latch is presented to the output $2^k$ times, for $k = 0, \ldots, K-1$. Such conversion from the binary representation to the unary one preserves the number of bit-to-bit "0"-to-"1" and "1"-to-"0" logic level transitions in the bit-serial data stream. Dynamic power dissipated by the computational array is proportional to the number of such transitions as array input lines are driven to input vector coefficient $x_n$ values. The number of bit-to-bit transitions in each input vector component can range from 0 to $K$ (counting the transition to the next input) depending on the input data statistics. Array dynamic power can therefore be minimized by minimizing input bit-to-bit transitions.
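The latch-based converter described above can be sketched in a few lines. This is a behavioral illustration only; the LSB-first presentation order and the function names are assumptions made for the example.

```python
def binary_to_unary(value, K):
    """Unsorted unary stream: the bit in latch k is presented 2**k times.

    Bits are emitted LSB first (assumed order). The stream has 2**K - 1
    entries and contains exactly `value` ones.
    """
    stream = []
    for k in range(K):
        bit = (value >> k) & 1
        stream.extend([bit] * (2 ** k))
    return stream


def count_transitions(stream):
    """Number of adjacent positions where the logic level changes."""
    return sum(a != b for a, b in zip(stream, stream[1:]))
```

Because each latch output is simply repeated, the stream changes level only where adjacent binary digits differ, so the transition count of the unary stream equals that of the original binary word. For the 8-bit value 165 (binary 10100101), both counts are 6.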
The purpose of this Section is two-fold. First, real image data statistics are presented demonstrating their non-minimal bit-to-bit transition activity. Second, a technique for minimizing the array dynamic power dissipation through unary input data sorting is presented and validated on real image data.
A. Real Image Data Statistics
In artificial vision systems and interactive human-computer interfaces, input data are real images. This Section investigates bit-to-bit transition statistics of real images, using the Lena test image as an example. Fig. 5 depicts Lena's bit transition statistics for different binary resolutions, K. For K ≥ 3, LSB-to-(LSB-1) bit transitional probabilities (shown with the dashed line) are approximately 0.5. This bit transition Bernoulli probability distribution is due to the fact that real images have uncorrelated less significant bits, which are Bernoulli random variables themselves. Assuming independence, bit-to-bit "0"-to-"1" and "1"-to-"0" transitional probabilities are approximately equal to 0.25 each for those less significant bits. This yields approximately a 0.5 probability of a LSB-to-(LSB-1) transition for K ≥ 3. The cumulative bit transitional probability (shown with the solid line in Fig. 5) is greater than 0.5 due to higher correlation of the more significant bits.
Dynamic power dissipation of the computational array is proportional to the input switching activity. The simple statistical study above demonstrates that when computation is performed on real image data, such as Lena, on average input transitions happen at least K/2 times for K-bit inputs.
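The K/2 baseline for Bernoulli data can be confirmed by exhaustive enumeration. The sketch below assumes uniformly distributed, mutually independent K-bit words and counts the transition into the first bit of the next word. It models the idealized Bernoulli case only, not the Lena statistics, and the function name is hypothetical.

```python
from fractions import Fraction
from itertools import product


def mean_transitions_uniform(K):
    """Exact mean number of bit-to-bit transitions per K-bit word.

    Averages over all K-bit words and over the first bit of the following
    (independent) word, so K adjacent bit pairs are examined per word.
    """
    total = 0
    for word, next_bit in product(range(2 ** K), range(2)):
        bits = [(word >> k) & 1 for k in range(K)] + [next_bit]
        total += sum(a != b for a, b in zip(bits, bits[1:]))
    return Fraction(total, 2 ** (K + 1))
```

Each of the K adjacent bit pairs differs with probability 1/2 under these assumptions, so the function returns K/2, for instance 4 for K = 8.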
B. Unary Input Sorting
The unary nature of the input makes it possible to reduce the number of logic level transitions in its bit-serial sequence. Doing so reduces the dynamic power dissipation of the computational array proportionally, without affecting the computation results. All unary coefficients have the same weight (γ = 1), so their temporal order in the bit-serial input sequence in the oversampling architecture in Fig. 2 is not important. The number of input transitions can be reduced from K/2 (for Bernoulli input data) to two per input component by simple bit sorting.
Bit sorting is a computationally inexpensive operation requiring little overhead. One example of a 4-bit binary-to-sorted-unary converter implementation is depicted in Fig. 6. The 4-bit shift register is a part of the data pipeline and is reused as a counting bit sorter by adding a few extra gates and switches per input component. The overhead is insignificant as it scales linearly in the number of input dimensions.
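The effect of sorting on input line activity can be illustrated with a behavioral model of one input line. The sketch below is not the gate-level converter of Fig. 6. It assumes a unary word length of $2^K - 1$, and the `alternate` flag models reversing the sort direction for every other input word. All names are hypothetical.

```python
def sorted_unary(value, K, ascending=True):
    """Unary word of length 2**K - 1 with all ones grouped at one end."""
    length = 2 ** K - 1
    zeros, ones = [0] * (length - value), [1] * value
    return zeros + ones if ascending else ones + zeros


def transitions_in_stream(values, K, alternate=False):
    """Total logic level transitions on one input line over consecutive words.

    With alternate=False every word is sorted the same way; with
    alternate=True the sort direction flips for every other word.
    """
    stream = []
    for idx, v in enumerate(values):
        ascending = (idx % 2 == 0) if alternate else True
        stream.extend(sorted_unary(v, K, ascending))
    return sum(a != b for a, b in zip(stream, stream[1:]))
```

For the 8-bit sequence `[165, 90, 3, 200]`, fixed-direction sorting gives 7 transitions (one inside each word plus one at each of the three word boundaries), while alternating the direction removes the boundary transitions and leaves 4, one per word.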
The number of input transitions can be further reduced to one per input component by extending the bit sorter in Fig. 6 to alternate sorting up and down for subsequent input vectors. At negligible overhead in integration area and power dissipation, the binary-to-sorted-unary converter yields at least a factor of K/2 decrease in power dissipation. For 8-bit images, this corresponds to a four-fold gain in energy efficiency. The results are validated by simulating the gain in energy efficiency for both Bernoulli data and Lena as shown 
IV. CONCLUSIONS
A minimal switching activity oversampling charge-mode VLSI architecture for computing real-time linear transforms has been presented. An internally analog, externally digital architecture offers the best of both worlds: the density and energy efficiency of an analog VLSI array, and the convenience and versatility of a digital interface. A ∆Σ oversampled algorithmic ADC architecture relaxes precision requirements in the quantization and allows for input bit sorting for minimum switching activity.
A 256 × 128 cell prototype was fabricated in 0.5 µm CMOS. The combination of analog array processing, oversampled input encoding, and ∆Σ algorithmic analog-to-digital conversion delivers a computational throughput of over 1 GMACS per mW of power, while maintaining 8-bit effective digital resolution. Input bit sorting reduces the CID/DRAM dynamic power dissipation by a factor of four for 8-bit inputs.
