sions and background clutter in the surroundings, which usually necessitates a large number of training examples to generalize properly. The other is the excessive amount of computation incurred during training, and even at run time.
Support vector machines have been applied to visual object detection, with demonstrated success in face and pedestrian detection tasks [3], [4], [5], [6]. Unlike approaches to object detection that rely heavily on hand-crafted models and motion information, SVM-based systems learn the model of the object of interest from examples and work reliably in the absence of motion cues. To reduce the computational burden of real-time implementation to a level that can be accommodated with available hardware, a reduced set of features is selected from the data, which also results in a reduced number of support vectors [5]. The reduction in implementation necessarily comes at a loss in classification performance, a loss that is more severe for tasks of greater complexity.
The run-time computational load is dominated by evaluation of a kernel between the incoming vector and each of the support vectors. For a large class of permissible kernels, which include polynomial splines and radial kernels, this computation entails matrix-vector multiplication in large dimensions. For the pedestrian detection task in unconstrained environments [5] , highest detection at lowest false alarm is achieved for very large numbers (thousands) of input dimensions and support vectors, incurring millions of matrix multiply-accumulates (MAC) for each classification. The computation recurs at different positions and scales across each video frame.
The Kerneltron offers a factor of 100 to 10 000 improvement in computational efficiency (throughput per unit power) over the most advanced digital signal processors available today. It affords this level of efficiency at the expense of specificity: the very large-scale integration (VLSI) architecture is dedicated to massively parallel kernel computation [7]. Speed can be traded for power dissipation. Lower power is attractive in portable applications of kernel-based pattern recognition, such as visual aids for the blind [8].
Section II briefly summarizes feature extraction and SVM classification for object detection in streaming video. Section III describes the architecture and circuit implementation of the Kerneltron. Experimental results, scalability issues, training, and application examples are discussed in Section IV.
II. OBJECT DETECTION WITH SUPPORT VECTOR MACHINES
A support vector machine is trained with a data set of labeled examples. For pattern classification in images, relevant features are typically extracted from the training set examples using redundant spatial filtering techniques, such as overcomplete wavelet decomposition [4]. The classifier is trained on these feature vectors. At run time, images representing frames of streaming video are scanned by moving windows of different dimensions. For every unit shift of a moving window, a wavelet feature vector is computed and presented to the SVM classifier to produce a decision. The general block diagram of such a system is outlined in Fig. 1. A brief functional description of the major components follows.
A. Overcomplete Wavelet Decomposition
An overcomplete wavelet basis enables the system to handle complex shapes and achieve a precise description of the object class at adequate spatial resolution for detection [4]. The transformation of the sensory data $\mathbf{S}$ into the feature vector $\mathbf{X}$ is of the linear form
$$X_n = \sum_{k=1}^{K} A_{nk} S_k, \qquad n = 1, \ldots, N \tag{1}$$
where the wavelet coefficients $A_{nk}$ form an overcomplete basis, i.e., $N > K$. In visual object detection, overcomplete Haar wavelets have been successfully used on pedestrian and face detection tasks [4], [5]. Haar wavelets are attractive because they are robust and particularly simple to compute, with coefficients that are either $+1$ or $-1$.
B. Support Vector Classification
Classification of the wavelet transformed features is performed by an SVM [1] . From a machine learning theoretical perspective [2] , the appealing characteristics of SVMs are as follows.
1) The learning technique generalizes well even with relatively few data points in the training set, and bounds on the generalization error can be directly estimated from the training data.
2) The only parameter that needs tuning is a penalty term for misclassification which acts as a regularizer [9] and determines a tradeoff between resolution and generalization performance [10] .
3) The algorithm finds, under general conditions, a unique separating decision surface that maximizes the margin of the classified training data for best out-of-sample performance.
SVMs express the classification or regression output in terms of a linear combination of examples in the training data, in which only a fraction of the data points, called "support vectors," have nonzero coefficients. The support vectors thus capture all the relevant data contained in the training set. In its basic form, an SVM classifies a pattern vector $\mathbf{X}$ into class $y \in \{-1, +1\}$ based on the support vectors $\mathbf{X}_m$ and corresponding classes $y_m$ as
$$y = \mathrm{sign}\left(\sum_{m=1}^{M} \alpha_m y_m K(\mathbf{X}_m, \mathbf{X}) + b\right) \tag{2}$$
where $K(\cdot,\cdot)$ is a symmetric positive-definite kernel function which can be freely chosen subject to fairly mild constraints [1]. The parameters $\alpha_m$ and $b$ are determined by a linearly constrained quadratic programming (QP) problem [2], [11], which can be efficiently implemented by means of a sequence of smaller scale, subproblem optimizations [3], or an incremental scheme that adjusts the solution one training point at a time [12]. Most of the training data have zero coefficients $\alpha_m$; the nonzero coefficients returned by the constrained QP optimization define the support vector set. In what follows we assume that the set of support vectors and coefficients $\alpha_m$ are given, and we concentrate on efficient run-time implementation of the classifier.
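As a rough illustration of the run-time decision rule (2), the following Python sketch evaluates a weighted sum of kernels over a toy support vector set. The function names, the quadratic polynomial kernel, and all numeric values are assumptions chosen only for illustration; they are not taken from the classifiers in [4], [5].

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(u, v, degree=2):
    """Inner-product based kernel f(u . v) with f(z) = (1 + z)^degree (an assumed choice)."""
    return (1.0 + dot(u, v)) ** degree

def svm_classify(x, support_vectors, labels, alphas, b, kernel):
    """Return +1 or -1 from sign(sum_m alpha_m * y_m * K(x_m, x) + b)."""
    total = b
    for x_m, y_m, alpha_m in zip(support_vectors, labels, alphas):
        total += alpha_m * y_m * kernel(x_m, x)
    return 1 if total >= 0 else -1

# Toy support vector set (hypothetical values, not trained on real data)
support_vectors = [[1.0, 1.0], [-1.0, -1.0]]
labels = [1, -1]
alphas = [0.5, 0.5]
b = 0.0

print(svm_classify([2.0, 1.0], support_vectors, labels, alphas, b, poly_kernel))
print(svm_classify([-2.0, -1.0], support_vectors, labels, alphas, b, poly_kernel))
```

The loop over support vectors is the part that dominates run-time cost when the number of support vectors and the input dimension are both in the thousands.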
Several widely used classifier architectures reduce to special valid forms of kernels $K(\cdot,\cdot)$, like polynomial classifiers, multilayer perceptrons, and radial basis functions [14]. The following forms are frequently used:
1) inner-product based kernels (e.g., polynomial; sigmoidal connectionist):
$$K(\mathbf{X}_m, \mathbf{X}) = f(\mathbf{X}_m \cdot \mathbf{X}) = f\left(\sum_{n=1}^{N} X_{mn} X_n\right) \tag{3}$$
2) radial basis functions ($L_2$ norm distance based):
$$K(\mathbf{X}_m, \mathbf{X}) = f(\|\mathbf{X}_m - \mathbf{X}\|) = f\left(\left(\sum_{n=1}^{N} |X_{mn} - X_n|^2\right)^{1/2}\right) \tag{4}$$
where $f(\cdot)$ is a monotonically nondecreasing scalar function subject to the Mercer condition on $K(\cdot,\cdot)$ [2], [9]. With no loss of generality, we concentrate on kernels of the inner-product type (3), and devise an efficient scheme of computing a large number of high-dimensional inner-products in parallel. Computationally, the inner-products comprise the most intensive part in evaluating kernels of both types (3) and (4). Indeed, radial basis functions (4) can be expressed in inner-product form
$$f(\|\mathbf{X}_m - \mathbf{X}\|) = f\left(\left(-2\,\mathbf{X}_m \cdot \mathbf{X} + \|\mathbf{X}_m\|^2 + \|\mathbf{X}\|^2\right)^{1/2}\right) \tag{5}$$
where the last two terms depend only on either the input vector or the support vector. These common terms are of much lower complexity than the inner-products, and can be easily precomputed or stored in peripheral registers.
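The reduction of the radial form (4) to inner-products in (5) can be checked numerically. The sketch below is a minimal check of the squared-distance identity with made-up vectors; the function names are hypothetical, and the scalar function applied afterward is left to the caller.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def squared_distance_direct(x_m, x):
    """Squared L2 distance computed component by component."""
    return sum((a - b) ** 2 for a, b in zip(x_m, x))

def squared_distance_from_dot(inner_product, norm_sq_m, norm_sq_x):
    """Identity behind (5): |x_m - x|^2 = -2 x_m.x + |x_m|^2 + |x|^2."""
    return -2 * inner_product + norm_sq_m + norm_sq_x

x_m = [1, 2, 3]              # a support vector (toy values)
x = [4, 6, 3]                # an input vector (toy values)
norm_sq_m = dot(x_m, x_m)    # can be precomputed once per support vector
norm_sq_x = dot(x, x)        # computed once per input, shared by all support vectors
via_dot = squared_distance_from_dot(dot(x_m, x), norm_sq_m, norm_sq_x)
print(via_dot, squared_distance_direct(x_m, x))
```

Only the `dot(x_m, x)` term has to be evaluated for every pair of input and support vector, which is why an inner-product engine serves both kernel types.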
The computation of the inner-products takes the form of matrix-vector multiplication (MVM), $\sum_{n=1}^{N} X_{mn} X_n$, $m = 1, \ldots, M$, where $M$ is the number of support vectors. For large scale problems such as the ones of interest here, the dimensions of the matrix $X_{mn}$ are excessive for real-time implementation even on a high-end processor. As a point of reference, consider the pedestrian and face detection task in [5], for which the feature vector length $N$ is 1326 wavelets per instance, and the number of support vectors $M$ is in excess of 4000. To cover the visual field over the entire scanned image at reasonable resolution (500 image window instances through a variable resolution search method) at video rate (30 frames/s), a computational throughput of roughly $8 \times 10^{10}$ multiply-and-accumulate operations/s ($1326 \times 4000 \times 500 \times 30$) is needed. The computational requirement can be relaxed through simplifying and further optimizing the SVM architecture for real-time operation, but at the expense of classification performance [4], [5].
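The throughput requirement follows directly from the figures quoted above. The short Python calculation below multiplies them out; it takes the number of support vectors as exactly 4000, although the text gives this only as a lower bound, so the result is a lower bound as well.

```python
feature_dim = 1326            # wavelet features per window instance [5]
num_support_vectors = 4000    # "in excess of 4000", taken here as exactly 4000
windows_per_frame = 500       # image window instances per frame
frames_per_second = 30        # video rate

macs_per_classification = feature_dim * num_support_vectors
macs_per_second = macs_per_classification * windows_per_frame * frames_per_second

print(macs_per_classification)   # multiply-accumulates for one window
print(macs_per_second)           # multiply-accumulates per second at video rate
```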
III. KERNELTRON: MASSIVELY PARALLEL VLSI KERNEL MACHINE

A. Core Recognition VLSI Processor
At the core of the system is a recognition engine, which efficiently implements kernel-based algorithms, such as SVMs, for general pattern detection and classification. The implementation focuses on inner-product computation in a parallel architecture.
Both wavelet and SVM computations are most efficiently implemented on the same chip, in a scalable VLSI architecture as illustrated schematically in Fig. 2. The diagram is the floorplan of the Kerneltron, with matrices projected as two-dimensional (2-D) arrays of cells, and input and output vector components crossing in perpendicular directions alternating from one stage to the next. This style of scalable architecture also supports the integration of learning functions, through local outer product parameter updates [13], compatible with the recently developed incremental SVM learning rule [12]. The architecture maintains a low input-output data rate. Digital inputs are fed into the processor through a properly sized serial/parallel converter shift register. A unit shift of a scanning moving window in an image corresponds to one shift of a new pixel per classification cycle, while a single scalar decision is produced at the output.
The classification decision is obtained in the digital domain by thresholding the weighted sum of kernels. The kernels are obtained by mapping the inner-products through the function $f(\cdot)$ stored in a lookup table. By virtue of the inner-product form of the kernel, the computation can be much simplified without affecting the result. Since both the wavelet feature extraction and the inner-product computation represent linear transformations, they can be collapsed into a single linear transformation by multiplying the two matrices
$$\mathbf{X}_m \cdot \mathbf{X} = \sum_{n=1}^{N} X_{mn} \sum_{k=1}^{K} A_{nk} S_k = \sum_{k=1}^{K} W_{mk} S_k, \qquad W_{mk} = \sum_{n=1}^{N} X_{mn} A_{nk} \tag{6}$$
where $X_{mn}$ are the components of support vector $\mathbf{X}_m$, $A_{nk}$ are the wavelet coefficients, and $S_k$ are the sensory data. Therefore, the architecture can be simplified to one that omits the (explicit) wavelet transformation, and instead transforms the support vectors. For simplicity of the argument, we proceed with the inner-product architecture excluding the overcomplete wavelet feature extraction stage, bearing in mind that the approach extends to include wavelet extraction by merging the two matrices.
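The collapse of the two linear stages into one matrix can be verified on a small example. In the sketch below the wavelet matrix, support vectors, and input data are invented toy values (with $\pm 1$ wavelet coefficients and more features than pixels), and the helper names are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]

def matmul(left, right):
    """Multiply two matrices given as lists of rows."""
    columns = list(zip(*right))
    return [[sum(a * b for a, b in zip(row, col)) for col in columns] for row in left]

A = [[1, 1], [1, -1], [-1, 1]]        # toy overcomplete transform: 3 features from 2 pixels
support = [[1, 2, 3], [0, -1, 2]]     # two toy support vectors in feature space
S = [3, 5]                            # toy sensory input

two_stage = matvec(support, matvec(A, S))   # wavelet transform, then inner products
W = matmul(support, A)                      # support vectors transformed once, offline
one_stage = matvec(W, S)                    # single matrix-vector product at run time

print(two_stage, one_stage, W)
```

The merged matrix `W` has as many columns as there are raw inputs, so the run-time array never needs to form the wavelet features explicitly.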
B. Internally Analog, Externally Digital Computation
Computing inner-products between an input vector $\mathbf{X}$ and template vectors $\mathbf{W}_m$ in parallel is equivalent to the operation of matrix-vector multiplication (MVM)
$$Y_m = \sum_{n=1}^{N} W_{mn} X_n \tag{7}$$
with $N$-dimensional input vector $X_n$, $M$-dimensional output vector $Y_m$, and $M \times N$ matrix of coefficients $W_{mn}$. The matrix elements $W_{mn}$ denote the support vectors $X_{mn}$, or the wavelet transformed support vectors (6), for convenience of notation. The approach combines the computational efficiency of analog array processing with the precision of digital processing and the convenience of a programmable and reconfigurable digital interface. The digital representation is embedded in the analog array architecture, with matrix elements stored locally in bit-parallel form
$$W_{mn} = \sum_{i=0}^{I-1} 2^{-i-1} w_{mn}^{(i)} \tag{8}$$
and inputs presented in bit-serial fashion
$$X_n = \sum_{j=0}^{J-1} \gamma_j x_n^{(j)} \tag{9}$$
where the coefficients $\gamma_j$ are assumed in radix two, depending on the form of input encoding used. The MVM task (7) then decomposes into
$$Y_m = \sum_{i=0}^{I-1} 2^{-i-1} Y_m^{(i)} \tag{10}$$
with MVM partials
$$Y_m^{(i)} = \sum_{j=0}^{J-1} \gamma_j Y_m^{(i,j)} \tag{11}$$
$$Y_m^{(i,j)} = \sum_{n=1}^{N} w_{mn}^{(i)} x_n^{(j)}. \tag{12}$$
The binary-binary partial products (12) are conveniently computed and accumulated, with zero latency, using an analog MVM array [15], [16], [17], [18]. For this purpose, we developed a 1-bit multiply-and-accumulate CID/DRAM cell.
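The decomposition of the MVM into binary-binary partial products can be sketched in software as follows. This is a behavioral illustration only, not the chip's data path: it assumes unsigned fractional codes with the most significant bit first, binary-weighted input coefficients of $2^{-(j+1)}$, and hypothetical function names. Exact rational arithmetic is used so the bitwise and direct results can be compared for equality.

```python
from fractions import Fraction

def bits_msb_first(code, width):
    """Bits of an unsigned integer code, most significant bit first."""
    return [(code >> (width - 1 - k)) & 1 for k in range(width)]

def mvm_bitwise(w_codes, x_codes, I, J):
    """MVM assembled from 1-bit partial products, following (10)-(12)."""
    x_bits = [bits_msb_first(c, J) for c in x_codes]        # x_bits[n][j]
    outputs = []
    for row in w_codes:
        w_bits = [bits_msb_first(c, I) for c in row]        # w_bits[n][i]
        y_m = Fraction(0)
        for i in range(I):
            y_i = Fraction(0)
            for j in range(J):
                # (12): AND of stored bit and input bit, summed along the row
                partial = sum(w_bits[n][i] & x_bits[n][j] for n in range(len(row)))
                y_i += Fraction(partial, 2 ** (j + 1))       # (11), assumed binary weights
            y_m += y_i / 2 ** (i + 1)                        # (10)
        outputs.append(y_m)
    return outputs

def mvm_direct(w_codes, x_codes, I, J):
    """Reference MVM on the fractional values the codes represent."""
    return [sum(Fraction(w, 2 ** I) * Fraction(x, 2 ** J) for w, x in zip(row, x_codes))
            for row in w_codes]

w_codes = [[5, 3, 7], [0, 6, 1]]   # toy 3-bit matrix codes
x_codes = [9, 4, 15]               # toy 4-bit input codes
print(mvm_bitwise(w_codes, x_codes, 3, 4) == mvm_direct(w_codes, x_codes, 3, 4))
```

In the array, the innermost sum over `n` is what a row of cells accumulates in one step; the outer weighting over `i` and `j` is done after quantization.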
C. CID/DRAM Cell and Array
The unit cell in the analog array combines a charge injection device (CID) [19] computational element [17], [18] with a DRAM storage element. The cell stores one bit $w_{mn}^{(i)}$ of a matrix element $W_{mn}$, performs a one-quadrant binary-unary (or binary-binary) multiplication of $w_{mn}^{(i)}$ and the input bit $x_n^{(j)}$ in (12), and accumulates the result across cells with common $m$ and $i$ indexes. The circuit diagram and operation of the cell are presented in Fig. 3(a). An active charge transfer from M2 to M3 can only occur if there is nonzero charge stored, and if the potential on the gate of M2 drops below that of M3 [17]. The cell performs nondestructive computation since the transferred charge is sensed capacitively at the output. Thus, an array of cells performs (unsigned) binary-unary multiplication (12) of a matrix with elements $w_{mn}^{(i)}$ and a vector with elements $x_n^{(j)}$, yielding $Y_m^{(i,j)}$ for values of $m$ and $i$ in parallel across the array and values of $j$ in sequence over time. A $256 \times 128$ array prototype using CID/DRAM cells is shown in Fig. 3(b).
To improve linearity and to reduce sensitivity to clock feedthrough, we use differential encoding of input and stored bits in the CID/DRAM architecture, using twice the number of columns and unit cells, as shown in Fig. 4(a). This amounts to exclusive-OR (XOR), rather than AND, multiplication on the analog array, using signed, rather than unsigned, binary values for inputs and weights, $x_n^{(j)} = \pm 1$ and $w_{mn}^{(i)} = \pm 1$. In principle, the MVM partials (12) can be quantized by a bank of flash analog-to-digital converters (ADCs), and the results accumulated in the digital domain according to (11) and (10) to yield a digital output resolution exceeding the analog precision of the array and the quantizers [20]. Alternatively, an oversampling ADC accumulates the sum (11) in the analog domain, with inputs encoded in unary format (input coefficients $\gamma_j = 1$). This avoids the need for high-resolution flash ADCs, which are replaced with single-bit quantizers in the delta-sigma loop.
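The equivalence between signed multiplication and XOR on the stored bits can be illustrated with a few lines of Python. The sketch assumes the mapping bit 1 to $+1$ and bit 0 to $-1$, which is a convention chosen here for illustration; it models only the arithmetic, not the doubled columns or the charge-domain circuit.

```python
def signed_dot_direct(w_bits, x_bits):
    """Inner product after mapping each bit b to the signed value 2b - 1."""
    return sum((2 * w - 1) * (2 * x - 1) for w, x in zip(w_bits, x_bits))

def signed_dot_via_xor(w_bits, x_bits):
    """Same inner product from XOR: each mismatch contributes -1, each match +1."""
    mismatches = sum(w ^ x for w, x in zip(w_bits, x_bits))
    return len(w_bits) - 2 * mismatches

w_bits = [1, 0, 1, 1]   # toy stored bits
x_bits = [1, 1, 0, 1]   # toy input bits
print(signed_dot_direct(w_bits, x_bits), signed_dot_via_xor(w_bits, x_bits))
```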
D. Oversampling Mixed-Signal Array Processing
The precision of computation is limited by the resolution of the analog-to-digital converters (ADCs) digitizing the analog array outputs. The conventional delta-sigma ADC design paradigm relaxes the precision requirements on the analog circuits while attaining high resolution of conversion, at the expense of bandwidth. In the presented architecture, a high conversion rate is maintained by combining the delta-sigma ADC with oversampled encoding of the digital inputs, where the delta-sigma modulator integrates the partial multiply-and-accumulate outputs (12) from the analog array according to (11). Fig. 5 depicts one row of matrix elements $W_{mn}$ in the oversampling architecture, encoded in $I$ bit-parallel rows of CID/DRAM cells. One bit of a unary-coded input vector is presented each clock cycle, taking $J$ clock cycles to complete a full computational cycle (7). The data flow is illustrated for a digital input series $x_n^{(j)}$ of $J$ unary bits. Over $J$ clock cycles, the oversampling ADC integrates the partial products (12), producing a decimated output
$$Q_m^{(i)} \approx \sum_{j=0}^{J-1} \gamma_j Y_m^{(i,j)} \tag{13}$$
where $\gamma_j = 1$ for unary coding of inputs. Decimation for a first-order delta-sigma modulator is achieved using a binary counter.
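A behavioral sketch of a first-order incremental delta-sigma converter with a counter as decimator is given below. It is an idealized software model under stated assumptions: per-cycle inputs are normalized to the range 0 to 1 of full scale, the comparator threshold and feedback are exactly one full-scale unit, and the function name is hypothetical. It is not a model of the fabricated circuit.

```python
def incremental_delta_sigma(samples):
    """Integrate one sample per cycle, count full-scale overflows.

    Returns (count, residue), where count + residue equals the sum of the
    samples when each sample lies between 0 and 1.
    """
    integrator = 0.0
    count = 0
    for u in samples:
        integrator += u            # accumulate the partial product for this cycle
        if integrator >= 1.0:      # single-bit quantizer decision
            integrator -= 1.0      # feedback subtracts one full-scale unit
            count += 1             # binary counter performs the decimation
    return count, integrator

# Toy per-cycle partial products, normalized to full scale (hypothetical values)
samples = [0.25, 0.5, 0.75, 0.5, 0.125, 0.875, 0.25, 0.5]
print(incremental_delta_sigma(samples))
```

The count plays the role of the decimated output in (13); the residue left on the integrator is the quantization error that the counter alone does not resolve.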
E. Row-Parallel Algorithmic ADC
Higher precision can be obtained in the same number of cycles by using a higher order delta-sigma modulator topology. However, this drastically increases the implementation complexity. Instead, we use a modified topology, shown in Fig. 6, that resamples the residue of the integrator after initial conversion. A sample-and-hold resamples the residue voltage of the integrator and presents it to the modulator input for continued conversion at a finer scale. The principle is analogous to extended counting [21] but avoids additional hardware by reusing the same modulator to quantize the residue. Similar to residue resampling in an algorithmic (or cyclic) ADC, for each resampling the scale of conversion subranges to the LSB level of the previous conversion. For a first-order incremental ADC [22], resampling of the residue scales the range by a factor $L$, where $L$ is the number of modulation cycles. If $L$ is of radix two, i.e., $L = 2^l$, then the subranging is conveniently accomplished in the architecture of Fig. 6 by shifting the bits in the decimating counter by $l$ positions for every resampling of the residue.
Every resampling improves the output resolution by a factor $L$, the number of modulation cycles, or $\log_2 L$ bits, limited by noise and mismatch in the implementation. The effect of capacitance mismatch is minimized by using a ratio-insensitive scheme for resampling the residue [23]. The presented scheme is equivalent to an algorithmic ADC, but avoids interstage gain errors without the need for precisely ratioed analog components.
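The effect of residue resampling can be sketched behaviorally by feeding the residue of a first conversion back into the same idealized modulator for a second pass. The model below is an assumption-laden illustration: it uses an ideal sample-and-hold, reuses the same number of cycles for each pass, ignores noise and mismatch, and all names are hypothetical.

```python
def incremental_delta_sigma(samples):
    """Idealized first-order modulator with counter; returns (count, residue)."""
    integrator = 0.0
    count = 0
    for u in samples:
        integrator += u
        if integrator >= 1.0:
            integrator -= 1.0
            count += 1
    return count, integrator

def convert_with_resampling(samples, resamplings=1):
    """Convert, then requantize the held residue at a finer scale.

    Returns (code, residue); code / L**resamplings approximates sum(samples),
    where L is the number of modulation cycles per pass.
    """
    cycles = len(samples)
    code, residue = incremental_delta_sigma(samples)
    for _ in range(resamplings):
        # The held residue is presented as a constant input for another pass
        count, residue = incremental_delta_sigma([residue] * cycles)
        # Scaling by the cycle count is a bit shift when it is a power of two
        code = code * cycles + count
    return code, residue

samples = [0.25, 0.5, 0.75, 0.5, 0.125, 0.875, 0.25, 0.5]   # toy inputs, sum is 3.75
code, residue = convert_with_resampling(samples, resamplings=1)
print(code, code / len(samples), residue)
```

With eight cycles per pass, the first pass resolves only the integer part of the sum, and the second pass recovers three further bits from the residue.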
The resampling of the residue in the oversampled ADC can be combined with correspondingly rescaling the coefficients $\gamma_j$ in the input encoding. In principle, higher resolution digital inputs can be presented by unary encoding the input bits in groups, each group covering the modulation cycles of one conversion pass of the subranging oversampled ADC [23]. In Fig. 5, only the first four bits are unary encoded and presented in the first algorithmic cycle. With a single resampling of the residue, the modulator doubles the number of bits it resolves, at the cost of a second pass of modulation cycles. The final product is constructed in the digital domain according to (10). Additional gains in precision can be obtained by exploiting binomial statistics of binary terms in the analog summation (12) [24]. In the present scheme, this would entail stochastic encoding of the digital inputs prior to unary oversampled encoding.
IV. EXPERIMENTAL RESULTS AND DISCUSSION
A. Measured Performance
A prototype Kerneltron was integrated on a 3 mm $\times$ 3 mm die and fabricated in 0.5 µm CMOS technology. The chip contains an array of $256 \times 128$ CID/DRAM cells, and a row-parallel bank of 128 algorithmic ADCs. Fig. 3(b) depicts the micrograph and system floorplan of the chip.
The processor interfaces externally in digital format. Two separate shift registers load the templates (support vectors) along odd and even columns of the DRAM array. Integrated refresh circuitry periodically updates the charge stored in the array to compensate for leakage. Vertical bit lines extend across the array, with two rows of sense amplifiers at the top and bottom of the array. The refresh alternates between even and odd columns, with separate select lines. Stored charge corresponding to matrix element values can also be read and shifted out from the chip for test purposes. All of the supporting digital clocks and control signals are generated on-chip.
Fig. 4(b) shows the measured linearity of the computational array, configured differentially for signed (XOR) multiplication. The case shown is where all complementary weight storage elements are actively set, and an alternating sequence of bits in blocks is shifted through the input register. For every shift in the input register, a computation is performed and the result is observed on the output sense line. The array dissipates 3.3 mW for a 10 µs cycle time. The bank of ADCs dissipates 2.6 mW, yielding a combined conversion rate of 12.8 Msamples/s. Table I summarizes the measured performance.
Fig. 7 compares template matching performed by a floating point processor and by the Kerneltron, illustrating the effect of quantization and limited precision in the analog array architecture. An "eye" template was selected as a $16 \times 16$ fragment from the Lena image, yielding a 256-dimensional vector. Fig. 7(c) depicts the two-dimensional convolution (inner-products over a sliding window) of the 8-bit image with the 8-bit template computed with full precision. The same computation performed by the Kerneltron, with 4-bit quantization of the image and template and 8-bit quantization of the output, is given in Fig. 7(d).
Differences are relatively small, and both methods return peak inner-product values (top matches) at both eye locations in the image (see footnote 5). The template matching operation is representative of a support vector machine that combines nonlinearly transformed inner-products to identify patterns of interest.
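The robustness of peak location to coarse quantization can be illustrated on a one-dimensional toy signal. The sketch below is not the Lena experiment of Fig. 7: the signal and template values are invented, quantization to 4 bits is modeled by dropping the four least significant bits of 8-bit values, and the output quantization performed on the chip is not modeled.

```python
def sliding_inner_products(signal, template):
    """Inner product of the template with every window of the signal."""
    span = len(template)
    return [sum(s * t for s, t in zip(signal[p:p + span], template))
            for p in range(len(signal) - span + 1)]

def quantize_to_4_bits(values):
    """Keep the four most significant bits of each 8-bit value."""
    return [v >> 4 for v in values]

def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=values.__getitem__)

signal = [10, 20, 200, 50, 200, 30, 10]   # toy 8-bit signal containing the pattern
template = [200, 50, 200]                 # toy 8-bit template

full = sliding_inner_products(signal, template)
coarse = sliding_inner_products(quantize_to_4_bits(signal), quantize_to_4_bits(template))
print(full, argmax(full))
print(coarse, argmax(coarse))
```

In this toy case the two correlation profiles differ in scale but peak at the same window position.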
C. Large-Scale Computation
The design is fully scalable, and can be expanded to any number of input features and support vectors internally as limited by current fabrication technology, and externally by tiling chips in parallel.
The dense CID/DRAM multiply-and-accumulate cell, whose dimensions scale with the technology scaling parameter $\lambda$, supports the integration of millions of cells on a single chip in deep submicron technology, for thousands of support vectors in thousand dimensional input space as the line-width of the fabrication technology continues to shrink. The quantizer area overhead is less than 75% and becomes insignificant with larger array sizes for the same output resolution. In 0.18 µm CMOS technology, 64 computational arrays with $256 \times 128$ cells each can be tiled on an 8 mm $\times$ 8 mm silicon area, with two million cells integrated on a single chip.
Distribution of memory and processing elements in a fine-grain multiply-and-accumulate architecture, with local bit-parallel storage of the coefficients, avoids the memory bandwidth problem that plagues the performance of CPUs and DSPs. Because of fine-grain parallelism, both throughput and power dissipation scale linearly with the number of integrated elements, so every cell contributes one kernel unit operation and one fixed unit of dissipated energy per computational cycle. Let us assume a conservative cycle time of 10 µs. With two million cells, this gives a computational throughput of 200 GOPS, which is adequate for the task described in Section II-B. The (dynamic) power dissipation is estimated to be less than 50 mW, which is significantly lower than that of a CPU or DSP processor even though the computational throughput is many orders of magnitude higher.
Footnote 5: The template acts as a spatial filter on the image, leaking through spectral components of the image at the output. The Lena image was mean-subtracted.
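The scaling estimate can be reproduced with simple arithmetic. The calculation below assumes a cycle time of 10 microseconds (the reading consistent with two million cells giving 200 GOPS), counts one operation per cell per cycle, and treats the 50 mW figure as an upper bound on power, so the efficiency it prints is a lower bound.

```python
arrays_per_chip = 64
rows, columns = 256, 128
cycles_per_second = 100_000        # assumed 10 microsecond cycle time
power_mw = 50                      # upper bound on estimated dynamic power

cells = arrays_per_chip * rows * columns
ops_per_second = cells * cycles_per_second       # one operation per cell per cycle
gops = ops_per_second / 1e9
gops_per_mw = gops / power_mw

print(cells)          # about two million cells
print(gops)           # about 200 GOPS
print(gops_per_mw)    # about 4 GOPS per milliwatt
```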
D. Training
Training of a support vector machine entails a quadratic programming problem of dimensions square in the number of data points. In principle, the training can be formulated as a constrained Hopfield neural network, with a natural analog circuit implementation [25] . The problem with this approach is that the area of the implementation scales with the square of the number of data points, which becomes impractical for very large data sets or a real-time (online) setting.
The incremental SVM learning approach in [12] allows on-line training of the Kerneltron with minimal overhead in implementation resources. Every misclassified training vector is stored as a (margin or error) support vector in the array, and the corresponding (nonzero) coefficient is computed using a recursive matrix operation of dimensions square in the number of margin (not error) support vectors. Since the number of margin vectors is usually very small compared with the number of training vectors, computational savings can be significant. Recursive computation of the coefficient is conveniently implemented off-chip, using the inner-products computed efficiently on the array.
The limited number of templates on the Kerneltron requires a trimming scheme to eliminate (most) inactive support vectors to make room for new support vectors. Estimations of the coefficients can further be simplified for integrated implementation as reported in [26] . A decomposition algorithm such as [27] or [28] offers an equally efficient realization in hardware, but requires multiple passes through the data for proper convergence of the coefficients.
E. Applications
The Kerneltron benefits real-time applications of object detection and recognition, particularly in artificial vision and human-computer interfaces. Applications extend from SVMs to any pattern recognition architecture that relies on computing a kernel distance between an input and a large set of templates in large dimensions.
Besides throughput, power dissipation is a main concern in portable and mobile applications. Power efficiency can be traded for speed, and a reduced implementation of dimensions similar to the version of the pedestrian classifier running on a Pentium PC (27 input features) [4], [5] could be integrated on a chip running at 100 µW of power, easily supported with a hearing aid type battery for a lifetime of several weeks.
One low-power application that could benefit a large group of users is a navigational aid for visually impaired people. OpenEyes, a system developed for this purpose [8], currently runs a classifier in software on a Pentium PC. The software solution offers great flexibility to the user and developer, but limits the mobility of the user. The Kerneltron offers the prospect of a low-weight, low-profile alternative.
V. CONCLUSION
A massively parallel mixed-signal VLSI processor for kernel-based pattern recognition in very high dimensions has been presented. Besides support vector machines, the processor is capable of implementing other architectures that make intensive use of kernels or template matching. An internally analog, externally digital architecture offers the best of both worlds: the density and energetic efficiency of a charge-mode analog VLSI array, and the convenience and versatility of a digital interface.
An oversampling configuration relaxes precision requirements in the quantization while maintaining 8-bit effective output resolution, adequate for most vision tasks. Higher resolution, if desired, can be obtained through stochastic encoding of the digital inputs [24] .
A $256 \times 128$ cell prototype was fabricated in 0.5 µm CMOS. The combination of analog array processing, oversampled input encoding, and delta-sigma analog-to-digital conversion yields a computational throughput of over 1 GMACS per milliwatt of power. The architecture is scalable and capable of delivering 200 GOPS at 50 mW of power in a 0.18 µm technology, a level of throughput and efficiency suitable for real-time SVM detection of complex objects on a portable platform.
