A recent trend in low power design has been the employment of reduced precision processing methods for decreasing arithmetic activity and average power dissipation. Such designs can trade off power and arithmetic precision as system requirements change. This work explores the potential of Distributed Arithmetic (DA) computation structures for low power precision-on-demand computation. We present two proof-of-concept VLSI implementations whose power dissipation changes according to the precision of the computation performed.
Introduction and Background
A recent trend in low power design has been the employment of reduced precision "approximate processing" methods for reducing arithmetic activity and average chip power dissipation. Such designs treat power and arithmetic precision as system parameters that can be traded off against each other on an ad hoc basis. Ludwig et al. [1] have demonstrated an approximate filtering technique which dynamically reduces the filter order based on the input data characteristics.
More specifically, the number of taps of a frequency-selective FIR filter is dynamically varied based on the estimated stopband energy of the input signal. The resulting stopband energy of the output signal is always kept under a predefined threshold. This technique results in power savings of a factor of 6 for speech inputs. Larsson and Nicol [2] [3] have demonstrated an adaptive scheme for dynamically reducing the input amplitude of a Booth-encoded multiplier to the lowest acceptable precision level in an adaptive digital equalizer. Their scheme simply involves an arithmetic shift (multiplication/division by a power of 2) of the mul- [7] [8]. When used appropriately it features stochastically monotonic successive approximation properties. In this work, we present the theory behind Distributed Arithmetic and its approximate processing properties. We also present two proof-of-concept VLSI implementations, a heartbeat classifier and a DCT core processor, whose power dissipation characteristics change on the fly according to the precision of the computation performed.
Distributed Arithmetic
Distributed Arithmetic (DA) [4] [5] is a bit-serial operation that computes the inner product of two vectors (one of which is a constant) in parallel. Its main advantages are the efficiency of mechanization and the fact that no multiply operations are necessary. DA has an inherent bit-serial nature, but this disadvantage can be completely hidden if the number of bits in each variable vector coefficient is equal or similar to the number of elements in each vector.
As an example of DA mechanization let us consider the computation of the following inner (dot) product of M-dimensional vectors a and x, where a is a constant vector:
$$y = \sum_{k=1}^{M} a_k x_k \qquad (1)$$
Let us further assume that each vector element $x_k$ is an N-bit two's complement binary number and can be represented as
$$x_k = -b_{k(N-1)}\, 2^{N-1} + \sum_{i=0}^{N-2} b_{ki}\, 2^{i} \qquad (2)$$
where $b_{ki} \in \{0,1\}$ is the ith bit of vector element $x_k$. Please note that $b_{k0}$ is the least significant bit (LSB) of $x_k$ and $b_{k(N-1)}$ is the sign bit. Substituting eq. 2 in eq.
ROM. The variable vector X is repackaged to form the ROM address most significant bit first. We have assumed that the X_i elements are 4-bit two's complement numbers (bit 3 is the sign bit). Every clock cycle the RESULT register adds 2x its previous value (initially reset to zero) to the current ROM contents. Moreover, each cycle the four registers that hold the four elements of the X vector are shifted to the right. The sign timing pulse T_s is activated when the ROM is addressed by bit 3 of the vector elements (the sign). In this case the adder subtracts the current ROM contents from the accumulator state. After four cycles (the bitwidth of the X_i elements) the dot product has been produced within the RESULT register.
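The mechanization described above can be illustrated with a short software model. This is a minimal sketch rather than the circuit itself: the function names `build_da_rom` and `da_dot` are hypothetical, and the ROM contents are assumed to follow the usual DA convention in which the word at each address is the sum of the constant coefficients selected by the address bits.

```python
def build_da_rom(coeffs):
    """Assumed ROM layout: word at address A = sum of coeffs[k] for each set bit k of A."""
    m = len(coeffs)
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << m)]


def da_dot(coeffs, xs, nbits):
    """MSB-first DA dot product; xs must fit in nbits-bit two's complement."""
    rom = build_da_rom(coeffs)
    acc = 0  # models the RESULT register, reset to zero
    for bit in range(nbits - 1, -1, -1):
        # Form the ROM address from bit `bit` of every vector element.
        addr = 0
        for k, x in enumerate(xs):
            addr |= ((x >> bit) & 1) << k
        word = rom[addr]
        if bit == nbits - 1:
            acc = 2 * acc - word  # sign cycle: subtract the ROM contents
        else:
            acc = 2 * acc + word  # regular cycle: 2x previous value plus ROM contents
    return acc


# Four 4-bit elements, as in the example in the text (values are illustrative).
result = da_dot([3, -1, 4, 2], [5, -8, 7, -3], nbits=4)
```

The loop runs once per input bit, so the number of iterations depends on the bitwidth of the elements and not on the vector length, and no multiplication by a coefficient is ever performed.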
Successive Approximation Using Distributed Arithmetic
In this section, we show that when the Distributed Arithmetic operation is performed MSB first, it exhibits stochastically monotonic successive approximation properties.
We model $q_n$ as experimental values of a discrete random variable q. The underlying stochastic experiment is random accesses of the DA coefficient ROM in the presence of random inputs. The experimental values of q are the DA ROM contents. The first and second order statistics of the error term $e_i$ are:
where equations 11 and 16 have been computed under the assumption that the least significant bits $b_{kn}$ (i large) are independent identically distributed random variables uniformly distributed over {0, 1}. This is a valid assumption for input DSP data [9] [10]. The fact that equations 11 and 16 are monotonically decreasing functions of i (RAC cycles) shows the successive approximation property (in probabilistic terms) of the Distributed Arithmetic mechanization.
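The successive approximation behaviour can be checked numerically on a small case. The sketch below is an illustration under stated assumptions, not a reproduction of equations 11 and 16: it stops an MSB-first DA computation after a given number of cycles, rescales the partial result, and averages the squared error over every possible input, which makes all input bits exactly independent and uniform. The names `da_partial` and `mean_square_error` and the coefficient values are hypothetical.

```python
from itertools import product


def da_partial(coeffs, xs, nbits, cycles):
    """Run only the first `cycles` MSB-first DA iterations, then rescale."""
    acc = 0
    for step in range(cycles):
        bit = nbits - 1 - step
        # Equivalent to a ROM lookup: sum the coefficients whose input bit is set.
        word = sum(c for c, x in zip(coeffs, xs) if (x >> bit) & 1)
        acc = 2 * acc + (-word if step == 0 else word)  # first cycle is the sign cycle
    return acc << (nbits - cycles)  # align with the full-precision result


def mean_square_error(coeffs, nbits, cycles):
    """Mean squared error over all nbits-bit two's complement input vectors."""
    lo, hi = -(1 << (nbits - 1)), 1 << (nbits - 1)
    total = count = 0
    for xs in product(range(lo, hi), repeat=len(coeffs)):
        exact = sum(c * x for c, x in zip(coeffs, xs))
        err = exact - da_partial(coeffs, xs, nbits, cycles)
        total += err * err
        count += 1
    return total / count


# Error after 1, 2, 3 and 4 cycles for three illustrative coefficients.
errors = [mean_square_error([3, -1, 4], 4, c) for c in range(1, 5)]
```

In this small case the mean squared error shrinks with every additional cycle and reaches zero once all bits have been processed, which is the behaviour the statistics above describe in probabilistic terms.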
In the next two sections we show two VLSI implementations that use the successive approximation properties described above to achieve power scalability.
DSP for Physiological Monitoring
An example of power scalable processing using Distributed Arithmetic is a low power DSP for physiological monitoring. The biomedical sensor is a microphone for recording heartbeats, breathing sounds, and voice data. This data will eventually be used to determine the physical condition of the wearer. The first step is detection of the heartbeats, which can be used to determine heart rate as the basis for a physiological assessment.
Evaluation of the spectrogram of the acoustic data indicates that most of the energy from heartbeat sounds lies in the low frequency range, below 200 Hz. We developed a classifier-based approach to heartbeat detection that takes advantage of this spectral characteristic to improve detection performance in the presence of speech and other high frequency energy.
The basic algorithm is outlined below:
1. Preprocessing:
   Lowpass Filtering: The data is bandlimited to below 200 Hz to eliminate as much of the voice and breath energy as possible.
   Matched Filtering: The output of the lowpass filter is passed through a matched filter to determine the candidate heartbeat locations in the time domain.
   Segmentation: The sensor output is divided into overlapping segments at least long enough to contain a full heartbeat in the time domain, but short enough not to contain more than one.
2. Feature Extraction: A subset of seven features is computed from the matched filter output.
3. Classification: Each feature vector is classified as a heartbeat or non-heartbeat using a parametric Gaussian multivariate classifier [11].
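The classification step above can be sketched in a few lines. This is only an illustration under simplifying assumptions: each class is modelled with a diagonal covariance (a full multivariate Gaussian would use the complete covariance matrix), the two classes are given equal priors, and the two-dimensional training data is made up for the example, whereas the classifier described here uses seven features. All function names are hypothetical.

```python
import math


def fit_gaussian(samples):
    """Per-feature mean and variance (diagonal covariance, a simplifying assumption)."""
    n = len(samples)
    dims = len(samples[0])
    means = [sum(s[d] for s in samples) / n for d in range(dims)]
    variances = [sum((s[d] - means[d]) ** 2 for s in samples) / n
                 for d in range(dims)]
    return means, variances


def log_likelihood(x, model):
    """Log density of feature vector x under a diagonal Gaussian model."""
    means, variances = model
    return sum(-0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
               for xi, m, v in zip(x, means, variances))


def classify(x, beat_model, other_model):
    """Return True when x is more likely a heartbeat (equal priors assumed)."""
    return log_likelihood(x, beat_model) > log_likelihood(x, other_model)


# Made-up two-feature training vectors, for illustration only.
beat_model = fit_gaussian([(1.0, 0.9), (1.2, 1.1), (0.8, 1.0)])
other_model = fit_gaussian([(0.1, 0.2), (0.3, 0.0), (0.2, 0.4)])
is_beat = classify((1.1, 1.0), beat_model, other_model)
```

Comparing log-likelihoods rather than densities avoids numerical underflow when the number of features grows.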
Assuming that the lowpass filtering occurs before sampling as an antialiasing step, the first computationally significant step is to perform the matched filtering. The architecture of the proposed sensor DSP chip follows the algorithmic architecture described above. The discrete-time matched filter is implemented using the Distributed Arithmetic Unit. Its output is then passed to a nonlinear filtering unit to calculate quantities used in segmentation. The final segmentation, feature extraction, and classification are performed by the programmable microcontroller at the end to produce the class assignment.
The buffer provides a mechanism for synchronization between the front end filtering and the back end processing. This is necessary for power reduction. The filtering front end must be running continuously to process the input samples, which arrive at a fixed rate. However, the back end classification only needs to be performed for every segment, not every input sample. The system operates as follows: first, the front end filters the input and writes important results to the buffer. A small loop is continuously executed in the microcontroller, checking to see if a full segment has been written to the buffer. The filtering units could do this, but it involves adding circuits that already exist in the ALU of the microcontroller, which is idle anyway while it is trying to detect a segment. To conserve area, we use the microcontroller rather than add complexity to the filter functional units. When a segment is detected, the microcontroller executes the feature extraction and classification code on the data in the buffer.
Using Distributed Arithmetic, a complete dot product can be performed in as many cycles as there are bits in the input samples. If the bitwidth M is less than the filter length N, this implementation requires fewer clock cycles than a multiply-accumulate implementation. This is beneficial for long filters like the matched filter described above, where N > M. The reduced clock rate results in low total power not just through frequency reduction, but also through greater voltage reduction, since the delay constraint on the DA filter critical path is much less stringent than that of the multiply-accumulate architecture. Figure 4 shows the power reduction in the Distributed Arithmetic unit as the input quantization level is decreased (i.e. fewer bits of the input are shifted into the filter).
As discussed above, the bit-serial nature of the implementation also allows an alternative approach to approximate processing. By clocking the DA units for fewer cycles than the full bitwidth, we are in effect reducing the input quantization level. This is roughly equivalent to injecting noise at the input of the filter. In a detection scheme like the heartbeat detection algorithm, this reduced signal to noise ratio should result in lower performance, i.e. less reliable detection of heartbeat events. However, the reduced performance also comes with reduced power, since the switched capacitance per output filter sample decreases linearly with the number of input bits clocked in. Figure 5 shows the classifier performance reduction as the DA unit power is decreased.
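The equivalence between clocking fewer cycles and coarser input quantization can be made concrete with a small model. The sketch below assumes an MSB-first DA unit and hypothetical helper names; it checks that stopping after `kept` cycles gives the same result as an exact dot product on inputs whose low-order bits have been truncated. The `relative_switching` helper encodes the idealized linear relation between bits clocked in and switched capacitance stated above, and is not a measured power model.

```python
from itertools import product


def truncate_lsbs(x, nbits, kept):
    """Zero the (nbits - kept) least significant bits of a two's complement value."""
    drop = nbits - kept
    return (x >> drop) << drop


def da_first_cycles(coeffs, xs, nbits, kept):
    """MSB-first DA stopped after `kept` cycles, rescaled to full precision."""
    acc = 0
    for step in range(kept):
        bit = nbits - 1 - step
        word = sum(c for c, x in zip(coeffs, xs) if (x >> bit) & 1)
        acc = 2 * acc + (-word if step == 0 else word)  # first cycle handles the sign
    return acc << (nbits - kept)


def relative_switching(nbits, kept):
    """Idealized switched capacitance relative to full precision (linear in bits)."""
    return kept / nbits


def check_equivalence(coeffs, nbits):
    """True if early termination matches input truncation for every input vector."""
    lo, hi = -(1 << (nbits - 1)), 1 << (nbits - 1)
    for kept in range(1, nbits + 1):
        for xs in product(range(lo, hi), repeat=len(coeffs)):
            coarse = sum(c * truncate_lsbs(x, nbits, kept)
                         for c, x in zip(coeffs, xs))
            if coarse != da_first_cycles(coeffs, xs, nbits, kept):
                return False
    return True
```

Because the truncation always rounds toward negative infinity, the resulting quantization error has a nonzero mean, which is one reason the noise-injection description is only approximate.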
DCT Core Processor
As another example, consider a DCT core processor. The chip architecture and circuits are described in [12], while this section focuses on the algorithmic issues in its implementation. The DCT processor has been implemented using distributed arithmetic computation units in a precision-on-demand configuration in order to reduce average power dissipation [13]. This method exploits the fact that not all spectral coefficients have the same visual significance in an image or video processing application. Typically, a large number of high spatial frequencies are quantized to zero in a lossy image/video compression environment such as JPEG or MPEG with no significant change in visual quality. The DCT processor exploits such different precision requirements on a coefficient basis by reducing the number of iterations of the distributed arithmetic units that compute the visually insignificant spectral coefficients. A row-column classification scheme is implemented to further increase image quality while keeping arithmetic activity and power dissipation to a minimum: when the incoming pixel data exhibits increased (reduced) activity and low (high) correlation, the precision of the DA arith-
The chip average power dissipation varies with arithmetic precision as expected. Figure 6 plots average chip power dissipation vs. compressed image quality in terms of the image peak SNR (PSNR), a widely used quality measure in the image processing literature. The datapoints on the graph have been obtained by chip power measurements at different RAC maximum iteration settings. The measurements imply that the chip can produce on average 10 additional dB of image quality per milliwatt of power dissipation. Figure 7 displays the actual compressed images for three (power, PSNR) datapoints of Figure 6. Figures 6 and 7 establish our claim that the present chip trades off image quality and power dissipation. A chip microphotograph is shown in Figure 8.
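For readers unfamiliar with the quality measure used in Figure 6, PSNR can be computed as follows. This is a minimal sketch using the standard definition, 10 log10(peak^2 / MSE); the function name `psnr`, the assumption of 8-bit pixels (peak value 255), and the sample pixel values are illustrative and not taken from the measurements reported here.

```python
import math


def psnr(reference, decoded, peak=255):
    """PSNR in dB between two equally sized images given as lists of pixel rows."""
    total = count = 0
    for ref_row, dec_row in zip(reference, decoded):
        for r, d in zip(ref_row, dec_row):
            total += (r - d) ** 2
            count += 1
    mse = total / count
    if mse == 0:
        return math.inf  # identical images: no distortion
    return 10 * math.log10(peak ** 2 / mse)


# Illustrative 2x2 images that differ by one grey level in every pixel.
original = [[10, 200], [128, 64]]
degraded = [[11, 199], [129, 63]]
quality_db = psnr(original, degraded)
```

Since PSNR is logarithmic in the mean squared error, each additional 10 dB corresponds to a tenfold reduction in mean squared error.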
