In many signal processing applications, such as those used for video compression and data communication, power efficiency has mostly been ignored until recently. The reason for this was that the more power-consuming components in most integrated systems have been the data transferring modules, such as storage and display, rather than data processing modules. With the advances in storage and display technologies and the ever increasing complexity in processing requirements, power consumed in signal processing suddenly became a topic of interest in the early '90s. The increasing demand for portability also fuels the need to design "cool" chips. This need for low-power signal processing has spawned many excellent ideas and experiments from both industry and academia, who have made such tremendous progress that orders of magnitude in the reduction of power consumption have been repeatedly reported in the past few years. This article will give an overview of this power reduction process from the architectural and algorithmic design points of view.
In the near future, access to digital data will be extended beyond the workplace through high-bandwidth network backbones ranging well above 1 Gb/s. Network storage and compute servers will provide a number of new services, including online news and transportation information, on-demand educational and entertainment video, and remote access simulation and visualization. Ubiquitous portable access will be provided to the end user by digital radio transceivers. This ability to have data wirelessly communicated to users provides a challenge in the design of future portable systems, while yielding an additional degree of freedom in their implementation. For example, to be able to wirelessly receive and display video data at power levels consistent with portable operation will require substantial improvements in present compression algorithms and circuit implementations.
One of the primary objectives in the design of portable systems is power reduction. It might be hoped that the improvements in fabrication technology could be the answer to providing low-power computation. Unfortunately, the recent "advances" in microprocessor design, which have reported single-chip power dissipation levels upwards of 30 W, demonstrate that full exploitation of higher clock rates results in dramatic increases in power dissipation, not the needed reductions. However, as will be shown, the increased density, ultra-low-voltage operation, and reduced logic delay of scaled technologies can be coupled with an architectural design strategy to dramatically decrease the power requirements.
In this article, a portable video-on-demand system [1] will be used as an example to demonstrate the low-power design principles which are widely applicable to other wireless data processing systems. Video on demand requires a bandwidth far greater than broadcast video, because each user can subscribe to different video programs at any time, anywhere. Because of the large bandwidth demanded in both storage and transmission, data compression must be employed, requiring real-time video decoding in the portable unit. The key design consideration for portability is reduction of power consumption to allow extended battery life.
Besides the constraint on power, the compressed video signals delivered by wireless communication are often damaged in transmission. Since wireless channel characteristics usually cannot be predicted beforehand, the received video quality may be degraded. To design a portable video-on-demand system that is sufficiently resistant to transmission errors, the compression algorithm needs to not only deliver good compression performance, but also maintain a high degree of error tolerance. To meet these goals, the design of a portable video-on-demand system must satisfy the following three criteria: low power consumption, high compression efficiency, and channel error resilience.
Of the three criteria, minimal power consumption is the guiding principle for both algorithm development and system trade-off evaluation. Our video decoder operates at a power level that is more than two orders of magnitude below that of comparable decoders in similar technology. This tremendous savings in power consumption was attained through both algorithm reformulation and architectural innovations specifically targeted for energy conservation.
This article is organized as follows. The next section discusses low-power design principles; the section following that offers a brief overview of the compression algorithms developed for our portable video system and its error-tolerant capability for wireless communication under various channel error conditions. We then discuss the low-power architectural strategies employed in designing our decoder chipset. The amount of power savings achieved in each step will be quantified and substantiated by measured data. Finally, we offer our conclusions and outlook for the future.
Low-Power Signal Processing System Design for Wireless Applications
Low-Power Design
Even with the most advanced wireless link connected to the network backbone, there remains considerable signal processing computation in the implementation of high-speed radio modems and audio/video signal processing. To obtain the lowest-power solution, optimization over all aspects of system implementation must be employed, including the algorithms, architectures, circuit design, and fabrication technology [2].
System Design Principles
The main contribution to power consumption in complementary metal-oxide-semiconductor (CMOS) circuits is attributed to the charging and discharging of parasitic capacitors that occur during logical transitions. The average switching energy of a CMOS gate (or the power-delay product) is given by the following equation:

$$E_{switch} = C_{avg} \cdot V_{dd}^{2}$$

where $C_{avg}$ is the average capacitance being switched per clock, and $V_{dd}$ is the supply voltage. The quadratic dependence of energy on voltage makes it clear that operating at the lowest possible voltage is most desirable for minimizing the energy per computation; unfortunately, reducing the supply voltage increases gate delays and therefore reduces computational throughput. One way to compensate for these increased delays is to use architectures that reduce the speed requirements of operations while keeping throughput constant. In signal processing applications, as distinct from general-purpose computing, the design constraint is a fixed throughput rather than computing as quickly as possible. For example, in the case of video decompression there would be no advantage in decompressing faster than the 30 ms display update interval. One architectural approach for maintaining throughput with slower circuitry is to use parallelism through hardware duplication. By using identical units in parallel, the speed requirements on each unit are reduced, allowing for a reduction in voltage.
For example, consider a datapath module running at a frequency of $f$ while switching an average capacitance $C_{avg}$ at a supply voltage of 5 V. Parallelizing by a factor of two will result in two units working at half the clock frequency while maintaining the original throughput. Since the speed requirements on each module have been lowered by a factor of two, the supply voltage can be lowered until the gate delays increase by a factor of two, which corresponds to a voltage of 2.9 V from simple first order theory [2]. By allowing the circuit to operate at half the speed while maintaining the same system throughput, the power of the overall system then becomes

$$\text{Power} = 2 C_{avg} \cdot (2.9\ \text{V})^{2} \cdot \frac{f}{2} = C_{avg} \cdot (2.9\ \text{V})^{2} \cdot f$$

which is approximately three times lower than the $C_{avg} \cdot (5\ \text{V})^{2} \cdot f$ of the original implementation with one unit. Likewise, pipelining is a similar technique that can be exploited to recoup speed loss at lowered supply voltages, thus enabling a significant reduction in power. These system design strategies make it possible to trade off silicon area for reduced power consumption, certainly reasonable for portable devices, since improvements in technology place availability of transistors as a secondary concern to power dissipation. The issue of how far this approach can be taken is interesting to address. First, algorithms must embed sufficient parallelism so that the requirement on throughput can be achieved through parallel processing. Sequential algorithms, such as the Viterbi decoding used in decoding convolutional codes and eliminating intersymbol interference in data communication, will need to be reformulated for a low-power implementation [3].
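The arithmetic behind the two-way parallel example can be checked with a few lines of Python. This is only a sketch: the function name `switching_power` and the normalized values of the capacitance and clock frequency are illustrative assumptions; the 5 V and 2.9 V supply voltages are the only figures taken from the text.

```python
def switching_power(c_avg, v_dd, f_clk, units=1):
    """Dynamic power of `units` identical CMOS modules: units * C_avg * V_dd^2 * f."""
    return units * c_avg * v_dd ** 2 * f_clk


# Normalized capacitance and clock frequency (assumed; only the ratio matters).
C_AVG = 1.0
F_CLK = 1.0

# Reference design: one unit at 5 V running at the full clock rate.
p_single = switching_power(C_AVG, 5.0, F_CLK)

# Two-way parallel design: two units at 2.9 V, each at half the clock rate.
p_parallel = switching_power(C_AVG, 2.9, F_CLK / 2, units=2)

reduction = p_single / p_parallel
print(f"power reduction factor: {reduction:.2f}")
```

The ratio depends only on the two supply voltages, since the doubled capacitance and the halved clock rate cancel.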
One limit is reached when the noise margin of devices is degraded at extremely low supply voltages, although the slower clock rates may in turn reduce the ground bounce problems. However, before this limit is reached the overhead in increased parallelism to compensate for the reduced logic speed begins to dominate, increasing the power faster than the voltage reduction decreases it. Through a number of examples, the optimum supply voltage was found to occur in the range between 1 V and 1.5 V, resulting in an order of magnitude lower power dissipation than conventional "low-power" 3.3 V scaling [4, 5].
Understanding the dependency of energy consumption on supply voltage, the next step is to develop low-power circuit techniques so that architectural trade-offs can be made based on the power dissipation of different types of circuit operations.
Architectural Design Principles
A more complete discussion of low-power CMOS circuit design is given in [4]. In our low-power cell library design, the static CMOS logic style was chosen for its reliability and relatively good noise immunity. Table 1 summarizes the HSPICE simulations of the energy consumed per operation in our cell library, which gives us a guideline for eliminating some operations in favor of others in designing low-power signal processing architectures. For example, an SRAM write access consumes nine times, an off-chip I/O access ten times, and a multiplication approximately four times more energy than a simple add operation. Given that it is possible to implement an algorithm in many different ways, the motivation is to replace memory accesses with computation, since computation dissipates less energy per operation. This trade-off is crucial to architectural design, as will be discussed later.
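The energy ratios quoted above can be turned into a rough cost model for comparing candidate architectures. The sketch below uses only the four ratios stated in the text; the dictionary keys, the function `relative_energy`, and the two operation mixes are hypothetical and chosen purely for illustration.

```python
# Energy per operation relative to one add, using the ratios quoted in the text.
RELATIVE_ENERGY = {
    "add": 1.0,
    "multiply": 4.0,
    "sram_write": 9.0,
    "offchip_io": 10.0,
}


def relative_energy(op_counts):
    """Total energy of an operation mix, in units of one add."""
    return sum(RELATIVE_ENERGY[op] * n for op, n in op_counts.items())


# Two hypothetical ways to produce the same output sample (counts are made up):
# the first stores an intermediate value in SRAM, the second recomputes it.
store_and_reuse = {"add": 4, "sram_write": 1}
recompute = {"add": 8}

print(relative_energy(store_and_reuse))  # 13.0
print(relative_energy(recompute))        # 8.0
```

Under these made-up counts, doubling the number of adds still costs less than a single SRAM write, which is the kind of trade-off the text describes.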
The relatively large energy dissipated by off-chip I/O accesses is the motivation for minimizing external data communication. In our design, we eliminated external frame buffers and placed memory on chip whenever possible.
Because high-quality video implies high pixel bandwidth with at least millions of pixel operations per second, external frame buffers lead to huge energy waste and overwhelm all other low-power improvements made. Requiring no off-chip memory support is one of the main factors enabling our portable decoder to deliver high-quality video at a minimal power level.
Video Compression for Wireless Communication
Most low-power video encoder/decoder designs assume standard algorithms [6] [7] [8] , and therefore these designs either cannot meet our extremely low power requirements (less than 10 mW) [9] [10] [11] , or have to sacrifice the compression performance [4] . Since our application is the delivery of video-on-demand, the quality of decompressed video must not fall below the level of industry standards. The selection of compression algorithms, therefore, should be directed by their hardware implementations to achieve the goal of minimal power dissipation.
Our strategy is to design a compression algorithm with performance meeting that of industry standards, while at the same time requiring only minimal power in both encoding and decoding operations.
Standard compression algorithms used in most "wired" applications such as the PC-based multimedia functions (JPEG, MPEG, and H.261) deliver fairly high compression efficiency, capable of compressing image data by a factor of 20 and video data by a factor of 80 without introducing visible quality degradation. Figure 1 displays the quality of the decompressed image Man using the JPEG compression procedure at compression ratios of 12 (Fig. 1b) and 48 (Fig. 1c).
Error-Correcting Codes vs. Error-Resilient Compression
Wireless communication, however, usually requires digital data to be protected against channel error, because the communication links tend to be less reliable than in a wired environment. One major feature of the standard compression algorithms is that their compressed bit rate is not constant, but a function of image content. Variable-rate codes are highly susceptible to channel error, since a single bit error can cause error propagation and produce totally erroneous decoded results until the next synchronization point. Figure 1d shows the decompressed image corrupted by random bit errors at a rate of $5 \times 10^{-3}$, which demonstrates the effects of error propagation as a result of variable-rate codes.

Because our portable video-on-demand system is to be embedded in a wireless communication environment, the first step in designing a suitable compression algorithm is to analyze the wireless channel characteristics. Experiments with mobile receivers and transmitters showed that received signal power may experience fades exceeding 10 dB approximately 20 percent of the time, and 15 dB (measuring limit) up to 10 percent of the time [12], causing bursty transmission bit errors in the received data stream [13]. To "randomize" burstiness, the transmitted data stream can be interleaved over a certain period of time so that if consecutive bit errors occur (as is typical with bursty bit errors) in the received data stream, the error pattern appearing in the decoded data stream will resemble that of random bit errors. Since the time interval of deep fades is usually within tens of milliseconds while the data rate of compressed video is between 500 kb/s and 1.5 Mb/s, interleaving can easily be accommodated with a data buffer, the size of which is in the range of tens of kilobits. We will therefore consider transmission bit errors to be random in the received data stream. To reduce random bit errors, error-correcting codes are usually used.
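The interleaving idea described above can be illustrated with a simple block interleaver. The text does not specify which interleaver is used, so the row/column scheme, the function names, and the tiny 4 x 6 block below are assumptions made for the sake of a short example.

```python
def interleave(bits, rows, cols):
    """Write row by row into a rows x cols block, read column by column."""
    assert len(bits) == rows * cols
    return [bits[r * cols + c] for c in range(cols) for r in range(rows)]


def deinterleave(bits, rows, cols):
    """Inverse of interleave: restore the original row-major order."""
    assert len(bits) == rows * cols
    out = [0] * (rows * cols)
    i = 0
    for c in range(cols):
        for r in range(rows):
            out[r * cols + c] = bits[i]
            i += 1
    return out


ROWS, COLS = 4, 6
sent = list(range(ROWS * COLS))      # use stream positions as payload to track them
channel = interleave(sent, ROWS, COLS)

# A burst hits four consecutive symbols on the channel.
burst = set(channel[8:12])
received = deinterleave(channel, ROWS, COLS)
error_positions = sorted(i for i, v in enumerate(received) if v in burst)
print(error_positions)
```

After deinterleaving, the four symbols hit by the burst are no longer adjacent in the decoded stream, so the error pattern looks closer to isolated random errors.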
With the knowledge of a target channel bit error rate (BER) and the channel's signal-to-noise ratio (SNR), error-correcting codes can be very effective in correcting random bit errors. In wireless communication, however, the BER and the channel's instantaneous SNR vary widely over time and space, making a fixed-point design, as in the selection of an appropriate error-correcting code, a suboptimal solution at best.
For wireless and mobile channels that experience very low BERs most of the time, but with occasional intermittent severe channel degradation, error-correcting codes are not as effective as one might think. On the one hand, under low distortion conditions (low BERs), when almost perfect transmission is available, error-correcting codes incur unnecessary bandwidth overhead, generating redundant bits that could be used for representing compressed video data to improve video quality. This bandwidth overhead can be substantial, from 10 to 50 percent in recent mobile transceiver designs [14] . On the other hand, under severe distortion conditions in which the BER exceeds the designed capacity, error-correcting codes may actually introduce more decoded errors than received ones. On top of all this, error-correcting codes require additional hardware at the decoder unit, making the design of portable decoders a more difficult task.
Attempts to reduce the effects of channel error on image quality using combined channel and source coding have been proposed [15]. Recent developments in rate control algorithms also try to address this problem at the expense of retransmission and latency [16, 17]. Our approach to error tolerance, however, is to design error-resilient compression algorithms so that if channel distortion does occur, its effect will be a gradual degradation of video quality, and the best possible quality will be maintained at all BERs. In this way, we do not pay a high premium in bandwidth overhead when protection against error is not needed (under low channel distortion), and still deliver reasonable-quality video when error-correcting codes would have failed (under severe channel distortion). Furthermore, an overriding assumption automatically made in using error-correcting codes is exact binary reproduction of the transmitted bits. Because our goal is to transmit compressed video data, and human vision is fairly fault-tolerant, exact binary reproduction at the decoder is not necessary. As long as the effects of error are localized and do not cause catastrophic loss of image sequences, a robust error-resilient compression algorithm without resorting to error-correcting codes for data protection can be the solution, and our best compromise among the goals of consistent video quality, minimum transmission bandwidth, and low-power implementation.
Error-Resilient Compression
Our compression algorithm is based on subband decomposition [18] and pyramid vector quantization (PVQ) [19, 20], which performs as well as the standard JPEG compression in terms of image quality and compression efficiency, while incurring much less hardware complexity and exhibiting a high degree of error resiliency [26].
A Brief Overview of Subband Decomposition - Subband decomposition, as shown in Fig. 2, divides each video frame into several subbands by passing the frame through a series of 2-D low-pass and high-pass filters, each followed by a downsampling operation by a factor of two. Each level of subband decomposition divides the image into four subbands. The image can be hierarchically decomposed to multiple levels. We refer to the pixel values in each subband as subband coefficients. Compression is achieved by quantizing each of the subband coefficients at a different bit rate according to the amount of energy in and relative visual importance of each subband.
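One level of the decomposition described above can be sketched as follows. To keep the example short it uses the two-tap Haar pair (average and difference) instead of the authors' filters, and it assumes even frame dimensions; the function names and the LL/LH/HL/HH labels are illustrative choices, not taken from the article.

```python
def analyze_1d(signal):
    """Split a 1-D sequence into low-pass and high-pass halves, downsampled by 2.

    Uses the 2-tap Haar pair (average and difference) purely for illustration.
    """
    low = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal) - 1, 2)]
    high = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal) - 1, 2)]
    return low, high


def transpose(m):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def decompose_2d(frame):
    """One level of 2-D decomposition: returns four quarter-size subbands."""
    # Horizontal filtering and downsampling of every row.
    rows_low, rows_high = zip(*(analyze_1d(row) for row in frame))
    subbands = []
    for half in (rows_low, rows_high):
        # Vertical filtering and downsampling of every column.
        cols_low, cols_high = zip(*(analyze_1d(col) for col in transpose(half)))
        subbands.append(transpose(cols_low))
        subbands.append(transpose(cols_high))
    ll, lh, hl, hh = subbands
    return ll, lh, hl, hh


flat = [[8] * 4 for _ in range(4)]          # a constant 4 x 4 test frame
ll, lh, hl, hh = decompose_2d(flat)
print(ll, hh)
```

For a constant frame all of the energy ends up in the lowest-frequency subband, which is why the higher subbands can be quantized at lower bit rates. Applying `decompose_2d` again to `ll` gives the next level of the hierarchy.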
Unlike the discrete cosine transform (DCT) used in standard compression algorithms, the subband approach does not introduce blocking artifacts. Subband-decompressed images are therefore usually considered more visually pleasing. In addition, the hierarchical nature of subband decomposition allows for flexibility in bit allocation, with different bit rates assigned to different subbands based on information content, visual importance, and subband size. Finally, subband decomposition provides many possible architectural trade-offs to achieve a low-power implementation.
A Brief Overview of Pyramid Vector Quantization - Pyramid vector quantization (PVQ) is a vector quantization technique that groups data into vectors and scales them onto a multidimensional pyramid surface. PVQ provides several distinct advantages when used in wireless transmission. First, it is a fixed-rate code, which results in hardware simplicity and prevents catastrophic error propagation. Although a single bit error in a fixed-rate encoded sequence can cause some deviation of decoded pixel values, at least the error would not corrupt the information on pixel positions. Because wireless channels are much noisier than wired channels, and their channel characteristics depend highly on the positions of and interference pattern between the transmitter and receiver at any point in time, a fixed-rate code that does not cause error propagation would be desirable for wireless transmission. Second, because of its regular lattice structure, PVQ allows for simple real-time decoding and encoding. Third, when optimized for the statistics of image data, PVQ provides excellent rate-distortion performance for moderate to high bit rates, achieving the compression performance of the best scalar-quantized variable-rate codes asymptotically [22, 23].
Unlike the standard VQ schemes, which require codebook storage, product PVQ relies on intensive arithmetic computation to perform encoding and decoding [20]. The product PVQ encoding process is as follows: a data vector, formed with L values, is encoded by scaling the vector onto an L-dimensional pyramid surface and finding the nearest lattice point on the pyramid (Fig. 3). Both the scaling factor and an index that corresponds to that lattice point are transmitted. The decoding process converts the index back into an L-dimensional vector and scales that vector using the scaling factor. Since the lattice points on the pyramid are regularly spaced and are described by recursive equations of combinatorial values, encoding and decoding PVQ indices is performed with arithmetic computation, using shifts, subtracts, lookups, and compares.
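The scaling and nearest-lattice-point step can be sketched as below. The pyramid is taken here to be the set of integer vectors whose absolute values sum to a constant `k`, and the search uses a common round-then-correct greedy procedure; both the parameter name `k` and the correction strategy are assumptions for illustration and are not claimed to be the authors' implementation. Mapping the lattice point to a transmitted index is not shown.

```python
def nearest_pyramid_point(x, k):
    """Map a real vector x to an integer vector y with sum(|y_i|) == k.

    Returns the lattice point and the scale factor that maps it back
    to the range of x.
    """
    norm = sum(abs(v) for v in x)
    if norm == 0:
        # Degenerate input: put all pulses in the first component (arbitrary).
        return [k] + [0] * (len(x) - 1), 0.0
    scaled = [abs(v) * k / norm for v in x]    # project magnitudes onto the pyramid
    y = [int(s + 0.5) for s in scaled]         # round magnitudes to integers
    # Rounding may leave too many or too few pulses; adjust the component
    # where the correction adds the least error.
    while sum(y) > k:
        i = max((j for j in range(len(y)) if y[j] > 0),
                key=lambda j: y[j] - scaled[j])
        y[i] -= 1
    while sum(y) < k:
        i = max(range(len(y)), key=lambda j: scaled[j] - y[j])
        y[i] += 1
    signed = [-m if v < 0 else m for m, v in zip(y, x)]
    return signed, norm / k


point, scale = nearest_pyramid_point([1.0, -2.5, 0.5], 4)
print(point, scale)
```

Decoding in this sketch is just `[scale * v for v in point]`, which mirrors the description of converting the index back to a vector and rescaling it.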
Robust Enumeration of the Multidimensional Pyramid - Error resilience can be achieved through permutation of the PVQ codebook indices to minimize the effect of bit errors on the indices. In general, close Hamming neighbors in the codebook indices should correspond to close spatial neighbors on the pyramid. Since PVQ codebooks are usually far too large to have such a permutation map stored in memory, we need to use enumeration methods that automatically yield robust indices. The enumeration schemes that we developed use conditional product codes [21], which have a 3 dB advantage over the magnitude [19] and linear enumerations [24] under random bit errors.
Unlike linear and magnitude enumerations, which divide the subranges by the value of the vector elements, conditional product code enumeration forms subranges based on the number of nonzero elements in the vector. Since each nonzero element is symmetric in distribution, each has a corresponding sign bit. These sign bits can be packed into the least significant bits of the index. As long as the number of nonzero elements can be correctly decoded from the corrupted index, the number of nonzero elements and their signs will be decoded independent of each other, limiting the effects of random bit errors. Figure 4 shows the classification of pyramid points based on the number of nonzero elements.
Once the subrange representing the number of nonzero elements is determined, the pattern of the nonzero elements, the magnitudes, and the sign bits can be enumerated, forming a conditional product code to represent each pyramid index. The information structure of the conditional product code is shown in Fig. 5 .
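The benefit of keeping sign bits in their own field can be seen in a small sketch. It separates a lattice vector into the fields named above (pattern of nonzero positions, magnitudes, sign bits) and rebuilds it, but it does not pack the pattern and magnitudes into an integer index, since the text does not give that enumeration. All function names and the field layout are hypothetical.

```python
def split_fields(y):
    """Separate a lattice vector into pattern, magnitudes, and packed sign bits."""
    positions = [i for i, v in enumerate(y) if v != 0]   # pattern of nonzero elements
    magnitudes = [abs(y[i]) for i in positions]
    sign_bits = 0
    for i in positions:                                  # one sign bit per nonzero element
        sign_bits = (sign_bits << 1) | (1 if y[i] < 0 else 0)
    return positions, magnitudes, sign_bits


def merge_fields(length, positions, magnitudes, sign_bits):
    """Rebuild the vector; each sign bit affects exactly one element."""
    y = [0] * length
    n = len(positions)
    for k, (i, m) in enumerate(zip(positions, magnitudes)):
        negative = (sign_bits >> (n - 1 - k)) & 1
        y[i] = -m if negative else m
    return y


vector = [0, 3, -1, 0, -2]
positions, magnitudes, sign_bits = split_fields(vector)

# Flip the least significant sign bit, as a single channel bit error would.
corrupted = merge_fields(len(vector), positions, magnitudes, sign_bits ^ 1)
print(corrupted)
```

A bit error confined to the sign field changes the sign of one element and leaves the positions and magnitudes of all elements intact, which is the localization property the enumeration aims for.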
Pyramid Vector Quantization for Images
Pyramid vector quantization has been previously used for DCT [25, 26] and subband [20, 21] image compression because the statistics of both the transformed and subband coefficients are similar to the Laplacian in distribution. PVQ has also been considered in the context of matched joint source-channel coding for images [27, 28], where the channel statistics are stationary and known. The following two simulations demonstrate the advantages of our error-resilient PVQ coding schemes for subband image compression under varying channel conditions.
Fixed-Rate Subband/PVQ Coding - After subband decomposition as shown in Fig. 2, each frequency band is coded using a fixed-rate quantizer. The lowest frequency subband (the DC band) is quantized with a Gaussian Lloyd-Max scalar quantizer. The higher-frequency subbands are PVQ-encoded with vector dimensions ranging from 4 to 64, depending on the energy variance of each subband. The bit allocation for the PVQ quantizer is determined through integer bit allocation techniques.

Figure 3. PVQ encoding on a 3D pyramid surface, showing an image data vector (y1, y2, y3).

Figure 4. Classification of the lattice points on the pyramid by nonzero elements.

To demonstrate the compression performance of this fixed-rate subband/PVQ algorithm, we applied it to the USC database images Lena, Lake, Couple, and Mandrill. As shown in Fig. 6, it outperforms JPEG at all bit rates of interest (JPEG with scaled default quantization matrices and custom-generated Huffman tables). Image quality was measured by peak signal-to-noise ratio (PSNR), defined as the maximum signal power over the mean squared compression error:

$$\mathrm{PSNR} = 10 \log_{10} \frac{P}{\frac{1}{N} \sum_{n=1}^{N} \left( s(n) - \hat{s}(n) \right)^{2}} \qquad (1)$$

where $P$ is the peak signal power, $s(n)$ the original pixel value, $\hat{s}(n)$ the decompressed pixel value, and $N$ the number of pixels. To show the quality of decompressed images, Fig. 7 displays the image Man compressed using our subband/PVQ scheme at compression ratios 12 (Fig. 7a) and 48 (Fig. 7b).
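The PSNR measure described above (peak signal power over the mean squared compression error, expressed in decibels) can be computed as in the following sketch. The default peak value of 255 assumes 8-bit pixels, and the function name and flat pixel lists are illustrative choices rather than details from the article.

```python
import math


def psnr(original, decoded, peak=255.0):
    """Peak signal-to-noise ratio in dB for two equal-length pixel sequences."""
    assert len(original) == len(decoded) and len(original) > 0
    mse = sum((s - d) ** 2 for s, d in zip(original, decoded)) / len(original)
    if mse == 0:
        return math.inf                      # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


reference = [52, 55, 61, 66]
lossy = [54, 55, 60, 66]
print(round(psnr(reference, lossy), 2))
```

For a two-dimensional image the same function applies after flattening the rows into a single sequence.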
Channel Robustness - For our channel robustness experiments, we use a simple channel model. The encoder always transmits the same compressed image regardless of the actual channel because it does not know how many receivers are listening or the channel noise characteristics at each receiver. The receiver, on the other hand, has knowledge of its own individual channel noise conditions: both the average bit error rate and the presence of total signal loss in a deep fade. This model conveys some of the problems associated with mobile communication [12, 31].
In our PVQ experiments we only protect the scalar quantized DC band by repeating the two most significant bits of each index three times. This incurs negligible overhead (about 0.016 b/pixel) and requires just a simple majority decoder to correct bit errors in the DC band. All other bands are PVQ-encoded and are not protected to show the inherent error resilience offered by the product enumeration technique.
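The DC-band protection described above amounts to a three-fold repetition code on two bits with majority decoding. The sketch below shows one way to express it; the index width, the way the copies are carried alongside the index, and the function names are assumptions, since the text does not describe the bit layout.

```python
def protect(index, width):
    """Return the index plus two extra copies of its two most significant bits."""
    msbs = (index >> (width - 2)) & 0b11
    return index, msbs, msbs              # three copies of the MSBs in total


def recover(index, copy1, copy2, width):
    """Majority-vote the two MSBs bit by bit and splice them back into the index."""
    received = (index >> (width - 2)) & 0b11
    voted = (received & copy1) | (received & copy2) | (copy1 & copy2)
    low_mask = (1 << (width - 2)) - 1
    return (voted << (width - 2)) | (index & low_mask)


WIDTH = 8                                  # assumed index width for illustration
original = 0b10110101
index, c1, c2 = protect(original, WIDTH)

# Flip the most significant bit of the transmitted index.
damaged = index ^ (1 << (WIDTH - 1))
print(recover(damaged, c1, c2, WIDTH) == original)
```

Any single error among the three copies of each protected bit is corrected by the vote; errors in the unprotected low-order bits pass through and cause only small deviations in the decoded value.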
For comparison we use the JPEG coder with resynchronization every six macroblocks (JPEG-R) and the JPEG coder with resynchronization and (2,1,6) Viterbi decoding (JPEG-R (2,1,6)VD) [32, 33] . These JPEG implementations are compared with the product enumerated PVQ technique in Fig. 8 for Lena at 0.5 b/pixel. Clearly, PVQ offers both better intrinsic coding performance and additional error resilience. Only with sophisticated channel coding can the performance of JPEG be made error-resilient, but that significantly reduces the noiseless source coding performance.
On the curve for JPEG with error correction, we notice that the error-correcting code does an excellent job of eliminating bit errors, but starts to fail at BERs around $3 \times 10^{-2}$. Once the BER increases past this point catastrophic errors occur, causing severe block loss and a rapid drop in PSNR performance. Our fixed-rate code, however, maintains a gradual decline in quality, even under severe BER conditions. This gradual degradation in performance makes the subband/PVQ scheme well suited for situations where image quality must be maintained under deep channel fades and severe data loss, a characteristic of wireless transmission.
To show the effects of bit errors on decompressed images, we used the Mandrill image compressed at 0.66 b/pixel as an example. We corrupted the transmitted data at BERs of $10^{-3}$ and $2 \times 10^{-2}$. Figure 9 shows decompressed images under such channel conditions. At a BER of $10^{-3}$ no noticeable artifact caused by channel error is present in either the subband/PVQ-compressed (Fig. 9a) or the JPEG-compressed (Fig. 9c) image, indicating that the error-correcting code is quite effective in correcting random bit errors at this rate. At a BER of $2 \times 10^{-2}$, the subband/PVQ-compressed image (Fig. 9b) is still recognizable with reasonable quality, although bit errors cause ringing artifacts, locally distorting the image. However, in the JPEG-compressed image (Fig. 9d), bit errors become very noticeable when error propagation occurs, causing total block loss.
As for the quality of compressed video, when the BER approaches $10^{-3}$ the random loss of blocks in JPEG-compressed video sequences becomes evident, causing continuous flickering that makes it difficult to view the video. In subband/PVQ-compressed video, increasing bit errors cause increased blurriness and wavering artifacts, but most of the video details remain distinguishable for BERs past $10^{-2}$.
Footnote 1: It is difficult to quantify the exact amount of power consumed by off-chip memory access, which depends on the memory technology used and is usually much higher than tens of milliwatts.

A Low-Power Portable Video Decoder
Compared with the C-Cube JPEG decoder [34], implemented in 1.2 µm CMOS technology and dissipating approximately 1 W while decoding video at 30 frames/s, our subband/PVQ decoder is more than 100 times more power efficient, not counting the power dissipated in accessing off-chip memory necessary in the JPEG decoding operation (see footnote 1). Within this factor of 100, a factor of 10 can be easily obtained by voltage scaling of the power supply, which can be applied to existing designs. Reduced supply voltage, however, increases circuit delay. To maintain the same throughput, or real-time performance, the hardware must be duplicated, increasing the total chip area by at least the same amount. The fact that our video-rate decoder can be implemented with less than 1,500,000 transistors (including 300,000 transistors for on-chip memory) operating at a power supply voltage less than 1.5 V, without requiring any off-chip memory support, indicates the efficiency of our decompression procedure.
A Low-Power 2D Subband Decoder
In designing the subband decoder chip [5] we emphasized a low-power implementation without introducing noticeable degradation in decompressed video quality. Since memory accessing is by far the most power consuming operation, the main design strategy has been to eliminate memory accesses in favor of on-chip computation.
Filter Implementation - A major consideration in subband coding is the choice of filters. This determines the amount of computation required and the number of line delays needed, as well as affecting the algorithm's compression performance. Extensive simulations were performed to select the filter, resulting in a short four-tap asymmetric wavelet filter (3,6,2,-1) as the low-pass kernel filter [35]. This filter performs comparably to the more commonly used nine-tap quadrature mirror filter but uses only one third of the power and greatly reduces the line delay memory requirements. The filter implementation, as shown in Fig. 10, uses shifts and adds to implement multiplications of the simple filter coefficients, requiring only a single 3-2 adder. The same hardware also implements the high-pass filter by reversing the coefficients and negating the (6, -1) coefficient pair. The rounded results are stored in the line-delay memory and passed to the vertical filter, which operates nearly identically to the horizontal filter but with one input from the line-delay memory and the other from the output of the horizontal filter, forming two vertically oriented values.
Internal Precision of the Datapath - In digital signal processing design, the word length of internal data representations is an important parameter determining the accuracy of the final output. Because video signals at an SNR higher than 46 dB usually appear nearly perfect, a word size of 12 bits is usually used, of which 8 bits are necessary for the input analog-to-digital conversion and 4 bits for internal calculation to avoid overflow/underflow. Our simulations indicated that, when using rounding instead of truncation in quantization, a word size two bits smaller will achieve a similar output SNR. This allows a word size of only 10 bits in our internal data representation, reducing the power and area of all datapath circuits, including the internal memory which uses the most silicon area. This 10-bit rounding strategy provides a power reduction of nearly 20 percent over the 12-bit one and a power reduction of 40 percent over the standard 16-bit approach with no perceptible loss in video quality.
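The multiplier-free evaluation of the (3,6,2,-1) kernel can be sketched in software as follows. The low-pass function follows directly from the coefficients given in the text. The high-pass function is one reading of "reversing the coefficients and negating the (6, -1) coefficient pair", giving taps (1, 2, -6, 3); that interpretation, the function names, and the omission of any output normalization or rounding are assumptions.

```python
def lowpass_tap_sum(x0, x1, x2, x3):
    """3*x0 + 6*x1 + 2*x2 - x3 using only shifts, adds, and one subtract."""
    three_x0 = (x0 << 1) + x0             # 3x = 2x + x
    six_x1 = (x1 << 2) + (x1 << 1)        # 6x = 4x + 2x
    two_x2 = x2 << 1                      # 2x
    return three_x0 + six_x1 + two_x2 - x3


def highpass_tap_sum(x0, x1, x2, x3):
    """Assumed high-pass taps (1, 2, -6, 3), built from the same shift-add terms."""
    two_x1 = x1 << 1
    six_x2 = (x2 << 2) + (x2 << 1)
    three_x3 = (x3 << 1) + x3
    return x0 + two_x1 - six_x2 + three_x3


samples = (12, 40, 37, 9)
print(lowpass_tap_sum(*samples), highpass_tap_sum(*samples))
```

Under this reading the high-pass taps sum to zero, so a constant input produces zero output, as expected of a high-pass kernel.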
Memory Design -The size of the required internal memory is critical for achieving a low-power implementation. First, the power consumed in the memory increases with its size. Second, the size of these memory units determines if they can be kept on chip, greatly reducing the I/O power. Therefore, a power-expensive frame buffer is eliminated by generating the output data in the raster scan
External Accesses -External data accesses represent another major energy-consuming operation that needs to be minimized. To reduce the communication bandwidth between the PVQ decoder and the subband decoder chips, zero runlength encoding is used to take advantage of the large number of zeros in the three highest-frequency subbands. The PVQ decoder transmits to the subband decoder the values of nonzero coefficients and a zero runlength to the next nonzero coefficient. Consequently, power is saved by reducing the size of the PVQ output buffer and the number of external accesses. By accepting zero runlengths between nonzero coefficients, the number of external reads per pixel for the subband decoder chip is reduced from 1.5 to 0.57. Without considering external access reductions, the total power of the subband chip would have been 40 percent more.
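A zero runlength scheme of this kind can be sketched in a few lines of Python. The actual word format exchanged between the two chips is not given here, so the (value, run) pairs and the separate count of leading zeros are assumptions made purely for illustration.

```python
# Minimal sketch of zero runlength coding (assumed pair format).

def zrl_encode(coeffs):
    """Return (leading_zeros, [(value, zeros_before_next_nonzero), ...])."""
    leading = 0
    pairs = []
    for c in coeffs:
        if c != 0:
            pairs.append((c, 0))
        elif pairs:
            value, run = pairs[-1]
            pairs[-1] = (value, run + 1)   # extend the run after the last value
        else:
            leading += 1                   # zeros before the first nonzero value
    return leading, pairs


def zrl_decode(leading, pairs):
    """Rebuild the coefficient sequence from its runlength form."""
    out = [0] * leading
    for value, run in pairs:
        out.append(value)
        out.extend([0] * run)
    return out
```

For the sequence `[0, 0, 5, 0, 0, 0, -2, 1, 0]`, the encoder emits three pairs instead of nine coefficients, which is the kind of reduction in external transfers described above.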
Color Conversion -The chip includes the YUV-to-RGB color conversion operation because of its potentially high power dissipation if implemented off-chip. The conversion algorithm is greatly simplified but with no visual degradation in image quality [36]. This provides an efficient implementation requiring only five carry-select adds and one carry-save add per pixel per RGB component. Our color conversion circuit consumes only 90 µW at a 1 V supply for three components at 1.27 Mpixels/s.
A video timing controller included in the chip regulates reading the YUV data from the final-result buffer (as shown in Fig. 10) and the generation of the RGB outputs. Programmable timing parameters for vertical and horizontal sync and blanking intervals specify the video synchronization signal needed by the display device. The RGB outputs are sent directly through a DAC to the display device without the need for a high-power frame buffer.
Performance -The subband decoder chip, implemented in 0.8 µm CMOS technology, occupies a chip area of 9.5 x 8.7 mm² (with an active area of 44 mm²) and contains 415,000 transistors. At a 1 V supply voltage, the chip runs at a maximum frequency of 4 MHz and delivers one output pixel of three color components every two clock cycles. To satisfy the throughput requirement of 1.27 Mpixels/s (176 x 240 pixels/frame at 30 frames/s), the subband decoder chip only needs to run at 3.2 MHz. This fairly low operating frequency is an indication of the efficiency of our compression algorithm. Even for video of the standard SIF format (352 x 240 pixels/frame at 30 frames/s), the subband decoder chip only needs to run at 6.4 MHz, while most JPEG-based decoding chips are required to run at a frequency between 30 MHz and 50 MHz [34].
The peak performance at 5 V generates 60 Mpixels/s of RGB components with a 120 MHz operating frequency while dissipating 1.2 W. For a target rate of 3.2 MHz, the chip provides significant excess throughput, allowing it to meet the requirements with a 1 V supply while dissipating under 1.2 mW. Figure 11 illustrates the power dissipation at the maximum operating frequency for various supply voltages.
Figure 10. Subband filtering datapath design.
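The two operating points quoted above can be related with the familiar first-order model of CMOS dynamic power, P ∝ C·V²·f. The sketch below is only a back-of-the-envelope check under that assumption. It ignores leakage, short-circuit current, and I/O loading, and the function name is hypothetical.

```python
# Minimal sketch: scale a measured power figure to a new voltage and frequency,
# assuming power is proportional to V^2 * f with constant switched capacitance.

def scaled_power(p_ref_w, v_ref, f_ref_hz, v, f_hz):
    """Estimate power at (v, f_hz) from a reference point (v_ref, f_ref_hz)."""
    return p_ref_w * (v / v_ref) ** 2 * (f_hz / f_ref_hz)


# Reference: 1.2 W at 5 V and 120 MHz. Target: 1 V and 3.2 MHz.
estimate_w = scaled_power(1.2, 5.0, 120e6, 1.0, 3.2e6)
print(f"{estimate_w * 1e3:.2f} mW")  # 1.28 mW
```

The estimate of about 1.28 mW is close to the reported figure of under 1.2 mW. The factor of 25 from the supply voltage alone shows why excess throughput is worth trading for a lower voltage.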
Multiresolution Solution -This decoder chip was designed using a modular approach and therefore supports higher-resolution images as well. For higher-resolution images, multiple chips would be cascaded, each operating on a slice at most 256 pixels wide, producing a final image without boundary artifacts. Table 2 illustrates the power dissipation of multiple subband decoder chips at the required clock frequency when used for decompressing high-resolution images.
To maintain low power consumption for high-resolution displays, based on the design principles outlined in the second section, parallel chips running at the lowest possible clock rate should be used. The number of chips is determined by the number of pixels in the width of the display device. Once the number of chips is determined, the lowest possible clock rate can be calculated based on the number of pixels in the height of the display device. Given the lowest possible clock speed, the operating voltage is determined. This additional chip-level parallelism keeps the operating frequency, and thus the supply voltage, low, resulting in extremely low power dissipation even for HDTV applications.
A Low-Power Video-Rate PVQ Decoder
The block diagram of the low-power PVQ decoder is shown in Fig. 12, which divides the PVQ decoding procedure into four processing blocks. The stream parser parses incoming 16-bit words into PVQ indices and scaling factors using a series of two 32-bit-wide barrel shifters. The various word lengths of indices and scales are stored in a ROM. The index predecoder (16-bit datapath) decodes each index into the four properties described earlier. The vector decoder (16-bit datapath) generates a data vector based on the vector properties by iteratively comparing and subtracting these indices from precomputed combinatorial offsets stored in a ROM. Finally, a 6 x 8-bit pipelined multiplier performs the final scaling of each decoded vector element using the scaling factor sent with each PVQ index. The multiplier comprises a 4-2 adder tree with a carry-propagate adder. FIFO buffers separate the four processing blocks and regulate data flow between them.
In addition to operating at a low supply voltage, the PVQ decoder chip employs several key architectural strategies to minimize its power consumption.
Parallel Processing -First, we achieve maximum throughput using parallelism and pipelining so that excess performance can be traded off for lower power consumption through voltage scaling. The chip's critical path, found in the vector decoder, is optimized to increase throughput. This allows for lower-voltage operation and reduces the output buffering required to meet real-time constraints. Unlike the direct algorithm implementation, which uses a linear search to locate the correct index offset, improved throughput is achieved by searching and processing a block of four combinatorial offsets at a time. For typical image data, this reduces the average number of search iterations from 15 to three, and halves the number of processing cycles and amount of output buffering.
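The difference between a linear search and a search over blocks of four offsets can be sketched as follows. The offset table here is hypothetical and the decoder's real comparison logic is not described in this detail, so the sketch only illustrates how the iteration count falls when four comparators work in parallel.

```python
# Minimal sketch: find the largest table offset that does not exceed an index.
# OFFSETS is a made-up, ascending table used only for illustration.
OFFSETS = [0, 1, 5, 15, 35, 70, 126, 210, 330, 495, 715, 1001]


def linear_search(index, offsets):
    """Examine one offset per iteration. Returns (position, iterations)."""
    position, iterations = 0, 0
    for i, offset in enumerate(offsets):
        iterations += 1
        if offset > index:
            break
        position = i
    return position, iterations


def block_search(index, offsets, block=4):
    """Examine `block` offsets per iteration, as parallel comparators would."""
    position, iterations = 0, 0
    for start in range(0, len(offsets), block):
        iterations += 1
        chunk = offsets[start:start + block]
        hits = sum(offset <= index for offset in chunk)
        if hits:
            position = start + hits - 1
        if hits < len(chunk):
            break  # the match lies in this block, so stop searching
    return position, iterations
```

For an index of 400, both searches return position 8. The linear version takes 10 iterations and the block version takes 3.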
The chip incorporates four independent processing blocks, each individually pipelined. Because the PVQ decoding algorithm is inherently nondeterministic (i.e., the number of steps to decode an index depends on vector data), the latency in the index predecoder and vector decoder units is also nondeterministic. Dividing the chip into four processors separates the dependencies between the various blocks and maximizes the chip throughput. Each processor is separated by FIFOs and only continues when its input FIFO is not empty and its output FIFO is not full. When idle, each processor enters a standby mode to save power.
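The firing rule described above (proceed only when the input FIFO is not empty and the output FIFO is not full) can be illustrated with a toy cycle-level simulation. The sketch models two stages joined by one bounded FIFO rather than the chip's four processors, and the per-item cycle cost stands in for the data-dependent decoding latency. All names are hypothetical.

```python
# Minimal sketch: two stages decoupled by a bounded FIFO, with standby counting.
from collections import deque


def simulate(source, cost, depth):
    """Return (outputs, cycles, producer_standby, consumer_standby)."""
    pending = deque(source)   # items the producer has yet to emit
    fifo = deque()            # bounded FIFO between the two stages
    outputs = []
    current, busy = None, 0   # item in the consumer and its remaining cycles
    cycles = producer_standby = consumer_standby = 0

    while pending or fifo or current is not None:
        cycles += 1
        # Consumer: start a new item only if its input FIFO is not empty.
        if current is None and fifo:
            current = fifo.popleft()
            busy = cost(current)          # data-dependent latency, at least 1
        if current is not None:
            busy -= 1
            if busy == 0:
                outputs.append(current)
                current = None
        else:
            consumer_standby += 1         # nothing to do: stage stays gated
        # Producer: emit only if its output FIFO is not full.
        if pending:
            if len(fifo) < depth:
                fifo.append(pending.popleft())
            else:
                producer_standby += 1     # stalled: stage stays gated
    return outputs, cycles, producer_standby, consumer_standby
```

A call such as `simulate([3, 1, 2], cost=lambda item: item, depth=1)` preserves the item order while both stages spend some cycles in standby. In hardware, those are the cycles in which the stage's clock can be gated.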
Power-Efficient FIFO Design -The FIFO buffering regulates clock gating of each processor, helps smooth out otherwise erratic data flow between processors, and guarantees constant data flow at the chip output, which is required by the subband decoder chip that follows. In order to guarantee constant data flow at the chip output, behavioral simulation using Verilog was performed with worst-case input data to carefully determine the FIFO sizes.
Figure 12. Architecture of the PVQ decoder chip.
The chip incorporates an energy-efficient register-based FIFO design which uses a pointer-based scheme. When the FIFO is accessed, only the pointer, stored in a single-bit shift register, moves -not the data. Compared to other register-based FIFO designs, where significant power can be consumed shifting data between registers, this scheme minimizes power by minimizing data switching. Additionally, internal clocking within the FIFO is turned off when the FIFO is idle. This design performs simultaneous read and write, which allows each processor to operate entirely independently of the other processors.
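A software analogue of the pointer-based scheme is sketched below. It assumes that the "single-bit shift register" holds a one-hot token marking the active slot, which is one reading of the description. The class and method names are hypothetical.

```python
# Minimal sketch of a pointer-based FIFO: stored words never move between
# slots, and only a one-hot pointer rotates on each access (assumed reading).

class PointerFifo:
    def __init__(self, depth):
        self.slots = [None] * depth
        self.write_ptr = [1] + [0] * (depth - 1)   # one-hot write token
        self.read_ptr = [1] + [0] * (depth - 1)    # one-hot read token
        self.count = 0

    @staticmethod
    def _advance(ptr):
        """Rotate the one-hot token by one position, wrapping around."""
        return [ptr[-1]] + ptr[:-1]

    def push(self, value):
        if self.count == len(self.slots):
            raise OverflowError("FIFO full")
        self.slots[self.write_ptr.index(1)] = value
        self.write_ptr = self._advance(self.write_ptr)
        self.count += 1

    def pop(self):
        if self.count == 0:
            raise IndexError("FIFO empty")
        value = self.slots[self.read_ptr.index(1)]
        self.read_ptr = self._advance(self.read_ptr)
        self.count -= 1
        return value
```

After pushing 1, 2, 3 into a three-deep FIFO, popping once, and pushing 4, the slots hold `[4, 2, 3]`. The words 2 and 3 never changed position, which is the property that keeps data switching low.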
Clocking Methodology -Because clocking makes up a significant portion of the total chip power, careful design of the clock methodology, distribution, and gating was performed. The PVQ chip utilizes a single-phase clocking scheme. The basic register cell used throughout the design is a variation of that proposed by Larson and Svensson [37]. The clock distribution is a standard tree configuration with a central clock buffer chain located at the center of the chip, driving the global chip clock to each processor datapath and control.
More than half the chip's total clock capacitance lies in gated clocks. With the exception of the vector decoder, whose idle time is less than 10 percent, the other processing units are typically idle 50-60 percent of the time. This high idle time probability leads to savings in total clocking power by a factor of 2.
External Accesses and Zero Runlength Encoding -Combinatorial calculation and index offsets, required for the vector decoding unit, were stored on-chip in low-power ROMs. In addition, parsing information for subband bit allocation was also stored in this ROM. This scheme essentially trades off greater programmability for fewer external accesses to off-chip memory.
In addition, output data vectors that contain large numbers of zeros, commonly found in high-frequency subband data, are zero runlength encoded to losslessly compress the representation of consecutive zeros. This encoding reduces the amount of external chip accesses by a factor of 3, output buffering by a factor of 3, and the number of internal buffer accesses by up to a factor of 10. Here, we traded off additional control complexity to perform zero runlength encoding for lower I/O power and on-chip data buffering.
Measured Performance -The PVQ decoder chip must meet the throughput requirement of real-time video decoding at 176 x 240 pixels/frame at 30 frames/s (1.27 Mpixels/s). From this performance figure we calculated the maximum delay allowed at the lowest power consumption. A clock frequency of 6.4 MHz was found to be adequate, although a PVQ decoder chip achieving a maximum clock frequency of 18 MHz at 1.5 V has been designed in 0.8 µm CMOS technology.
The PVQ decoder chip occupies a chip area of 9.7 x 13 mm² (with an active area of 74.9 mm²). Because the PVQ decoder chip needs to operate at twice the frequency of the subband decoder chip, it must run at 6.4 MHz to meet the video-rate throughput of 1.27 Mpixels/s, powered by a 1.35 V supply. At this supply voltage, the chip performs real-time video decoding at a peak computation rate of 21 million operations per second (MOPS) and dissipates 6.7 mW (with an output load of approximately 4 pF/pin). Figure 13 shows the measured power at the maximum frequencies for various supply voltages.
The PVQ decoder consumes roughly five times the power of the subband decoder chip. The doubled clock frequency of the PVQ decoder explains much of this difference. A third of the chip power is consumed by the FIFOs with approximately 55 percent of this power dissipated on local clock lines. Using a memory-based design instead of register-based FIFOs would have helped reduce this power as well as area.
It is also interesting to note that the datapath power roughly equals the control power. There are several reasons for this:
• The datapaths use local gated clocks, while the control sections do not -a significant factor in that each processing unit may be idle up to 60 percent of the time.
• The control sections require greater complexity due to the nondeterministic nature of the PVQ decoding algorithm and the data reordering scheme.
Because of the relatively large design, extensive global clock routing results in significant power consumption. Overall, the total power consumed by clocking, including global clocking and local clocking in the control, datapaths, and FIFOs, makes up about 50 percent of the chip power. Finally, the power consumed in external accesses is a relatively small fraction, approximately 30 percent, because the design limits off-chip accesses to only compressed data at both the input and output pins.
Portable Video Decoder
Figure 14 shows the block diagram of the overall video decoder design. The subband/PVQ-encoded bitstream is transmitted through a wireless link with a bandwidth of 500 kb/s-1 Mb/s. The received bitstream is decoded by the PVQ decoder chip into subband coefficients. The subband decoder chip then filters these coefficients and reconstructs them into image pixels, each represented by three digital RGB color components. The PVQ and subband decoder chips form our decoder chipset. The digital RGB color components are converted to an analog format by a DAC suitable for a color LC display. Our design is a current-switching DAC based on a current-mirror configuration with a bias current of 5 µA. The transistors in the current mirror were sized for digital inputs at extremely low voltages, delivering a resolution of 6 b/pixel, an adequate resolution for the 4 inch color display. The DAC generates analog current outputs that can be converted to the required dynamic range through gain-control opamps. The maximum frequency of the DAC runs up to 10 MHz at a supply voltage of 1.5 V, and each conversion operation consumes on average 42 pJ of energy per color component.
Portable Video-on-Demand Prototype
As shown in Fig. 15 , the complete video-on-demand system consists of a portable decoder with a color LC display that decodes and displays compressed video sequences, a radio transmitter and receiver, and an encoding base station implemented by a DSP multiprocessor board. Wireless data transmission is provided by three pairs of direct-sequence spread spectrum radio transceivers manufactured by Proxim, delivering a raw data rate at 720 kb/s. The decoding chipset on the portable decoder receives compressed video data and decompresses them to RGB color components, which are then converted to analog signals for the color display. The display is a 4 inch color thin-film-transistor active matrix display with a resolution of 160 pixels x 234 lines.
Conclusions
The design of low-power electronics systems, especially portable systems using wireless communication, requires vertical integration of the design process at all levels, from algorithm development to system architecture to circuit layout. System performance need not be sacrificed for lower power consumption if the design of algorithms and hardware can be considered concurrently.
At the algorithm level we developed coding techniques that reduce the susceptibility of image quality to channel noise by up to 3 dB over existing methods for no additional coding overhead. These error-resilient PVQ techniques, when combined with subband coding, deliver better compression performance than the JPEG image compression standard, and with much greater robustness as well. Furthermore, the proposed subband/PVQ coding scheme allows an extremely low-power implementation for both encoder and decoder design, achieving the goals of error resilience, high compression efficiency, and low power consumption simultaneously.
From designing this portable video-on-demand system, we learned that power reduction can be best attained through algorithmic and architectural innovations, guided by the knowledge of underlying hardware and circuit properties. This hardware-driven algorithm design approach is key to the design of future portable systems under stringent power budgets.
Figure 15. The portable video-on-demand system.
