ABSTRACT power wireless video systems through the use of two examples, a single-chip digital video camera and a wireless video-on-demand system. The discussion will focus on the architectural and circuit techniques developed specifically for silicon integration of high-performance low-power wireless video systems. The proposed single-chip digital camera incorporates a parallel architecture to perform MPEG-2 encoding in real time, while the video-on-demand system employs an error-resilient compression algorithm to guard against the transmission errors often encountered in wireless communication. Both wireless video systems, one for encoding and the other for decoding, dissipate only tens of milliwatts of power, achieving a power reduction two orders of magnitude below standard solutions.
A growing number of computer systems are incorporating multimedia capabilities for displaying and manipulating video data. However, our present ability to work with video has been confined to a wired environment, requiring both the video encoder and decoder to be physically connected to a power supply and a wired communication link.
This interest in video, combined with the popularity of portable devices, provides the impetus for designing portable wireless video systems.
One of the primary objectives in the design of portable wireless systems is power reduction. The increased density, ultra-low-voltage operation, and reduced logic delay of scaled complementary metal oxide semiconductor (CMOS) technologies can be coupled with an architectural design strategy to dramatically reduce the power requirements. It has been shown that, in standard CMOS technology, a power efficiency of 100 billion operations per second per watt (BOPS/W) can be attained through the use of both architectural and circuit techniques [1]. Furthermore, supply voltage can be dynamically adjusted according to the computation demand in real time, delivering the absolute minimum amount of energy for any specific task [2]. Based on these developments in low-power signal processing, we believe that advanced signal processing techniques can be applied to improving the performance of wireless video systems without incurring a power penalty.
Today, compression standards, such as Joint Photographic Experts Group (JPEG) for still images and Moving Picture Experts Group (MPEG) for video, dominate commercial systems. While offering good compression performance, these standards were not designed for portable wireless applications. Consequently, the video systems based on these standards often require large hardware and power-expensive implementations, and therefore have not been widely used in portable wireless applications. Our first wireless video example, a single-chip digital video camera, attempts to quantify the minimal power necessary to encode high-resolution video in real time using the standard MPEG-2 algorithm.
Transmitting compressed video signals over wireless channels imposes its own unique set of requirements. Compression efficiency is required because of limited channel bandwidth, and wireless transmission implies the need for error resiliency, since radio links can experience severe channel distortion with bit error rates (BERs) up to 10^-2 or higher. Also, battery-operated portable devices require processing hardware that consumes low power. Our second wireless video example, a portable video-on-demand system, illustrates an integrated design approach that combines algorithmic, architectural, and circuit techniques to achieve the goals of high compression performance, error resiliency, and low-power implementation simultaneously.
This article is organized as follows. The following section discusses low-power design principles. The section after that describes the architectural design of a digital video camera chip with real-time MPEG-2 encoding capability. The article next offers the low-power architectural strategies employed in designing a video-on-demand decoder chip-set. Finally, we offer our conclusions and outlook for the future.
LOW-POWER DESIGN PRINCIPLES
Even with the most advanced wireless link connected to the network backbone, there remains considerable signal processing computation in the implementation of real-time video compression/decompression. To obtain the lowest-power solution, optimization over all aspects of the system implementation must be employed, including the algorithms, architectures, circuit design, and fabrication technology [3].
SYSTEM DESIGN PRINCIPLES
The main contribution to power consumption in CMOS circuits is attributed to the charging and discharging of parasitic capacitors that occurs during logical transitions. The average switching energy of a CMOS gate (or the power-delay product) is given by the following equation:
$$E_{switch} = C_{avg} \cdot V_{dd}^2$$
where $C_{avg}$ is the average capacitance being switched per clock and $V_{dd}$ is the supply voltage. The quadratic dependence of energy on voltage makes it clear that operating at the lowest possible voltage is most desirable for minimizing the energy per computation; unfortunately, reducing the supply voltage increases gate delays and therefore comes at the cost of a reduction in computational throughput.
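As a rough numerical illustration of this quadratic dependence, the sketch below evaluates the switching energy (capacitance times supply voltage squared) at a few supply voltages. The 1 pF capacitance and the function name are illustrative assumptions, not values taken from the designs discussed in this article; only the ratios are meaningful.

```python
def switching_energy(c_avg_farads, vdd_volts):
    """Average switching energy per clock, E = C_avg * Vdd^2, in joules."""
    return c_avg_farads * vdd_volts ** 2

# Hypothetical 1 pF of switched capacitance; only the ratios matter here.
C_AVG = 1e-12

reference = switching_energy(C_AVG, 5.0)
for vdd in (5.0, 3.3, 1.5, 1.0):
    energy = switching_energy(C_AVG, vdd)
    print(f"Vdd = {vdd:3.1f} V: {energy * 1e12:5.2f} pJ "
          f"({reference / energy:4.1f}x below the 5 V case)")
```

Going from 5 V to 1 V cuts the energy per switching event by a factor of 25, which is why the architectural techniques that follow concentrate on making low-voltage operation feasible.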
One way to compensate for these increased delays is to use architectures that reduce the speed requirements of operations while keeping throughput constant. In signal processing applications, as distinct from general-purpose computing, it is throughput that is the design constraint instead of attempting to compute as fast as possible. For example, in the case of video decompression, there would be no advantage in decompressing faster than the 30 ms display update rate. One architectural approach for maintaining throughput with slower circuitry is to use parallelism through hardware duplication. By using identical units in parallel, the speed requirements on each unit are reduced, allowing for a reduction in voltage.
Table 1. Energy per operation at a 1.5 V supply in 0.8 µm CMOS technology.
For example, consider a datapath module running at a frequency of f while switching an average capacitance C at a supply voltage of 5 V. Parallelizing by a factor of two will result in two units working at half the clock frequency while maintaining the original throughput. Since the speed requirements on each module have been lowered by a factor of two, the supply voltage can be lowered until the gate delays increase by a factor of two, which corresponds to a voltage of 2.9 V from a simple first-order theory [3]. By allowing the circuit to operate at half the speed, while maintaining the same system throughput, the power of the overall system then becomes
$$P_{parallel} = (2C) \cdot (2.9\ \mathrm{V})^2 \cdot \frac{f}{2} \approx 0.34 \cdot C \cdot (5\ \mathrm{V})^2 \cdot f$$
which is approximately three times lower than the original implementation with one unit. Likewise, pipelining is a similar technique that can be exploited to recoup speed loss at lowered supply voltages, thus enabling a significant reduction in power. These system design strategies make it possible to trade off silicon area for reduced power consumption, certainly reasonable for portable devices, as improvements in technology place availability of transistors secondary to power dissipation as a concern. In the next section, this approach will be used to reduce the encoding power in a single-chip digital video camera design.
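The two-way parallel example can be checked numerically. The sketch below is a minimal model under the assumption that duplicating the hardware exactly doubles the switched capacitance, with no routing or multiplexing overhead; the function names and the normalized units are illustrative.

```python
def dynamic_power(capacitance, vdd, frequency):
    """Dynamic CMOS power, P = C * Vdd^2 * f (arbitrary consistent units)."""
    return capacitance * vdd ** 2 * frequency

def parallel_power(capacitance, vdd, frequency, n_units):
    """Power of n identical units, each clocked at f / n.

    Ignores routing and multiplexing overhead (an assumption of this
    sketch), so the total switched capacitance is simply n * C.
    """
    return dynamic_power(n_units * capacitance, vdd, frequency / n_units)

C, F = 1.0, 1.0  # normalized capacitance and clock frequency
p_single = dynamic_power(C, 5.0, F)
p_dual = parallel_power(C, 2.9, F, n_units=2)
print(f"power ratio: {p_dual / p_single:.3f}")
print(f"reduction:   {p_single / p_dual:.2f}x")
```

With the voltages quoted above, the ratio comes out near 0.34, consistent with the roughly threefold reduction stated in the text. Any real overhead capacitance would reduce the saving somewhat.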
The issue of how far this approach can be taken is interesting to address. First, algorithms must embed sufficient parallelism so that the requirement on throughput can be achieved through parallel processing. Sequential algorithms, such as the Viterbi decoding used in decoding convolutional codes and eliminating intersymbol interference in data communication, will need to be reformulated for a low-power implementation [1].
One limit is reached when the noise margin of the devices is degraded at extremely low supply voltages, although the slower clock rates may in turn reduce the ground bounce problems. However, before this limit is reached, the overhead in increased parallelism to compensate for the reduced logic speed begins to dominate, increasing the power faster than the voltage reduction decreases it. Through a number of examples, the optimum supply voltage was found to occur in the range between 1 and 1.5 V, resulting in an order of magnitude lower power dissipation than the conventional "low-power" 3.3 V scaling [4].
With the understanding of the dependency of energy consumption on the supply voltage, the next step is to develop low-power circuit techniques so that architectural trade-offs can be made based on the power dissipation of different types of circuit operations.
ARCHITECTURAL DESIGN PRINCIPLES
In our low-power cell library design, the static CMOS logic style was chosen for its reliability and relatively good noise immunity. Table 1 summarizes the HSPICE simulations of the energy consumed per operation in our cell library, which gives us a guideline for eliminating some operations in favor of others in designing low-power signal processing architectures. For example, an SRAM write access consumes nine times, an off-chip I/O access ten times, and a multiplication approximately four times more energy than a simple add operation. Given that it is possible to implement an algorithm in many different ways, the motivation is to replace memory accesses with computation, since computation dissipates less energy per operation. This trade-off is crucial to architectural design, as will be evident in the next two sections.
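The memory-versus-computation trade-off can be made concrete with a small tally in units of one add. The relative costs below are the ratios quoted in the text; the two operation mixes are made-up examples, not measurements from either chip.

```python
# Energy per operation relative to one add, as quoted in the text above.
RELATIVE_ENERGY = {
    "add": 1.0,
    "multiply": 4.0,      # "approximately four times"
    "sram_write": 9.0,
    "offchip_io": 10.0,
}

def relative_energy(op_counts):
    """Total energy of an operation mix, in units of one add."""
    return sum(RELATIVE_ENERGY[op] * n for op, n in op_counts.items())

# Hypothetical per-output operation mixes for two ways of producing
# the same result; the counts are invented for illustration.
store_intermediate = {"multiply": 1, "sram_write": 1}
recompute_on_the_fly = {"multiply": 1, "add": 3}

print(relative_energy(store_intermediate))    # 13.0
print(relative_energy(recompute_on_the_fly))  # 7.0
```

In this invented case, three extra adds cost less than the single SRAM write they replace, which is the kind of substitution the cell-library figures are meant to guide.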
The relatively large energy dissipated by off-chip I/O accesses is the motivation for minimizing external data communication. In our design, we eliminated external frame buffers and placed memory on-chip whenever possible.
Because high-quality video implies high pixel bandwidth, with at least millions of pixel operations per second, external frame buffers lead to huge energy waste and overwhelm all other low-power improvements made. Requiring no off-chip memory support is one of the main factors enabling wireless video systems to deliver high-quality video at a minimal power level.
A SINGLE-CHIP DIGITAL CAMERA
The integration of sensors, signal processing, DRAM, and RF devices on a single die has spawned recent advances in CMOS technology and circuit innovations. A single-chip digital camera would require careful integration of all the devices mentioned above. Given the recent development in CMOS RF transceiver design [6, 7], wireless transmission at a bandwidth in excess of 10 Mb/s will soon become possible using next-generation CMOS technology. To make wireless video a reality, real-time low-power video compression is the key. In this section we will discuss the design of a low-power large-scale parallel MPEG-2 encoder architecture to be used in a single-chip digital CMOS video camera. Digital cameras video applications, and wide-area surveillance systems. The system requirements are real-time video compression to reduce data bandwidth and low power consumption for extended battery life. The recent development of CMOS sensors makes CMOS a candidate technology for manufacturing such systems at low cost.
The single-chip digital camera architecture proposed in this article includes a 640 × 480 array of CMOS photo diodes, each pixel having an 8-bit dynamic range per color plane (assuming separate color filters), embedded DRAM for storing four frames of color data, and parallel array processors for video signal processing. The parallel processor architecture is designed to implement highly computationally intensive image and video processing tasks such as color conversion, discrete cosine transform (DCT), and motion estimation for MPEG-2 [8].
The proposed architecture exploits the parallelism inherent in video processing algorithms, the small dynamic range used by existing video compression algorithms, CMOS sensor technology [9], and embedded DRAM technology to realize a lower-power single-chip solution for low-cost video capturing. The acquired video data is stored directly in the on-chip embedded DRAM, which serves as a high-bandwidth video frame buffer. The bandwidth of embedded DRAM can be as high as 8 Gbytes/s, making it possible to support parallel video processors. Figure 1 outlines the layout of the CMOS photo sensors, embedded DRAM, and parallel video processors. The CMOS photo sensors reside on the surface of the silicon and consist of photo diodes, pixel analog-to-digital (A/D) converters, and A/D offset correction circuitry. The embedded DRAM resides either under or beside the photo diodes and provides storage for the current and two past frames of captured images, as well as intermediate variables such as motion vectors (MVs) and multiresolution pixel values. The parallel video processors are located next to the imaging circuitry, and each operates independently on several columns of 480-pixel-wide video data.
DESIGN CONSIDERATIONS
A parallel architecture has the advantage of supporting high computational throughput at low clock rates when executing highly repetitive operations [3]. It is less efficient when operating on more complex algorithms that require access to data outside of the processor domain, defined as the number of columns of pixels each processor serves. The size of the processor domain is therefore an important design parameter, which requires careful examination of the types of video processing algorithms to be used in this application.
The proposed architecture considers three algorithms commonly used in video coding standards: conversion from red-green-blue (RGB) to luminance-chrominance (YUV) color space, discrete cosine transform (DCT), and motion estimation. RGB-to-YUV conversion is performed on the pixel level and requires no additional information from neighboring pixels.
It is computationally intensive, requiring multiple multiplies and adds per pixel, but can easily be achieved with a parallel architecture. DCT, on the other hand, is performed on a block basis. It operates on a row or column of pixels in each pass and requires bit-reverse or base offset addressing to simplify the instruction set. Implementing DCT with pixel-level processors would be unnecessarily complicated.
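The contrast between the two operations can be sketched in a few lines. The color conversion below uses the common ITU-R BT.601 weights, which are an assumption here since the article does not state its coefficients, and the DCT is the direct textbook form applied as a row pass followed by a column pass, not the fast algorithm a hardware implementation would use.

```python
import math

def rgb_to_yuv(r, g, b):
    """Per-pixel conversion; needs no neighboring pixels.

    Coefficients are the common ITU-R BT.601 weights,
    assumed here for illustration.
    """
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = 0.492 * (b - y)
    v = 0.877 * (r - y)
    return y, u, v

def dct_1d(samples):
    """Orthonormal DCT-II of one row or column (direct O(N^2) form)."""
    n = len(samples)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        acc = sum(x * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                  for i, x in enumerate(samples))
        out.append(scale * acc)
    return out

def dct_2d(block):
    """Block DCT as a row pass followed by a column pass."""
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]

# A flat 8 x 8 block has all of its energy in the DC coefficient.
flat = [[128] * 8 for _ in range(8)]
coeffs = dct_2d(flat)
print(round(coeffs[0][0], 3), round(abs(coeffs[3][5]), 3))
```

The conversion touches one pixel at a time, whereas each DCT pass needs a whole row or column of the block, which is why a block-level processor domain suits it better than pixel-level processors.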
Similar to DCT, motion estimation works best with block-level processors. Unlike DCT, in which processing variables are confined within a block, motion estimation requires access to adjacent blocks regardless of the size of the processor domain. The extent of the locality of interprocessor communication depends on the search space. Furthermore, certain motion estimation algorithms do not require any multiplication other than simple shifts. They do, however, require that a mean absolute norm be calculated and accumulated in one clock cycle for optimal performance.
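A plain exhaustive block-matching search shows why motion estimation reaches into neighboring blocks and why it needs only subtract, absolute-value, and accumulate operations. This sketch uses the sum of absolute differences (proportional to the mean absolute norm) and a full search; it is a generic illustration, not the faster algorithm adopted for the camera, and the frame contents are synthetic.

```python
import random

def sad(frame_a, ax, ay, frame_b, bx, by, size):
    """Sum of absolute differences between two size x size blocks."""
    total = 0
    for dy in range(size):
        for dx in range(size):
            total += abs(frame_a[ay + dy][ax + dx] - frame_b[by + dy][bx + dx])
    return total

def full_search(current, reference, x, y, size, search_range):
    """Best (dx, dy) for the block at (x, y), by exhaustive search."""
    height, width = len(reference), len(reference[0])
    best_vector, best_cost = (0, 0), None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = x + dx, y + dy
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue  # candidate block falls outside the frame
            cost = sad(current, x, y, reference, rx, ry, size)
            if best_cost is None or cost < best_cost:
                best_vector, best_cost = (dx, dy), cost
    return best_vector, best_cost

# Synthetic test: copy a 4 x 4 block of the reference frame from
# position (8, 5) to position (6, 6) of the current frame.
rng = random.Random(0)
reference = [[rng.randrange(256) for _ in range(16)] for _ in range(16)]
current = [[0] * 16 for _ in range(16)]
for row in range(4):
    for col in range(4):
        current[6 + row][6 + col] = reference[5 + row][8 + col]

vector, cost = full_search(current, reference, x=6, y=6, size=4, search_range=3)
print(vector, cost)
```

The candidate positions extend `search_range` pixels beyond the block in every direction, so a processor whose domain is only one block wide must fetch reference pixels from its neighbors.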
These algorithmic constraints place certain requirements on the design of the parallel processor architecture. On one hand, to reduce power consumption, as many parallel processors as practically feasible should be used to reduce the clock frequency, because a reduced clock frequency implies a lower supply voltage and hence lower energy per operation. On the other hand, if too fine a granularity is used, most of the processor operations will be used for interprocessor communications, wasting energy on data transfers rather than on data computation.
For MPEG-2 encoding, the computational demand required for motion estimation (1.6 BOPS for 30 frames/s based on the algorithm proposed by Chalidabhongse and Kuo [10]) limits the number of columns in each processor domain to 16, because otherwise the required clock speed for each processor would be too high for a low-power design.
In order to sustain this computational demand, each processor is required to run at a clock frequency equal to or higher than 40 MHz. This clock frequency in turn determines the necessary supply voltage for a given technology. When implemented in a 0.2 µm CMOS technology, a 1 V supply voltage should be more than enough to support 40 MHz operation. Under these conditions, this parallel processor architecture delivers a processing performance of 1.6 BOPS with a power consumption of 40 mW, thus providing a power efficiency of 40 BOPS/W, comparable to a recent design from DEC [11] with properly scaled technology and supply voltages.
In the physical layout, each CMOS photo diode has a dimension of 10 µm × 10 µm [9].
With 16 columns of pixels per processor, each processor is limited to a width of 160 µm. This limits the datapath to 36 bits for the arithmetic unit, assuming that the processors are staggered so that certain processing units in the datapath can be made wider.
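The sizing figures in this subsection follow from a few lines of arithmetic, reproduced below. The constants are the values quoted in the text; reading the per-processor operation rate directly as a clock frequency assumes roughly one operation per cycle, which is an assumption of this sketch.

```python
COLUMNS = 640                 # sensor width in pixels
COLUMNS_PER_PROCESSOR = 16    # processor domain
TOTAL_OPS_PER_S = 1.6e9       # motion-estimation demand, 1.6 BOPS
TOTAL_POWER_W = 40e-3         # 40 mW for the processor array
PIXEL_PITCH_UM = 10           # photo diode width in micrometers

processors = COLUMNS // COLUMNS_PER_PROCESSOR
ops_per_processor = TOTAL_OPS_PER_S / processors
efficiency_bops_per_w = TOTAL_OPS_PER_S / 1e9 / TOTAL_POWER_W
processor_width_um = COLUMNS_PER_PROCESSOR * PIXEL_PITCH_UM

print(f"processors:          {processors}")
print(f"ops per processor:   {ops_per_processor / 1e6:.0f} MOPS")
print(f"power efficiency:    {efficiency_bops_per_w:.0f} BOPS/W")
print(f"processor width:     {processor_width_um} um")
```

Halving the domain to 8 columns would double the processor count and halve the per-processor rate, but it would also halve the layout width available to each datapath and increase traffic between neighbors.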
Technology-wise, although the embedded DRAM can sustain high memory throughput via large data buses (e.g., 64 bits), the access time of the embedded DRAM is still twice as long as the required cycle time. A direct memory access (DMA) unit is introduced to serve as an interface between the DRAM and the local memory units. In addition, the DMA unit may communicate with adjacent processors to access pixel data outside the processor domain.
Table 2. Cycle count per processor per frame.
PARALLEL ARCHITECTURE
The proposed architecture, as shown in Fig. 2, is designed to achieve the following three goals simultaneously: realize the image/video processing algorithms, minimize DMA accesses to the pixel DRAM, and maximize computational throughput while keeping power consumption at a minimal level. Minimizing DMA access to the pixel memory is crucial to reduce not only power consumption, but also instruction overhead incurred with access latencies. The proposed processor architecture contains a DMA, a 288-byte block-visible RAM, a 36-byte auxiliary RAM, a 32-word register file, an ALU, an interprocessor communication unit, an external I/O buffer, and a processor control unit. The DMA unit is the primary interface between the parallel processor's local memories (i.e., auxiliary RAM and block-visible RAM) and the embedded pixel DRAM. It is also the primary mechanism for interprocessor data transfer. The DMA separates the task of pixel memory access from the parallel processor so that DRAM access latencies do not stall program execution. The DMA also supports memory access requests to and from the pixel DRAMs that lie within two processor domains.
The block-visible RAM is used to provide temporary storage for a block of up to 16 × 16 pixels of 9-bit-wide data for motion estimation and 8 × 8 pixels of 18-bit-wide data for IDCT to comply with the IEEE error specifications [8]. Special addressing modes such as bit reversal, base-offset, auto increment, and modulo operations are needed for DCT and motion estimation. Interprocessor communication circuitry is needed to access data between processor domains and to communicate domain-specific information such as MVs and reference blocks during the block search.
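Two of these addressing modes are easy to illustrate in software. The helper names below are hypothetical, and the wrap-around behavior of the modulo mode is an assumption about how such a mode typically works rather than a description of this processor's instruction set.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result

def modulo_address(base, offset, length):
    """Base-offset address that wraps within a buffer of `length` words."""
    return base + (offset % length)

# Access order for an 8-point transform with bit-reversed addressing.
print([bit_reverse(i, 3) for i in range(8)])  # [0, 4, 2, 6, 1, 5, 3, 7]

# Stepping past the end of a 16-word buffer at base 32 wraps to its start.
print([modulo_address(32, i, 16) for i in (14, 15, 16, 17)])  # [46, 47, 32, 33]
```

Providing these patterns in the address generator lets fast transform and block-search loops run without spending ALU cycles on index arithmetic.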
The auxiliary memory is a 4 × 8 × 9-bit SRAM used to provide a second pixel buffer for operations that involve two blocks of pixels (i.e., block matching). It provides a second path to the ALU for optimal computational efficiency. The auxiliary memory also serves as a gateway between the processor and the external I/O. Data in this memory can be transferred to the external I/O buffer, which communicates with the I/O pins.
To complement the 9-bit local SRAM unit, a 32-word 18-bit register file is available. The register file provides a fast, higher-precision, lower-power workable memory space. The register file has two data paths to the ALU, allowing most operations to be confined to the ALU and the register file. It is large enough that it can also store both lookup coefficients (e.g., DCT coefficients) and system variables.
The ALU has limited complexity due to the constraints on area and power. The ALU is implemented with a 36-bit carry select adder, an 8-bit subtractor, a conditional signed negation unit (for calculating absolute values), an 18 × 18-bit multiplier, a bit manipulation logic unit, a shifter, a T register, and a 36-bit accumulator. Operations involving addition, shifting, and bit manipulations can be executed in one cycle. The calculation of motion vectors involves the 8-bit subtractor, the conditional signed negation unit, and the adder.
These operations are pipelined in two stages so that one subtract-absolute-accumulate (SAA) instruction can be executed every cycle. The T register is used in conjunction with the SAA instruction, primarily for algorithmic power reduction. Finally, a hardware multiplier is implemented to perform the DCT and IDCT efficiently.
PERFORMANCE
Each individual parallel processor consumes less than 1 mW of power at a clock rate of 40 MHz, amounting to approximately 40 mW of total power consumption for the whole processor array. An estimated cycle count per processor per frame needed for each encoding and decoding step is provided in Table 2. The cycle counts for encoding I, P, and B frames are also listed in the table, which are calculated from an appropriate combination of the cycle counts of individual encoding/decoding steps. The computational load of IBBPBBPBB MPEG-2 encoding at 30 frames/s is estimated to be 35 MOPS for each processor. The utilization of the functional units is approximately 40 percent for the adder, 6 percent for the multiplier, 50 percent for the SAA unit, and 4 percent for DRAM memory accesses. Each processor is estimated to occupy an area of approximately 160 µm by 11,000 µm in 0.2 µm CMOS technology.
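These performance figures can be cross-checked with simple arithmetic. The sketch below treats one operation as one clock cycle and uses the 1 mW upper bound as if it were the actual per-processor power; both are simplifying assumptions, so the derived numbers are rough bounds rather than measured results.

```python
CLOCK_HZ = 40e6
REQUIRED_OPS_PER_S = 35e6     # IBBPBBPBB encoding at 30 frames/s
POWER_PER_PROCESSOR_W = 1e-3  # upper bound quoted in the text
PROCESSORS = 40

# Share of busy cycles attributed to each functional unit, in percent.
utilization = {"adder": 40, "multiplier": 6, "saa": 50, "dram": 4}

cycle_occupancy = REQUIRED_OPS_PER_S / CLOCK_HZ
array_power_w = PROCESSORS * POWER_PER_PROCESSOR_W
energy_per_op_pj = POWER_PER_PROCESSOR_W / REQUIRED_OPS_PER_S * 1e12

print(f"cycle occupancy:       {cycle_occupancy:.1%}")
print(f"array power (bound):   {array_power_w * 1e3:.0f} mW")
print(f"energy per op (bound): {energy_per_op_pj:.1f} pJ")
print(f"utilization total:     {sum(utilization.values())}%")
```

Under these assumptions the encoder uses about seven eighths of the available cycles at 40 MHz, leaving a modest margin for DRAM latency and control overhead.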
By employing a parallel architecture, this MPEG-2 encoder is capable of delivering a performance of 1.6 BOPS at a 40 MHz clock. As a result, the required supply voltage can be reduced to approximately 1 V, achieving a much higher power efficiency as outlined earlier. If more than 40 processors were used in an attempt to further reduce the power consumption (e.g., in the extreme case of having one column of pixels per processor domain), the clock rate could be lowered further. However, the MPEG-2 encoding procedure dictates that efficient interpixel operations have to be accommodated, which favors a design with multiple columns per processor as a compromise between low supply voltages and complicated interprocessor communications.
A PORTABLE VIDEO-ON-DEMAND SYSTEM
Our second wireless video example is a portable video-on-demand system which not only provides a high compression efficiency, but also embeds a high degree of error tolerance in the compression algorithm to guard against transmission errors often encountered in wireless communication. This system can be used as a portable video-viewing device or in a virtual reality system with a head-mounted display, without any wire connection to either the power supply or data network. To achieve extremely low system power, as compared to the previous example where standard algorithms were used, we had to customize the compression algorithm. The compression algorithm used in this video-on-demand system [12] is based on subband decomposition [13] and pyramid vector quantization (PVQ) [14], which performs as well as standard motion-JPEG compression in terms of image quality and compression efficiency, while incurring much less hardware complexity and exhibiting a high degree of error resiliency [15]. The implementation of a real-time video decoder based on this compression algorithm will be the focus of this section.
While decoding 30 frames/s of video, our subband/PVQ decoder is more than 100 times more power-efficient, not counting the power dissipated in accessing off-chip memory necessary in the JPEG decoding operation. Within this factor of 100, a factor of 10 can easily be attained by voltage scaling of the power supply, which can be applied to existing designs as outlined earlier. Reduced supply voltage, however, increases circuit delay. To maintain the same throughput or real-time performance, the hardware must be duplicated, increasing the total chip area by at least the same amount. The fact that our video-rate decoder can be implemented with fewer than 700,000 transistors (including 300,000 transistors for on-chip memory) operating at a power-supply voltage less than 1.5 V, and without requiring any off-chip memory support, indicates the efficiency of the compression procedure.
A LOW-POWER VIDEO-RATE 2D SUBBAND DECODER
Subband decomposition divides each video frame into several subbands by passing the frame through a series of 2D low-pass and high-pass filters [17]. Each level of subband decomposition divides the image into four subbands. We can hierarchically decompose the image by further subdividing each subband. The choice of low- and high-pass filter coefficients is determined by their ability to accurately reproduce the original image and cancel frequency aliasing between subbands.
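One level of 2D subband decomposition, and its inverse, can be sketched with the two-tap Haar filter pair. Haar is chosen here only because it is the simplest pair that reconstructs the image exactly; the filters used in the decoder chip are not specified in this section, and all function names are illustrative.

```python
def haar_split(samples):
    """One-level 1D analysis: (low-pass, high-pass), each half length."""
    pairs = list(zip(samples[0::2], samples[1::2]))
    low = [(a + b) / 2 for a, b in pairs]
    high = [(a - b) / 2 for a, b in pairs]
    return low, high

def haar_merge(low, high):
    """Inverse of haar_split."""
    out = []
    for l, h in zip(low, high):
        out.extend((l + h, l - h))
    return out

def transpose(matrix):
    return [list(row) for row in zip(*matrix)]

def analyze_2d(image):
    """Split an image into LL, LH, HL, HH subbands (rows, then columns)."""
    lows, highs = zip(*(haar_split(row) for row in image))

    def split_columns(half):
        low, high = zip(*(haar_split(col) for col in transpose(half)))
        return transpose(low), transpose(high)

    ll, lh = split_columns(lows)
    hl, hh = split_columns(highs)
    return ll, lh, hl, hh

def synthesize_2d(ll, lh, hl, hh):
    """Rebuild the image: vertical synthesis first, then horizontal."""
    def merge_columns(low, high):
        cols = [haar_merge(a, b) for a, b in zip(transpose(low), transpose(high))]
        return transpose(cols)

    lows = merge_columns(ll, lh)
    highs = merge_columns(hl, hh)
    return [haar_merge(a, b) for a, b in zip(lows, highs)]

image = [[(3 * x + 5 * y) % 17 for x in range(4)] for y in range(4)]
ll, lh, hl, hh = analyze_2d(image)
restored = synthesize_2d(ll, lh, hl, hh)
print(restored == image, len(ll), len(ll[0]))
```

Applying `analyze_2d` again to the LL subband gives the next level of the hierarchy, with each level halving the width and height of the band being split.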
In designing the subband decoder chip, we emphasized a low-power implementation without introducing noticeable degradation in decompressed video quality. Since memory accessing is by far the most power-consuming operation, the main design strategy has been to eliminate memory accesses in favor of on-chip computation. Figure 3 illustrates the overall chip architecture. The line-delay memory stores the horizontal filter outputs and sends them to the vertical filter followed by the scale unit. Lower-level results are temporarily stored in the intermediate-store memory, where they are passed back to the input buffer for reconstruction of the next level. Top-level subband results are stored in the final-result buffer before conversion from the YUV to RGB color space. The RGB results are sent off-chip to a digital-to-analog converter (DAC) and then to the display.
The subband decoder chip, implemented in 0.8 µm CMOS technology, occupies a chip area of 9.5 mm × 8.7 mm (with an active area of 44 mm²) and contains 415,000 transistors [13]. At a 1 V supply voltage, the chip runs at a maximum frequency of 4 MHz and delivers one output every two clock cycles. The subband decoder chip need only run at 6.4 MHz, while most JPEG decoding chips are required to run at a frequency between 30 and 50 MHz. For higher-resolution images, multiple chips would be cascaded, each operating on a maximum 256-pixel-wide slice, producing a final image without boundary artifacts. Table 3 illustrates the power dissipation of multiple subband decoder chips at the required clock frequency when used for decompressing high-resolution images. The operating voltages are determined by the real-time computation requirements. This additional chip-level parallelism keeps the operating frequency and thus the supply voltage low, resulting in extremely low power dissipation even for high-definition TV (HDTV) applications.
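The chip-level slicing can be sketched as a partition of image columns. The 256-pixel limit is the figure quoted above; the image widths in the example are arbitrary, and the sketch ignores whatever overlap or data exchange between neighboring chips is needed to avoid boundary artifacts.

```python
import math

MAX_SLICE_WIDTH = 256  # pixels handled by one subband decoder chip

def chips_needed(image_width):
    """Number of cascaded chips required for a given image width."""
    return math.ceil(image_width / MAX_SLICE_WIDTH)

def slice_bounds(image_width):
    """(start, end) column ranges, one per chip; end is exclusive."""
    return [(start, min(start + MAX_SLICE_WIDTH, image_width))
            for start in range(0, image_width, MAX_SLICE_WIDTH)]

for width in (256, 640, 1920):  # example widths, chosen for illustration
    print(width, chips_needed(width), slice_bounds(width)[-1])
```

Because every chip handles at most 256 columns regardless of the total width, the per-chip clock rate, and hence the supply voltage, does not have to grow with the horizontal resolution.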
A LOW-POWER VIDEO-RATE PYRAMID VECTOR QUANTIZATION DECODER
Pyramid vector quantization (PVQ) is a VQ technique that groups data into vectors and scales them onto a multidimensional pyramid surface. PVQ is a compression technique that provides both compression efficiency and error resiliency, and is well suited to portable video applications [15]. Our PVQ decoder chip performs decompression by converting PVQ codewords into data values and integrates all functionality on a single die [14], requiring no external hardware support or memory.
Unlike standard VQ schemes, which require codebook storage, product PVQ relies on intensive arithmetic computation to perform encoding and decoding. The product PVQ encoding process is as follows: a data vector, formed with L values, is encoded by scaling the vector onto an L-dimensional pyramid surface and finding the nearest lattice point on the pyramid (Fig. 4, with L = 3). Both the scaling factor and an index corresponding to that lattice point are transmitted. The decoding process converts the index back into an L-dimensional vector and scales that vector using the scaling factor.
Since the lattice points on the pyramid are regularly spaced and described by recursive equations of combinatorial values, encoding and decoding PVQ indices is performed with arithmetic computation, using shifts, subtracts, lookups, and compares.
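The idea of decoding an index purely by comparing against and subtracting combinatorial counts can be shown with a small enumeration of the pyramid lattice. The recurrence for the number of lattice points is the standard one for integer vectors with a fixed sum of absolute values; the particular index ordering below is a simple one chosen for this sketch and is not the product-PVQ index layout used by the chip.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def pyramid_count(dim, k):
    """Number of integer vectors of length dim with sum of |x_i| equal to k."""
    if k == 0:
        return 1
    if dim == 0:
        return 0
    return (pyramid_count(dim - 1, k)
            + pyramid_count(dim - 1, k - 1)
            + pyramid_count(dim, k - 1))

def candidates(k):
    """Component values in the order 0, +1, -1, +2, -2, ..., +k, -k."""
    yield 0
    for magnitude in range(1, k + 1):
        yield magnitude
        yield -magnitude

def decode_index(index, dim, k):
    """Map an index in [0, pyramid_count(dim, k)) to a lattice vector."""
    if not 0 <= index < pyramid_count(dim, k):
        raise ValueError("index out of range")
    vector = []
    for remaining in range(dim, 0, -1):
        for value in candidates(k):
            # Vectors that start with this value and complete the pyramid.
            offset = pyramid_count(remaining - 1, k - abs(value))
            if index < offset:
                break
            index -= offset
        vector.append(value)
        k -= abs(value)
    return vector

def decode(index, scale, dim, k):
    """Index plus scaling factor back to data values (no normalization)."""
    return [scale * v for v in decode_index(index, dim, k)]

total = pyramid_count(3, 2)
points = [tuple(decode_index(i, 3, 2)) for i in range(total)]
print(total, points[:4])
```

In a hardware decoder the counts would come from a precomputed table rather than a recursive function, leaving only compares, subtracts, and lookups in the decoding loop.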
The overall block diagram in Fig. 5 shows the general data flow of the chip. The PVQ decoder is divided into four processing blocks. The stream parser parses incoming 16-bit words into PVQ indices and scaling factors using a series of two 32-bit-wide barrel shifters. The index parser decodes each index into four intermediate indices (pattern, shape, sign, and nonzeros) which describe the content of the vector. The vector decoding unit inputs these intermediate indices and generates a data vector by iteratively comparing and subtracting these indices from precomputed combinatorial offsets stored in a ROM. Each of these comparisons increments a counter which sets the decoded vector value. Finally, a 6 x 8-bit pipelined multiplier performs the final scaling of each decoded vector element using the scaling factor sent with each PVQ index. The multiplier comprises a 4-2 adder tree with a carry-propagate adder. FIFO buffers separate the four processing blocks and regulate data flow between them.
The PVQ decoder chip occupies a chip area of 9.7 mm × 13 mm (with an active area of 74.9 mm²).
Because the PVQ decoder chip needs to operate at a frequency twice that of the subband decoder chip to meet the video-rate throughput at 
A WIRELESS VIDEO DECODER SYSTEM
The complete video-on-demand system consists of a portable decoder with a color LC display that decodes and displays compressed video sequences, a radio transmitter and receiver, and an encoding base station implemented by a DSP multiprocessor board. The wireless data transmission is provided by three pairs of direct-sequence spread spectrum radio transceivers manufactured by Proxim, delivering a raw data rate of 720 kb/s. The decoding chipset on the portable decoder receives compressed video data and decompresses them to RGB color components, which are then converted to analog signals for the color display. The display is a 4-in color thin-film-transistor active matrix display with a resolution of 160 pixels × 234 lines. Figure 6 shows the block diagram of the portable decoder design. The received bitstream is decoded by the PVQ decoder chip into subband coefficients. The subband decoder chip then filters these coefficients and reconstructs them into image pixels, each represented by three digital RGB color components. The total power consumption of the decoder chipset is less than 8 mW.
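A back-of-the-envelope bit budget shows how much compression the link demands. The sketch assumes that the full 720 kb/s raw rate is available for video, that the video runs at the 30 frames/s mentioned earlier, and that uncompressed pixels would take 8 bits per RGB component; none of these is stated as a system specification, so the result is only indicative.

```python
RAW_RATE_BPS = 720_000     # raw radio data rate, assumed fully usable
FRAME_RATE = 30            # frames per second, assumed
WIDTH, HEIGHT = 160, 234   # display resolution
UNCOMPRESSED_BPP = 24      # assumed 8 bits per RGB component

bits_per_frame = RAW_RATE_BPS / FRAME_RATE
bits_per_pixel = bits_per_frame / (WIDTH * HEIGHT)
compression_ratio = UNCOMPRESSED_BPP / bits_per_pixel

print(f"bits per frame:    {bits_per_frame:.0f}")
print(f"bits per pixel:    {bits_per_pixel:.2f}")
print(f"compression ratio: {compression_ratio:.1f}:1")
```

Any channel coding or protocol overhead would reduce the usable rate and raise the required compression ratio further.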
CONCLUSIONS
This article presents two wireless video systems, a single-chip digital CMOS video camera as a video encoding device and a portable video-on-demand system as a video decoding device. Both systems demand low power consumption and high computation. To achieve these two seemingly conflicting goals, specialized algorithm development and architectural innovations are often necessary. For the purpose of low power consumption, we applied the low-power design strategies outlined herein to both systems, in which silicon area is traded off for lower processing speeds, hence achieving lower energy per operation. For the purpose of high performance, an efficient parallel processor architecture was employed in the digital camera design, while specialized compression algorithms were developed for the portable video-on-demand system.
In designing these two wireless video systems, we adopted a vertical integration approach, optimizing the design process at all levels, from algorithm development to system architecture to circuit layout. We learned that system performance need not be sacrificed for lower power consumption if the design of algorithms and hardware can be considered concurrently. We feel that this hardware-driven architectural design approach is the key to future low-power implementations of highly integrated systems. 
