Abstract-Low power is one of the most important criteria in many signal-processing system designs, particularly in multimedia cellular applications and multimedia system-on-chip design. Many approaches have been proposed to achieve this goal at different implementation levels, ranging from very-large-scale-integration (VLSI) fabrication technology to system design. In this paper, the multirate low-power design technique is used along with other methods, such as look-ahead and pipelining, to design cost-effective low-power architectures for a compressed-domain video coding co-processor. Our emphasis is on optimizing power consumption by minimizing computational units along the data path. We demonstrate that both low power and high speed can be accomplished at the algorithm/architecture level. Based on calculation and simulation results, the design can achieve significant power savings in the range of 60%-80%, or a speedup factor of two, depending on the needs of users.
I. INTRODUCTION

IN RECENT years, the need for personal mobile communications ("anytime, anywhere" access to multimedia and communication services) has become increasingly clear. Digital cellular telephony, such as the U.S. third-generation code-division multiple access PCS and the European GSM systems, has seen rapid acceptance and growth in the marketplace. Due to the limited power-supply capability of current battery technology, low-power design to prolong the operating time of mobile handsets is vital to success. On the other hand, as VLSI fabrication technology advances, it becomes feasible to design an entire multimedia system on a single chip (system on chip). However, the high power dissipation of such a chip calls for extra cooling devices and expensive packages to dissipate the generated heat. This increases both the weight and the cost of those systems, so low-power design becomes essential. Nevertheless, the development of low-power multimedia systems is still in its infancy. A low-power video coder achieved at the device/process level has been reported in [1], which uses 0.5 μm VLSI fabrication technology. As 0.25 μm and 0.18 μm CMOS technologies mature, designers tend to pursue low power in that direction. However, the device/process-level approach is the most expensive of all low-power techniques because it requires investment in new semiconductor equipment and technology, which is beyond the budget of most small or start-up companies. Furthermore, it takes time for the fabrication technology to become mature enough for mass production and for computer-aided design (CAD) tools to handle "deep submicron" effects. Besides the device/process-level approach, a wide range of techniques has recently been used to achieve low-power, cost-effective architectures for video coding systems [2]-[4]. Those designs are achieved with current technology, without investing in or waiting for new expensive devices, advanced VLSI fabrication technology, and CAD tools.
In this paper, we design the low-power video coding co-processor at the algorithm/architecture level, which provides the most leveraged way to achieve low power consumption when both effectiveness and cost are taken into consideration [5]. Basically, algorithm/architecture low-power design is achieved by reformulating the algorithms and mapping them to efficient low-power VLSI architectures to compensate for the speed loss caused by a lowered supply voltage. We emphasize optimizing the power consumption of the video coding co-processor by minimizing computational units along the data path. The conventional hybrid motion-compensated DCT video coding structure adopted by the standards is not optimized in terms of hardware complexity because the motion estimation and DCT/IDCT units, which consume 80% of the design [2], [6], [7], cannot be combined into one unit. Thus, the following question can logically be posed: "Can we also estimate motion in the compressed domain so that we can optimize the power consumption by reducing the computational units?" In the category of compressed-domain motion estimation, the three-dimensional fast Fourier transform (3D-FFT) has been successfully used to estimate motion in several consecutive frames [8], [9]. But this approach is not compatible with the standards because it requires processing several frames rather than two. Moreover, the FFT, which operates on complex numbers, is not used in any video coding standard, and its global routing structure is undesirable in VLSI design. Fortunately, standard-compliant solutions, the fully DCT-based motion-compensated video coding algorithms, have been provided in [10], [11]. It is well known that the phase of the Fourier transform of a shifted signal encapsulates the information about the shift.
Based on the same argument, the authors of [10], [11] discovered that the motion information of a P or B frame is actually embedded in its DCT coefficients. In other words, the motion can be extracted from the DCT coefficients of the block in the current frame (P/B) and those of its corresponding block in the previous frame (I/P). The overall system architecture is shown in Fig. 1. The main advantages of adopting such an approach are as follows.
• From the implementation viewpoint: We can save silicon area significantly by naturally accommodating both the DCT and motion estimation processors in one processing unit. Based on the VLSI implementation results, the chip size of our combined design (DCT and half-pel motion estimation) under normal operating conditions is smaller than or about the same as that of block-matching designs alone, without a DCT/IDCT unit [12]. This property is very useful for our low-power design at the algorithm/architecture level.
• From the system delay viewpoint: The DCT can be moved out of the feedback loop, and thus the operating speed of the DCT can be reduced to the data rate of the incoming video stream. Moreover, the IDCT is removed from the feedback loop, so only the quantizers and the compressed-domain motion estimator remain in the loop. This not only reduces the complexity of the coder but also reduces the system delay without any loss of performance.
• From the algorithm viewpoint: The approach reduces overall complexity significantly compared to the hybrid motion-compensated DCT video coding schemes in the standards because the overall complexity is now dominated by DCT computation instead of block matching. Owing to the DCT-based nature of the algorithm, fully CORDIC-based (COordinate Rotation DIgital Computer [13]) architectures under normal operating conditions, and their corresponding single-chip VLSI implementation, have been proposed in [12], [14], [15].
In this paper, we extend the video coding architectures in [14], [15] to low-power applications. All advantages of the CORDIC-based design, i.e., high throughput, numerical stability, and a multiplier-free, modular, and solely locally connected structure, are inherited in our low-power design. Based on the calculation and simulation results, the proposed design can be readily applied to high-speed video communication with a speedup factor of two under the normal supply voltage, i.e., 5 V. Alternatively, the same design can operate at half the operating frequency under a lowered supply voltage (3.08 V) while retaining the original data throughput rate. This enables us to achieve significant power savings in the range of 60%-80% without sacrificing system performance (refer to the detailed discussion later). Therefore, our low-power design can satisfy both low-power and high-speed requirements, which are often considered to be conflicting, depending on the needs of users.
The multirate low-power design technique [16], [17] is used along with other low-power design methods, such as look-ahead and pipelining, to achieve low-power/high-speed performance. In what follows, we explain the detailed design of our compressed-domain low-power video coding co-processor. We then present the simulation results in Section III to demonstrate the performance of our design. Finally, the paper is concluded in Section IV.
II. LOW-POWER/HIGH-SPEED ARCHITECTURES
As we have pointed out earlier, the block-matching approach estimates the motion by finding the best match, while the compressed-domain approach estimates the motion by comparing the energy, in terms of the DCT coefficients, of the shifted images. Although it is not as intuitive as block matching, the scheme is easier to understand if the pseudo phase in the compressed domain is regarded as analogous to the phase in the Fourier transform. In other words, the compressed-domain approach is based on the principle that a relative shift in the spatial domain results in a linear phase shift in the Fourier domain. The proposed low-power design that realizes this compressed-domain scheme has a fully pipelined parallel architecture, as shown in Fig. 2. It consists of four major processing stages. Here we consider only the combined design of the DCT and motion estimation units, which serves as the computing engine, or co-processor, of the whole video coding system. The motion can be estimated by taking the current block and its reference block, both of the same size, as inputs. If these two blocks differ by a translational displacement, then the displacement can be found by locating the peak of the inverse two-dimensional DCT (2D-DCT) of the normalized pseudo-phase function of these two blocks as follows:
(1)

where the normalized pseudo phases are computed from the pseudo phases, and the pseudo phases are functions of the type-II and type-I DCT coefficients of the two blocks [18]. (The video coding standards use the type-II DCT.) Notice that the motion vectors are limited to the block size. If a motion vector goes beyond the block boundary, a default motion vector is used instead.
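The Fourier-domain principle that the pseudo-phase scheme is modeled on can be illustrated with a small one-dimensional sketch. This is classical phase correlation on a circularly shifted signal, not the DCT pseudo-phase computation of this paper; the function names and the test signal are hypothetical and chosen only for illustration.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform (forward or inverse)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[m] * cmath.exp(sign * 2j * cmath.pi * k * m / n)
                for m in range(n))
        out.append(s / n if inverse else s)
    return out

def estimate_shift(reference, current):
    """Estimate d such that current[n] == reference[(n - d) % N]."""
    ref_spec = dft(reference)
    cur_spec = dft(current)
    # Normalized cross-power spectrum: only the linear phase term survives.
    cross = []
    for r, c in zip(ref_spec, cur_spec):
        p = c * r.conjugate()
        cross.append(p / abs(p) if abs(p) > 1e-12 else 0)
    corr = dft(cross, inverse=True)
    # The inverse transform is an impulse located at the displacement.
    return max(range(len(corr)), key=lambda i: corr[i].real)

# Illustrative data: an 8-sample signal circularly shifted by 3 samples.
reference = [1.0, 4.0, 2.0, 7.0, 3.0, 0.0, 5.0, 6.0]
shift = 3
current = [reference[(n - shift) % len(reference)]
           for n in range(len(reference))]
print(estimate_shift(reference, current))
```

The peak of the inverse transform plays the same role as the peak of the inverse 2D-DCT of the normalized pseudo phases described above.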
The detailed floor plan of our low-power/high-speed video co-processor design is shown in Fig. 3. It has been shown that doubling the accuracy of motion compensation from integer-pel to half-pel can reduce the bit rate by up to 0.5 bits/sample [19], [20]. Therefore, a two-stage look-ahead half-pel motion estimator is included in our low-power design, as shown in Fig. 3. In other words, our co-processor can estimate motion at either integer-pel or half-pel resolution, depending on the needs of users. Next, we focus on the design of each building block of the compressed-domain video coding co-processor.
A. Two-Stage Look-Ahead Type-II DCT/IDCT Coder
Unlike the conventional DCT coder design using matrix factorization, we adopt the time-recursive DCT [21], [22], which simultaneously generates the type-II DCT and discrete sine transform (DST) coefficients needed to compute the pseudo-phase function. Furthermore, in real-time video signal processing, the data arrive serially. The traditional transformation algorithms [23] buffer the incoming data and then perform the transformation, while the time-recursive approach merges the buffering and transform operations into a single unit of lower total hardware complexity. Most importantly, due to the inherent time-recursive characteristic, we can use the look-ahead method to reduce the power consumption. In principle, the speed-up provided by look-ahead compensates for the speed loss caused by the reduced supply voltage, at the cost of increased hardware complexity.
1) Two-Stage Look-Ahead DCT:
The type-II one-dimensional DCT/DST (1D-DXT-II) of a sequential input data starting from and ending with is defined in [21] Both (2) and (3) can be combined into the following equation: (4) where is related to and is related to by
Based on the above derivations, we can combine those equations together and get (6) The following illustrates how this dually generated DCT and DST lattice structure works to obtain the DCT and DST with a series of input data for a specific . The initial values of the transformed signals and are set to zero so are the initial values in the shift register in the front of the lattice module, as shown in Fig. 4 . In the high-speed image system such as HDTV, digitized images are available in a sequential or stream fashion. In the conventional approaches, the serial data is buffered and then transformed. Waiting for data to become ready will cause additional delay, which is not desirable for real-time service. In our time-recursive design, those input sequence shifts sequentially into the shift register. Then the output signals and , are updated recursively according to (6) . The multiplications in the plane rotation in (6) are replaced by three CORDIC processors. After the input datum shifts into the shift register, the DCT and DST coefficients are dually obtained at the output for this index . To improve the throughput and reduce the latency, a parallel lattice array consists of such lattice modules can be used for parallel computations.
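The time-recursive idea can be sketched numerically. The code below is a minimal illustration under stated assumptions: it uses unnormalized type-II DCT/DST sums over a sliding window (the scaling factors of the paper's definitions are omitted), represents the plane rotation with complex arithmetic instead of CORDIC processors, and uses hypothetical function names. The two-sample update is obtained simply by composing two one-sample updates, which is the essence of two-stage look-ahead.

```python
import cmath
import math

def direct_dxt(window, k):
    """Unnormalized type-II DCT/DST pair of one window, computed directly."""
    n_pts = len(window)
    theta = k * math.pi / (2 * n_pts)
    c = sum(x * math.cos((2 * n + 1) * theta) for n, x in enumerate(window))
    s = sum(x * math.sin((2 * n + 1) * theta) for n, x in enumerate(window))
    return c, s

def recursive_update(c, s, x_old, x_new, k, n_pts):
    """Slide the window by one sample: drop x_old, append x_new."""
    theta = k * math.pi / (2 * n_pts)
    z = complex(c, s)
    # Remove the oldest sample, add the newest, then rotate by -2*theta.
    z = (z + ((-1) ** k * x_new - x_old) * cmath.exp(1j * theta))
    z *= cmath.exp(-2j * theta)
    return z.real, z.imag

def lookahead2_update(c, s, x_old0, x_old1, x_new0, x_new1, k, n_pts):
    """Two-stage look-ahead: advance the window by two samples at once."""
    theta = k * math.pi / (2 * n_pts)
    sign = (-1) ** k
    z = complex(c, s) * cmath.exp(-4j * theta)
    z += (sign * x_new0 - x_old0) * cmath.exp(-3j * theta)
    z += (sign * x_new1 - x_old1) * cmath.exp(-1j * theta)
    return z.real, z.imag

# Illustrative data: window length 8, coefficient index 3.
data = [3.0, -1.0, 4.0, 1.0, -5.0, 9.0, 2.0, -6.0, 5.0, 3.0]
N, k = 8, 3
c0, s0 = direct_dxt(data[0:N], k)
c1, s1 = recursive_update(c0, s0, data[0], data[N], k, N)
c2, s2 = lookahead2_update(c0, s0, data[0], data[1],
                           data[N], data[N + 1], k, N)
print(c1, s1)
print(c2, s2)
```

Each update costs a fixed number of rotations regardless of the window length, and the look-ahead form produces a result only every second sample, which is what allows the clock to be halved.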
2) Two-Stage Look-Ahead Inverse DCT: The type-II inverse IDCT is defined as
The two-stage look-ahead time-recursive updating of IDCT coefficients is given by (7) We can rewrite (7) as follows: (8) where and . The reason to introduce the auxiliary variable , which is defined as is to keep the lattice structure for numerical stability and multiplier-free architecture. And, is related to and is related to by
Based on the above derivations, we can combine those in (9) together and get (10) As a result, we can substitute and in (8) and get (11) Notice that is just an auxiliary variable to keep the lattice structure. The real variable which we are interested in is , which is defined as (12) By following the similar procedure as above, we can relate to as
Or (13) where stands for don't care. Both two-stage look-ahead DCT computation in (6) and its inverse counterpart, IDCT computation in (11) and (13), undergo the similar computing procedure except for minor differences in the input data and rotation angles. In order to save chip area, we can interleave them into a unified structure which contains three CORDIC's, as shown in Fig. 4 .
Clearly, the look-ahead system can be clocked at twice the rate of the original system for high-speed applications. Alternatively, by reducing the supply voltage from $V_{dd}$ to $V'_{dd}$, we increase the propagation delay of the look-ahead system until it equals that of the original system. The propagation delay at a supply voltage of $V_{dd}$ is given by

$$T_{pd} = \frac{C_L V_{dd}}{k\,(V_{dd} - V_t)^2} \qquad (14)$$

where $C_L$ is the capacitance along the critical path, $V_t$ is the device threshold voltage, and $k$ is a constant which depends on the process parameters. For an $M$-stage look-ahead system, the propagation delay is

$$T'_{pd} = \frac{1}{M}\,\frac{C_L V'_{dd}}{k\,(V'_{dd} - V_t)^2} \qquad (15)$$

By equating $T'_{pd}$ in (15) to $T_{pd}$ in (14), we get the following equation:

$$\frac{M\,V_{dd}}{(V_{dd} - V_t)^2} = \frac{V'_{dd}}{(V'_{dd} - V_t)^2} \qquad (16)$$

Substituting $V_{dd} = 5$ V and the device threshold voltage $V_t$ into (16), we find that for a two-stage look-ahead system (i.e., $M = 2$) a supply voltage of $V'_{dd} = 3.08$ V is necessary for the two propagation delays to equal each other. In other words, we achieve a low-power design while keeping the same system throughput.
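The delay-matching condition can be checked numerically with a minimal sketch. It assumes the usual first-order CMOS delay model, in which delay is proportional to V / (V - Vt)^2, and a threshold voltage of 0.7 V; the threshold value is an assumption made here for illustration, and the function name is hypothetical. Under these assumptions the two-stage case gives about 3.09 V, close to the 3.08 V quoted in the text.

```python
import math

def reduced_supply_voltage(v_dd, v_t, stages):
    """Solve  stages * v_dd / (v_dd - v_t)^2 == v / (v - v_t)^2  for v > v_t.

    With r = stages * v_dd / (v_dd - v_t)^2 the condition becomes the
    quadratic  r*v^2 - (2*r*v_t + 1)*v + r*v_t^2 = 0;  the larger root
    is the physically meaningful one (above the threshold voltage).
    """
    r = stages * v_dd / (v_dd - v_t) ** 2
    b = 2.0 * r * v_t + 1.0
    disc = b * b - 4.0 * r * r * v_t * v_t
    return (b + math.sqrt(disc)) / (2.0 * r)

V_DD = 5.0   # normal supply voltage from the text
V_T = 0.7    # ASSUMED threshold voltage (illustrative value)

for stages in (1, 2, 4):
    v_low = reduced_supply_voltage(V_DD, V_T, stages)
    print(stages, round(v_low, 2))
```

With one stage the function returns the original supply voltage, which is a convenient sanity check on the algebra.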
The dynamic power consumption of a CMOS circuit is given by

$$P = \alpha\, C_{total}\, V_{dd}^2\, f \qquad (17)$$

where $\alpha$ is the average fraction of the total node capacitance being switched (also referred to as the activity factor), $C_{total}$ is the total switching capacitance, $V_{dd}$ is the supply voltage, and $f$ is the clock frequency. By employing (17), we get the ratio of the power consumption of the two-stage look-ahead design, $P_{la}$, to the power of the original design, $P_0$, as

$$\frac{P_{la}}{P_0} = \frac{\alpha\, C_{la}\,(3.08\ \text{V})^2\,(f_0/2)}{\alpha\, C_0\,(5\ \text{V})^2\, f_0} \approx 0.19\,\frac{C_{la}}{C_0}$$

where $f_0$ is the original operating frequency, and $C_{la}$ and $C_0$ represent the total switching capacitances of the look-ahead design and the original implementation. Provided that the capacitances due to CORDIC's are dominant in the circuit and are roughly proportional to the number of CORDIC's, we get $C_{la}/C_0 = 3/2$, and hence $P_{la}/P_0 \approx 0.28$, because the low-power design requires three CORDIC's while the original design needs two CORDIC's. Overall, the look-ahead design results in 72% power saving without sacrificing the system throughput, at the expense of 50% hardware overhead. In essence, we trade silicon area for low power consumption.
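The arithmetic behind the 72% figure can be reproduced with a short sketch of the power ratio implied by the dynamic power relation (power proportional to switched capacitance, supply voltage squared, and clock frequency). The function and argument names are illustrative. The halved clock frequency reflects the assumption that the two-stage look-ahead design produces two results per clock cycle at the original throughput.

```python
def power_ratio(cap_ratio, v_low, v_ref, freq_ratio):
    """Ratio of dynamic power, assuming equal activity factors.

    cap_ratio  : switching capacitance of the new design / original
    v_low      : lowered supply voltage (V)
    v_ref      : original supply voltage (V)
    freq_ratio : clock frequency of the new design / original
    """
    return cap_ratio * (v_low / v_ref) ** 2 * freq_ratio

# Two-stage look-ahead: three CORDIC's instead of two, 3.08 V instead
# of 5 V, and half the original clock frequency.
ratio = power_ratio(3.0 / 2.0, 3.08, 5.0, 0.5)
saving = 1.0 - ratio
print(round(ratio, 3), round(100 * saving, 1))
```

The same function applies to any design point described by a capacitance ratio, a voltage pair, and a frequency ratio.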
Based on the same approach as two-stage look-ahead design, we can extend to four-stage look-ahead design and beyond. Let us look at four-stage look-ahead design. By substituting V, V and into (16), we get V. The ratio of the power consumption of four-stage look-ahead design, , to is
Our studies have revealed that the power saving generally increases as more look-ahead stages are employed. However, beyond a certain "critical point," the percentage of further power saving is small while the hardware cost increases drastically. The results are listed in Table I and plotted in Fig. 5. To illustrate the concept of our low-power design, we choose two-stage look-ahead in this paper for our low-power DCT coder design.
Because 2D-DCT can be decomposed into two-stage pipelined 1-D computation, we therefore adopt the same approach as in [14] to extend our low-power DCT design to 2-D design. As a result, it is able to output four type-II DCT coefficients and such as for (18) simultaneously, as shown in Fig. 6 .
B. Pipelining Design for DCT Coefficients Conversion
Pipelining is the most commonly used technique to achieve high speed. The main idea is to insert flip-flops between consecutive pipeline stages so that the delay through the critical path is shortened by a factor equal to the number of pipeline stages. As a result, the system runs faster than the original system by the same factor, at the penalty of increased system latency. On the other hand, pipelining can be used to compensate for the delay incurred in the low-power design when the supply voltage drops.
In order to calculate the pseudo phases, the type-I DCT coefficients of previous block, and such as for (19) are needed. However, it is undesirable to compute those type-I coefficients separately from type-II coefficients otherwise it will increases overall hardware complexity significantly. As a matter of fact, this problem can be circumvented because those type-I DCT coefficients can actually be obtained by the plane rotation of its counterpart type-II DCT coefficients, and , which are stored in the array registers as shown in Fig. 3 . Those type-I and type-II DCT coefficients are related as follows:
(20)

This DCT coefficient conversion can be realized by two-stage orthogonal plane rotations, as shown in Fig. 7. By inserting flip-flops across the feed-forward cut-set, we can achieve a high-speed design. The pipelined design can run twice as fast as the original design because the critical path has been halved. Alternatively, we can reduce the supply voltage from 5 V to 3.08 V, based on the same argument as in (16), while still maintaining the original system throughput. The ratio of the power consumption of the pipelined design, $P_{pipe}$, to the power of the original design, $P_0$, is given by

$$\frac{P_{pipe}}{P_0} = \frac{C_0\,(3.08\ \text{V})^2\,(f_0/2)}{C_0\,(5\ \text{V})^2\, f_0} \approx 0.19$$

where $C_0$ and $f_0$ are the total switching capacitance and the operating frequency of the original design. This leads to 81% power saving at the cost of increased system latency. Here the operating frequency is the same as that in the DCT coder because our low-power design is a synchronous design.
C. Multirate Design for Pseudo-Phase Computation
Other than pipelining, parallel data processing is another frequently used technique to achieve high-speed design. In principle, the desired functions are decomposed into small, independent, parallel tasks. The small tasks are then executed concurrently and the individual results are combined. The well-known "divide-and-conquer" strategy is one kind of such parallel processing. The goal of parallel processing is to utilize each processing element (PE) fully to achieve the maximum data throughput rate. Therefore, this feature is very suitable for high-speed data processing, and its modular design is very desirable for VLSI implementation. The multirate approach used in this paper belongs to this category.
Traditionally, multirate technique is widely used in subband coding-based compression of audio/video signals and in trans-multiplexers that convert between time and frequency division multiplexing [16] . Our interest, on the other hand, is to apply this technique to compensate the speed loss due to lowered supply voltage or to simply speed-up the design under normal condition. The pseudo-phase functions and of integer-pel displacements can be obtained by solving the following system equation [10] for (21) where is the integer-pel displacement and is the pseudo-phase vector. For illustration purpose, the pseudo-phase computation module in the original design is shown in Fig. 8(a) (please refer to [14] for the detail design). Here the processing rate of the operator has to be as fast as the input data rate.
Fig. 10. Programmable module for low-power half-pel motion estimator. Here the "Interface connection" is set based on Table II.

By employing the multirate low-power design, the pseudo phases are computed from the reformulated circuit using the decimated sequences, as shown in Fig. 8(b). The additional concurrency is obtained by dividing the data stream into odd and even sequences. The two pseudo-phase computation modules process these sequences concurrently, and the individual outputs are then combined. The multirate design thus operates at two different rates. Because the operating frequency of the pseudo-phase computation is reduced to half of the input data rate while the overall throughput rate remains the same, the speed penalty is compensated at the architectural level. With an argument similar to that in Section II-A2, we can keep the overall throughput rate while reducing the supply voltage from 5 V to 3.08 V. The multirate design needs 20 CORDIC's, which is twice the number of CORDIC's in the original design, plus additional down/up-sampling devices. Provided that the capacitances due to CORDIC's are dominant in the circuit, the ratio of the power consumption of the multirate design, $P_{mr}$, to the power of the original design, $P_0$, can be obtained as

$$\frac{P_{mr}}{P_0} = \frac{2\,C_0\,(3.08\ \text{V})^2\,(f_0/2)}{C_0\,(5\ \text{V})^2\, f_0} \approx 0.38$$

where $C_0$ and $f_0$ are the total switching capacitance and the operating frequency of the original design.
Overall, we can achieve the power saving of 62% or the speed-up factor of two at the cost of doubled hardware complexity.
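The even/odd decomposition behind the multirate design can be sketched in a few lines. This is an illustration only: `op` is a hypothetical memoryless stand-in for the per-sample pseudo-phase computation, and the sketch assumes each output depends on a single input sample, so that the two half-rate branches are independent.

```python
def full_rate(stream, op):
    """Reference design: one module processes every sample."""
    return [op(x) for x in stream]

def multirate(stream, op):
    """Two modules, each working on a decimated (half-rate) sequence."""
    even = stream[0::2]          # down-sample by two, phase 0
    odd = stream[1::2]           # down-sample by two, phase 1
    even_out = [op(x) for x in even]   # branch 1 runs at half the rate
    odd_out = [op(x) for x in odd]     # branch 2 runs at half the rate
    # Up-sample and interleave the two branch outputs.
    out = [None] * len(stream)
    out[0::2] = even_out
    out[1::2] = odd_out
    return out

def op(x):
    """Hypothetical memoryless operator used only for the demonstration."""
    return 3 * x + 1

samples = [5, 2, 7, 1, 8, 3, 9]
print(multirate(samples, op) == full_rate(samples, op))
```

Because each branch sees only every second sample, its clock can be halved while the combined output keeps the original throughput; the price is the duplicated module, which is why the capacitance term doubles in the power ratio.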
D. Pipelining Design for Peak-Search
As we have pointed out in the introduction, the translational displacement can be found by locating the peak of the inverse 2D-DCT of the normalized pseudo-phase function. Unlike full-search block matching, this peak search is quite straightforward because we only need to locate the maximum values of the 2-D matrices. The 2-D search can simply be decomposed into a row-then-column 1-D search: the peak value of each row is found first, followed by a column search over the row results. If we fail to locate the peak, e.g., because the motion vector goes beyond the block boundaries, a default motion vector is used instead. Since the search is fully pipelined, we can insert flip-flops after the peak search of each row. Therefore, we cut the critical path in half and obtain the low-power design. Based on the same argument as in Section II-B, it achieves 81% power saving under a 3.08 V supply voltage.
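The row-then-column decomposition of the peak search can be sketched as follows. The function name and the sample matrix are hypothetical; the sketch only shows the search order, not the pipelined hardware or the fallback for an out-of-range motion vector.

```python
def peak_search(matrix):
    """Locate the maximum of a 2-D matrix by a row-then-column search.

    Stage 1 finds the peak of every row (a 1-D search per row).
    Stage 2 searches the column of row peaks for the overall maximum.
    Returns (row_index, column_index, peak_value).
    """
    row_peaks = []
    for r, row in enumerate(matrix):
        c = max(range(len(row)), key=lambda j: row[j])
        row_peaks.append((row[c], r, c))
    value, r, c = max(row_peaks, key=lambda t: t[0])
    return r, c, value

# Illustrative 4 x 4 matrix with a clear peak at row 2, column 1.
surface = [
    [0.1, 0.0, 0.2, 0.1],
    [0.0, 0.3, 0.1, 0.0],
    [0.2, 0.9, 0.1, 0.2],
    [0.1, 0.2, 0.0, 0.1],
]
print(peak_search(surface))
```

The intermediate list of row peaks is the natural place for the pipeline registers mentioned above, since the column search depends only on those stored values.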
E. Two-Stage Look-Ahead Half-Pel Motion Estimator
To obtain motion at half-pel accuracy, we first compute the integer-pel motion vectors then use "two-stage look-ahead half-pel motion estimator" in Fig. 3 to compute the half-pel motion vectors. With such an approach, we can avoid conventional interpolation procedure to determine the half-pel motion vectors by only considering the nine positions and surrounding integer-pel motion vectors [11] . As a result, it decreases the overall complexity and avoids undesirable data flow. In other words, the peak positions among and
indicates the half-pel motion as illustrated at the upper right corner of Fig. 9 . Next we will explain how to adopt look-ahead approach mentioned previously to achieve the low-power half-pel motion estimator architecture. By taking a close look at (22) and (23), we observe that both and computations are similar. Here we use computation as an example, the same approach can be applied to . In order to figure out , we can decompose its 2-D computation into two-stage hierarchic 1-D calculations as illustrated in Fig. 9 . As a matter of fact, those computations encircled by dot-boxes, such as (24) in the middle level of Fig. 9 , are similar except the phase differences such as and . Therefore, those computations can actually be integrated and realized by a programmable structure, as shown in Fig. 10 . 
Here, is one of the integer-pel motion vector and indicates different phases. The time index in and denotes the transform starting from which are pseudo phase functions derived in Section II-C. In addition, an auxiliary variable (26) is introduced to maintain the lattice structure similar to that of DCT computation in Section II-A2. To achieve low-power design, we need to find out two-stage look-ahead coefficients and in terms of and . Furthermore, can be obtained indirectly through . Next, we will derive those relationships.
For phase or
Both (27) and (28) can be combined into the following equation:
can be rewritten as (30) can be rewritten as
By combining (30) and (31), we can express in terms of as
We can also relate to as 
The auxiliary variable is related to by
The corresponding parameters and in (40) and (41) depending on the different phases are listed in Table II . The unified programmable module requires three CORDIC's, as shown in Fig. 10 .
Based on the previous assumption, we get the ratio of power consumption of look-ahead design, , to the power of original design, , as follows:
where . Therefore, we can achieve 72% power saving at the expense of 50% hardware overhead.
such programmable modules can be used for parallel computing of for different channels , as shown in Fig. 11 . The peak position among those values indicates half-pel motion vector. Overall the two-stage look-ahead half-pel motion estimator needs a total of CORDIC's and adders, as listed in Table IV .
III. SIMULATION RESULTS AND HARDWARE COST
We implement the low-power architectures for the video coding co-processor using both C and Verilog. Simulations are carried out to verify the behavior of our design, using "Miss America," "Flower Garden," and other test sequences. The original frames and the frames reconstructed using our proposed low-power design are shown in Fig. 12(a) and (b) and Fig. 13(a) and (b), respectively. The simulation results demonstrate that our low-power design achieves video quality comparable to that of the original design. We also compare the speed of the normal design and our low-power/high-speed design for each individual module in Fig. 3. Here the simulation is performed at the gate level with 0.8 μm CMOS technology. The speed-up factors are listed in Table III. Based on the simulation results, we observe that our design can operate at about twice the clock rate of the original design, which agrees with our discussion and derivations in Section II.
To process the video sequence, each frame is divided into nonoverlapping blocks of pixels that serve as inputs to our low-power/high-speed design. The hardware cost and throughput of each building block in Fig. 3 needed to process those blocks are summarized in Table IV. Our design is flexible and scalable because it requires CORDIC processors and adders to obtain motion vectors at integer-pel accuracy, plus additional CORDIC's and adders for half-pel accuracy (see Table IV). The number of hardware components needed in the low-power design compared to that in the original design [14], [15] is also plotted in Fig. 14. A common question is: "How good is our design compared to traditional block-matching motion estimation methods?" Without low-power design, we have implemented our combined motion estimation and DCT units in a single chip [12]. The video coding system works at a 20 MHz clock rate under a 5 V supply voltage with 0.8 μm CMOS technology. Compared to the traditional (full-search or hierarchical-search) motion estimation approaches [7], [24]-[30], our design is smaller than or about the same as those block-matching designs with respect to both power and area. It is important to note that our chip naturally accommodates both the DCT and motion estimation units while the others may require multiple chips. Using Powermill, developed by Synopsys, we compare the power consumption of different modules in our designs under both normal and low-power operating conditions, as listed in Table V. Our low-power design operates at 10 MHz with a 3.3 V supply voltage. It achieves the same throughput rate of 157.82 Mbps as the normal design with the same 0.8 μm CMOS technology. From the simulation results, we observe the following: compared to the normal design, our low-power design achieves a power saving of 64.51%, with a power consumption of 52.44 mW.
This agrees with our previous calculations that our low-power design saves power in the range of 60%-80% at the cost of much less than doubled hardware complexity, as shown in Fig. 14. As a result, low-power design at the algorithm/architecture level is the most leveraged way to achieve low power consumption when both effectiveness and cost are taken into consideration.
IV. CONCLUSION
Anticipating the future trend of running multimedia applications on portable personal devices, we propose cost-effective low-power/high-speed co-processor architectures for video coding systems. Unlike low-power video codec designs that rely on costly advanced deep-submicron fabrication technology, our low-power design is achieved at the algorithmic/architectural level. Our emphasis is on optimizing power consumption by minimizing computational units along the data path. Compared with other approaches, our algorithmic/architectural low-power approach is one of the most economical ways to save power. Techniques such as look-ahead, multirate, and pipelining have been used in our design. Based on the calculation and simulation results, our design achieves a power saving in the range of 60%-80%, or a speed-up factor of two, depending on the needs of users.
