Abstract: In most low-power VLSI designs, the supply voltage is reduced to lower the total power consumption. However, the device speed degrades as the supply voltage goes down. In this paper, we propose new algorithmic-level techniques, based on the multirate approach, to compensate for the increased delays. We apply the technique of polyphase decomposition to design low-power transform coding architectures, in which the transform coefficients are computed through decimated low-speed input sequences. Since the operating frequency is $M$-times slower than in the original design while the system throughput rate is still maintained, the speed penalty can be compensated at the architectural level. We start with the design of low-power multirate discrete cosine transform (DCT)/inverse discrete cosine transform (IDCT) VLSI architectures. Then the multirate low-power design is extended to the modulated lapped transform (MLT), extended lapped transform (ELT), and a unified low-power transform coding architecture. Finally, we perform finite-precision analysis for the multirate DCT architectures. The analytical results help us choose the optimal wordlength for each DCT channel under a required signal-to-noise ratio (SNR) constraint, which can further reduce the power consumption at the circuit level. The proposed multirate architectures can also be applied to very high-speed block discrete transforms in which only low-speed operators are required.
I. INTRODUCTION

Due to the limited power-supply capability of current battery technology, the power constraint becomes an important consideration in the design of personal communications services (PCS) devices. It has been shown that a reduction of the supply voltage is the leveraged way to lower the power consumption. However, a speed penalty is suffered for the devices (operators) as the supply voltage goes down. In [1], the techniques of "parallel processing" and "pipelining" were suggested to compensate for the speed penalty, in which a simple comparator circuit was used to demonstrate how parallel independent processing of data can achieve good compensation at the architectural level. In most digital signal processing (DSP) applications, however, it is almost impossible to directly decompose the problems into independent and parallel tasks as in the comparator case. The properties of the DSP algorithms should be fully exploited in order to develop efficient techniques to compensate for the loss of speed performance under low-voltage operations. The main issue is to reformulate the DSP algorithms so that the desired outputs can be obtained at low power consumption without hindering the system performance such as the data throughput rate. We call such an approach the algorithm-based low-power design.
In this paper, we propose a compensation technique, the multirate approach, for the design of low-power transform coding architectures at the algorithmic/architectural level. To motivate the idea, let us consider the discrete cosine transform (DCT) architecture in Fig. 1. For most of the existing serial-input-parallel-output (SIPO) DCT algorithms and architectures [2], [3], the processing rate must be as fast as the input data rate [Fig. 1(a)]. In our low-power design, the DCT is computed from the reformulated circuit using the decimated sequences [Fig. 1(b)]. Since the operating speed of the processing elements is reduced to half of the original data rate while the data throughput rate is still maintained, the speed penalty is compensated at the architectural level. As to the power consumption, using the CMOS power dissipation model [1], we can predict that the overall power consumption of the multirate design can be reduced to about one-third of that of the original system. Therefore, the downsampling scheme provides a direct and efficient way to achieve low-power design at the algorithmic/architectural level.
Two different approaches to derive the multirate low-power transform coding architectures are presented in this paper. One is the polyphase decomposition approach. By applying the technique of polyphase decomposition [4] to the infinite-extent impulse response (IIR) transfer functions of discrete orthogonal transforms [3], we can reformulate the transfer functions so as to perform those transforms using $M$-times slower decimated input sequences, where $M$ is the decimation factor. The speed penalty caused by the low-voltage operation can be compensated at the architectural level at the expense of a linear complexity increase. The other is based on the logarithmic decomposition approach. We show a scheme to perform the polyphase decomposition in a cascade way so that only logarithmic overhead is required to compensate for the speed penalty. We illustrate both low-power design approaches by using the DCT as an example. Later, the designs are extended to the modulated lapped transform (MLT) and extended lapped transform (ELT) [5], [6]. Then, based on the derivations of the MLT and ELT, we propose a unified low-power transform coding architecture. It can perform most of the existing discrete orthogonal transforms by simply reprogramming the computational modules.
Finally, we examine the finite-wordlength effect of the proposed low-power DCT architectures. In general, a shorter wordlength results in fewer switching events, lower capacitance, and shorter average routing length in the VLSI design. To achieve low power consumption at the circuit level, we need to choose the minimum wordlength that still meets the signal-to-noise ratio (SNR) requirement. By applying our fixed-point analytical results, we can assign the optimal wordlength for each DCT channel under the SNR constraint. Moreover, the analyses show that the multirate designs have better numerical properties under fixed-point arithmetic.
This paper is organized as follows. In Section II, the derivations of the low-power DCT/IDCT algorithms and architectures are described. In Section III, we derive the multirate MLT and ELT algorithms and architectures. Then, a unified low-power IIR structure for most discrete orthogonal transforms is described. The fixed-point analysis is presented in Section IV, followed by the conclusions.
II. MULTIRATE LOW-POWER DCT/IDCT ARCHITECTURES
The DCT of a series of input data starting from $x(t)$ and ending at $x(t+N-1)$ is defined as

$$X_k(t) = C(k)\sqrt{\frac{2}{N}} \sum_{n=t}^{t+N-1} x(n) \cos\left[\frac{(2(n-t)+1)k\pi}{2N}\right], \quad k = 0, 1, \ldots, N-1 \tag{1}$$

where $C(0) = 1/\sqrt{2}$ and $C(k) = 1$, $k = 1, \ldots, N-1$, are the scaling factors. By considering the transform operator as a linear shift invariant (LSI) system that maps the serial input data into the DCT coefficients, we can derive a second-order IIR transfer function from (1) as [3]

$$H_k(z) = \frac{X_k(z)}{X(z)} = \left[(-1)^k - z^{-N}\right] \frac{C(k)\sqrt{2/N}\,\cos(\omega_k/2)\,(1 - z^{-1})}{1 - 2\cos\omega_k\, z^{-1} + z^{-2}} \tag{2}$$

where $\omega_k = k\pi/N$; $X_k(z)$ and $X(z)$ denote the $z$-transforms of $X_k(t)$ and $x(t)$, respectively. For block processing, the $z^{-N}$ in (2) can be eliminated. The corresponding IIR DCT structure is shown in Fig. 2. It works in a SIPO way, and the resulting parallel architecture is regular, modular, and fully pipelined. Also, the SIPO feature avoids the input buffers as well as the index mapping operations that are required in most parallel-input-parallel-output (PIPO) DCT architectures [7], [8]. One disadvantage of the IIR structure in Fig. 2 is that the updating of the DCT coefficients must be as fast as the input data rate. Hence, it suffers from the speed penalty as the supply voltage of the computational module goes down for low power consumption. We shall reformulate the transfer function using the multirate approach, so that the speed degradation can be compensated at the architectural level.
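The block-processing behaviour of such an IIR structure can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: it assumes the orthonormal DCT with scaling $C(k)\sqrt{2/N}$ and a per-channel transfer function of the form $(-1)^k C(k)\sqrt{2/N}\cos(\omega_k/2)(1 - z^{-1})/(1 - 2\cos\omega_k z^{-1} + z^{-2})$ with $\omega_k = k\pi/N$ and cleared loop state at the start of each block. The function names `scale`, `direct_dct`, and `iir_dct` are hypothetical and do not come from the paper.

```python
import math


def scale(k, n_pts):
    """Assumed orthonormal DCT scaling C(k) * sqrt(2/N)."""
    c_k = 1.0 / math.sqrt(2.0) if k == 0 else 1.0
    return c_k * math.sqrt(2.0 / n_pts)


def direct_dct(x):
    """Reference block DCT computed straight from the cosine sum."""
    n_pts = len(x)
    return [
        scale(k, n_pts)
        * sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * n_pts))
              for n in range(n_pts))
        for k in range(n_pts)
    ]


def iir_dct(x):
    """SIPO DCT sketch: every channel k runs its own second-order IIR loop."""
    n_pts = len(x)
    out = []
    for k in range(n_pts):
        omega = k * math.pi / n_pts
        gain = (-1) ** k * scale(k, n_pts) * math.cos(omega / 2.0)
        w1 = w2 = 0.0  # loop state w(n-1), w(n-2), cleared for each block
        for sample in x:  # one loop update per input sample (full data rate)
            w0 = sample + 2.0 * math.cos(omega) * w1 - w2
            w2, w1 = w1, w0
        # Numerator (1 - z^-1) applied to the last two loop states.
        out.append(gain * (w1 - w2))
    return out
```

Feeding the same block to `direct_dct` and `iir_dct` should give matching coefficients up to floating-point error, which makes the full-rate update of the loop (one iteration per input sample) easy to see.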
A. Low-Power Design of the IIR DCT
Splitting the input data sequence into its even-indexed and odd-indexed subsequences, (1) becomes (3).
Taking the $z$-transform on both sides of (3) and rearranging, we have (4), which is expressed in terms of the $z$-transforms of the even and odd sequences. The parallel architecture to realize (4) is depicted in Fig. 3.
To achieve downsampling by a factor of four, we can split the input data sequence into four decimated sequences. Following the derivations in (3) and (4), we can write the output as (5) in terms of the $z$-transforms of the four decimated sequences. The corresponding multirate architecture is shown in Fig. 4.
From Figs. 3 and 4, we can see that basically the multirate DCT architectures retain all advantages of the original IIR structure in [3] such as modularity, regularity, and local interconnections. This makes the proposed architectures very suitable for VLSI implementations. It is also interesting to see that the speed-compensation capability of our architectures is achieved at the expense of "locally" increased hardware complexity and routing paths. This feature of local interconnection and local hardware overhead is especially preferable in VLSI design when the transformation size is large (e.g., the MPEG audio codec in which a 32-point DCT/IDCT is used [9] ).
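As a rough numerical illustration of the downsampling-by-two case, the sketch below computes one DCT channel from the even and odd input subsequences only, so that the recursive loop is updated once per pair of samples. It is a sketch under assumptions, not the authors' exact equation (4): it assumes the orthonormal DCT scaling, cleared state at the start of each block, an even block size, and the identity $1/(1 - 2\cos\omega z^{-1} + z^{-2}) = (1 + 2\cos\omega z^{-1} + z^{-2})/(1 - 2\cos 2\omega z^{-2} + z^{-4})$. All function names are hypothetical.

```python
import math


def scale(k, n_pts):
    """Assumed orthonormal DCT scaling C(k) * sqrt(2/N)."""
    c_k = 1.0 / math.sqrt(2.0) if k == 0 else 1.0
    return c_k * math.sqrt(2.0 / n_pts)


def dct_channel_direct(x, k):
    """Reference value of DCT channel k from the cosine sum."""
    n_pts = len(x)
    return scale(k, n_pts) * sum(
        x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * n_pts))
        for n in range(n_pts))


def dct_channel_multirate2(x, k):
    """DCT channel k computed from half-rate even/odd sequences (M = 2)."""
    n_pts = len(x)
    if n_pts % 2:
        raise ValueError("block size must be even for M = 2")
    omega = k * math.pi / n_pts
    c = math.cos(omega)
    gain = (-1) ** k * scale(k, n_pts) * math.cos(omega / 2.0)
    x_even, x_odd = x[0::2], x[1::2]
    v1 = v2 = 0.0            # half-rate loop state
    xe_prev = xo_prev = 0.0  # previous even/odd samples
    for xe, xo in zip(x_even, x_odd):  # N/2 iterations instead of N
        # Feed-forward part: polyphase components of
        # (1 - z^-1)(1 + 2c z^-1 + z^-2) acting on the odd/even inputs.
        u = xo + (1.0 - 2.0 * c) * xo_prev + (2.0 * c - 1.0) * xe - xe_prev
        # Recursive part now has its pole angle doubled to 2*omega.
        v0 = u + 2.0 * math.cos(2.0 * omega) * v1 - v2
        v2, v1 = v1, v0
        xe_prev, xo_prev = xe, xo
    return gain * v1
```

The loop body runs $N/2$ times per block, which is the sense in which the processing elements can be clocked at half the input rate while the block throughput is unchanged; the price is the extra feed-forward multiplications in `u`.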
B. Low-Power Design of the IIR IDCT
The IIR transfer function for the IDCT is given by [3] as (6).
As with the derivations of the low-power IIR DCT, the multirate transfer function for the IDCT with $M = 2$ can be derived as (7). The corresponding low-power IIR IDCT structure is illustrated in Fig. 5. Similarly, the multirate transfer function for $M = 4$ can be derived as (8).
C. Polyphase Decomposition Approach
In the preceding discussions, we derived the multirate DCT/IDCT by rearranging the $z$-transforms of the decimated sequences. Here, we will show a systematic way to derive the results by applying the polyphase decomposition [4] to the original IIR transfer function.
Substituting the identity (9) into the IIR DCT transfer function in (2) and rearranging, the transfer function under block processing can be written as (10), and its corresponding polyphase implementation is shown in Fig. 6(a). Then we can apply the noble identities [4] to move the downsampling operation toward the left and obtain Fig. 6(b), which leads to the multirate DCT architecture in Fig. 3. Similarly, $M = 4$ can be achieved by performing another polyphase decomposition (11) in (10). After algebraic simplifications, we obtain (5), and its corresponding implementation allows us to perform the DCT at a four times slower clock rate. To derive the multirate transfer function for an arbitrary $M$, we can repeatedly apply the polyphase decomposition to the IIR transfer function until the resulting transfer function is fully expanded with all exponents being multiples of $M$.
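The repeated decomposition can be checked at the coefficient level with a short sketch. It assumes that the identity being applied is $(1 - 2\cos\theta\, z^{-s} + z^{-2s})(1 + 2\cos\theta\, z^{-s} + z^{-2s}) = 1 - 2\cos 2\theta\, z^{-2s} + z^{-4s}$, which follows from the double-angle formula; whether this is literally the paper's (9) is an assumption. The helper names `poly_mul`, `second_order`, and `decimated_form` are hypothetical.

```python
import math


def poly_mul(a, b):
    """Multiply two polynomials in z^-1 given as coefficient lists."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            out[i + j] += a_i * b_j
    return out


def second_order(omega, step, sign):
    """Coefficients of 1 + sign*2cos(step*omega) z^-step + z^-(2*step)."""
    coeffs = [0.0] * (2 * step + 1)
    coeffs[0] = 1.0
    coeffs[step] = sign * 2.0 * math.cos(step * omega)
    coeffs[2 * step] = 1.0
    return coeffs


def decimated_form(omega, m):
    """Return (extra_numerator, denominator) with 1/D_1 = extra/D_m.

    D_s denotes 1 - 2cos(s*omega) z^-s + z^-2s; m must be a power of two.
    """
    if m < 1 or m & (m - 1):
        raise ValueError("m must be a power of two")
    numerator = [1.0]
    step = 1
    while step < m:
        # Each doubling multiplies in one "mirror" factor with a plus sign.
        numerator = poly_mul(numerator, second_order(omega, step, +1.0))
        step *= 2
    return numerator, second_order(omega, m, -1.0)
```

Multiplying the original denominator by `extra_numerator` reproduces a denominator whose only nonzero exponents are multiples of `m`, so the recursive part can be clocked `m` times slower, while the feed-forward numerator grows with each doubling.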
D. Logarithmic Decomposition Approach
In Section II-C, we have shown that the substitution of (11) into (10) leads to the multirate DCT architecture in Fig. 4. However, this multirate design requires hardware overhead to lower the input clock rate directly by four, which may not be acceptable when $M$ is large and the chip area is limited. To reduce the hardware complexity, we rewrite (10) together with (11) in a cascade form, i.e., (12). Fig. 7(a) shows the polyphase implementation of (12). The corresponding cascade multirate DCT architecture is depicted in Fig. 7(b). There are two major blocks. One operates at half the sample rate and the other at one-fourth of the sample rate. Since the denominator of the transfer function follows a special format, we can repeatedly perform the polyphase decomposition on the denominator and retain the same cascade form. We then have (13). The resulting architecture decimates the operating frequency on a stage-by-stage basis: in each stage, the operating frequency is reduced by half. After reaching the $i$th stage, the clock rate becomes $2^i$ times slower than the original data rate. The results can be extended to the IDCT as well as to the other low-power transform designs discussed in the next section.
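The cascade idea can be illustrated by simulating the all-pole part of the transfer function as a chain of short feed-forward stages followed by a slow recursive stage. This is a sketch under the assumption that each stage applies one factor $1 + 2\cos(s\omega) z^{-s} + z^{-2s}$ with $s$ doubling from stage to stage; it is not taken from the paper's (12) and (13), and the function names are hypothetical.

```python
import math


def all_pole(x, omega, step):
    """y(n) = x(n) + 2cos(step*omega) y(n-step) - y(n-2*step).

    Only samples 'step' apart interact, so the loop is equivalent to
    'step' interleaved recursions, each running 'step' times slower.
    """
    c = 2.0 * math.cos(step * omega)
    y = []
    for n, x_n in enumerate(x):
        y1 = y[n - step] if n >= step else 0.0
        y2 = y[n - 2 * step] if n >= 2 * step else 0.0
        y.append(x_n + c * y1 - y2)
    return y


def mirror_fir(x, omega, step):
    """y(n) = x(n) + 2cos(step*omega) x(n-step) + x(n-2*step)."""
    c = 2.0 * math.cos(step * omega)
    y = []
    for n, x_n in enumerate(x):
        x1 = x[n - step] if n >= step else 0.0
        x2 = x[n - 2 * step] if n >= 2 * step else 0.0
        y.append(x_n + c * x1 + x2)
    return y


def cascade(x, omega, stages):
    """Chain 'stages' feed-forward factors, then one slow recursive loop."""
    y = list(x)
    step = 1
    for _ in range(stages):
        y = mirror_fir(y, omega, step)  # one extra factor per stage
        step *= 2                       # next stage works at half the rate
    return all_pole(y, omega, step)
```

With `stages = 2`, the recursion only couples samples four apart, yet `cascade` should reproduce the output of the full-rate `all_pole(x, omega, 1)`. Each additional halving of the clock rate costs one more feed-forward factor, which is why the overhead of this arrangement grows with the number of stages rather than with the decimation factor itself.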
E. Power Estimation and Complexity Comparison
The power dissipation in a well-designed digital CMOS circuit can be modeled as [10] (14) 
where $M$ is the decimation factor and $V_t$ is the threshold voltage of the device.
Assume fixed supply and threshold voltages in the original system. From (15), it can be shown that the supply voltage can be as low as 3.1 V for the case of $M = 2$. The 16-point DCT under normal operation requires 30 multipliers and 32 adders. For the low-power 16-point DCT with $M = 2$, 45 multipliers and 49 adders are required. Provided that the capacitance due to the multipliers is dominant in the circuit and is roughly proportional to the number of multipliers, the power consumption of the low-power DCT can be estimated as (16),
expressed relative to the power consumption of the original system. Similarly, for the case of $M = 4$, the total power of the 16-point multirate DCT can be estimated as (17).
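The estimate for the $M = 2$ case can be reproduced with a small numerical sketch. Several things here are assumptions rather than statements from the text: the first-order models (power proportional to $C V_{dd}^2 f$, gate delay proportional to $V_{dd}/(V_{dd} - V_t)^2$), and the example values of a 5 V original supply and a 0.7 V threshold, which were chosen because they are consistent with the 3.1 V figure quoted for $M = 2$. The function names are hypothetical.

```python
def delay_factor(v_dd, v_t):
    """Assumed first-order gate delay, proportional to V_dd / (V_dd - V_t)^2."""
    return v_dd / (v_dd - v_t) ** 2


def scaled_supply(v_dd, v_t, m, tol=1e-9):
    """Supply voltage at which the gate delay is m times the original delay.

    delay_factor decreases monotonically for v > v_t, so bisection on
    (v_t, v_dd] finds the unique solution for m >= 1.
    """
    target = m * delay_factor(v_dd, v_t)
    lo, hi = v_t + 1e-6, v_dd
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if delay_factor(mid, v_t) > target:
            lo = mid  # still too slow: raise the voltage
        else:
            hi = mid
    return 0.5 * (lo + hi)


def power_ratio(mult_new, mult_old, v_new, v_old, m):
    """Relative power, assuming capacitance scales with the multiplier count."""
    return (mult_new / mult_old) * (v_new / v_old) ** 2 / m


V_DD, V_T = 5.0, 0.7  # assumed example values, not taken from the text
v_m2 = scaled_supply(V_DD, V_T, 2)
ratio_m2 = power_ratio(45, 30, v_m2, V_DD, 2)
```

Under these assumptions `v_m2` comes out close to 3.1 V and `ratio_m2` is a little under 0.3, in line with the "about one-third" figure quoted in the Introduction. The same functions can be reused for $M = 4$ once the corresponding multiplier count is known.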
Table I summarizes the hardware cost for the proposed DCT/IDCT architectures based on the polyphase decomposition approach. As we can see, low power consumption is achieved at the expense of a linear complexity overhead. In the logarithmic low-power architecture, the presence of multiple operating frequencies allows us to use different supply voltages according to the slowest allowable operating speed of each stage (the so-called voltage scaling approach). As a result, the power consumption of the 16-point low-power DCT architecture in Fig. 7(b) can be estimated as (18)
where the quantities involved are the total number of multipliers required in the normal DCT in Fig. 2 and the numbers of multipliers in the half-rate stage and the quarter-rate stage, respectively. From (18), we can see that the overall power consumption of the logarithmic low-power design lies between those of the $M = 2$ and $M = 4$ full multirate DCT systems discussed in Section II-A.
As to the complexity, it can be shown that the number of multipliers needed to realize the multirate transfer function in (13) grows only logarithmically with the decimation factor. The comparison of the logarithmic low-power architecture with other approaches is listed in Table II. Although the total power saving of the logarithmic structure is less than that of the full multirate structure for the same $M$, its lower hardware overhead is preferable when low power consumption is needed without trading too much chip area.
F. Comparisons of Architectures
We use the DCT as an example to compare the proposed multirate SIPO architectures with well-known SIPO and PIPO architectures [3], [8]. A comparison of their inherent properties is listed in Table III. The advantages of the SIPO approach over the PIPO approach in VLSI implementation, such as local communication and linear hardware complexity increase, have been discussed thoroughly in [2] and [3]. Nevertheless, when the speed compensation capability is of concern, the PIPO approach is also a good choice, since PIPO processing with block size $N$ is equivalent to decimating the input data by a factor of $N$. However, this advantage is obtained at the expense of "globally" increased hardware and routing paths. Besides, the block size is usually restricted to be a power of two due to the "divide-and-conquer" nature of those PIPO fast algorithms. From Table III, we can see that our multirate SIPO approach is a good compromise between the other two approaches. The multirate approach inherits all the advantages of the existing SIPO approach; meanwhile, it can compensate for the speed penalty at the expense of "locally" increased hardware and routing, which is not the case in the PIPO approach.
III. UNIFIED LOW-POWER TRANSFORM CODING ARCHITECTURE
In this section, we extend the multirate design to the MLT and ELT which belong to the family of lapped orthogonal transforms (LOT) [5] , [6] , [12] . They can help to diminish the blocking effect encountered in low bit-rate block transforms. Then, we derive a unified transform coding architecture that is capable of performing most of the discrete orthogonal transforms based on the same VLSI computational modules.
A. The IIR MLT Structure
The MLT operating on segments of input data is defined as [5] in (19). After some manipulation, (19) can be rewritten as (20), where the two constituent sequences are given by (21) and (22) and the associated parameters, which depend on the block size, are given by (23).
The IIR transfer functions for (21) and (22) can be computed as (24) and (25).
The corresponding IIR module for the dual generation of the sequences in (21) and (22) is depicted in Fig. 8(a). This IIR module can be used as a basic building block to implement the MLT according to (20). Fig. 8(b) illustrates the overall time-recursive MLT architecture. It consists of two parts. One is the IIR module array, which computes the sequences in (21) and (22) at different indexes in parallel. The other is the programmable interconnection network, which selects and combines the outputs of the IIR array to generate the MLT coefficients based on (20).
B. Low-Power Design of the MLT
As with the low-power DCT, we can compute the multirate IIR transfer functions of the sequences in (21) and (22) as (26) and (27). The parallel architecture to realize (26) and (27) is shown in Fig. 9. It consists of two of the MLT modules in Fig. 8(a). Through such manipulation, only decimated sequences are processed inside the module. Hence, the MLT module can operate at half of the original clock rate at the cost of doubling the hardware complexity. The comparison of hardware cost is shown in Table I. Taking the power consumption of the MLT module in Fig. 8(a) as the reference, it can be shown from the CMOS power model that the power consumption of the low-power MLT modules is reduced to a fraction of that reference for the cases of $M = 2$ and $M = 4$.
C. The ELT and Unified IIR Transform Coding Design
The ELT for a segment of input data is defined as [14] in (28). As with the treatment of the MLT, we can rewrite (28) as (29), with the constituent sequences and parameters given by (30)-(32). Define the relationships in (20) and (29) as the combination functions. After comparing (20)-(23) with (29)-(32), we see that the MLT and ELT have identical mathematical structures except for the definitions of the parameters and the combination functions. Hence, they can share the same VLSI architectures that are depicted in Figs. 8 and 9. We only need to change the multiplier coefficients [use (32)] and the interconnection network [use (29)] to perform the ELT.
The aforementioned design concept can be generalized to perform most of the existing discrete sinusoidal transforms. For example, the sequence in (21) is equivalent to the DCT under the parameter setting in (33).
As a result, the multirate MLT module in Figs. 8 and 9 can be used to compute the DCT. Another example is the discrete Fourier transform (DFT) with real-valued inputs.
With the parameter setting in (34), (21) and (22) become (35) and (36), which are the real part and the imaginary part of the DFT, respectively. The setting of parameters as well as the corresponding combination functions for other orthogonal transforms is summarized in Table IV. The programmable feature of the unified transform coding architecture can be incorporated into the design of a high-performance reconfigurable DSP computing engine for multimedia applications [15].
IV. FINITE-PRECISION ANALYSIS
In this section, we consider the finite-precision effects of the proposed low-power DCT architectures. There are two basic considerations in the fixed-point analysis. One is the rounding error. The mean and variance of the rounding error are given by [16, ch. 6] as (37), where the variance depends on the assigned wordlength. The other is the dynamic range. To prevent overflow in a fixed-point implementation, a suitable scaling of the input signal is usually employed according to the dynamic range of the system. In practice, the SNR of the scaled system is degraded by the scaling process and is given by [16, ch. 6] as (38), in terms of the scaling factor and the SNR of the original system.
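The rounding-error statistics can be illustrated empirically. The sketch below assumes the usual uniform rounding-noise model, in which rounding to a step of $2^{-B}$ gives an error with zero mean and variance $2^{-2B}/12$; whether the paper's (37) uses exactly this convention for the wordlength (for example, whether a sign bit is counted) is an assumption, and the symbol $B$ and the function names are hypothetical.

```python
import math
import random


def round_to_wordlength(value, bits):
    """Round to the nearest multiple of 2**-bits (bits = fractional bits)."""
    step = 2.0 ** (-bits)
    return math.floor(value / step + 0.5) * step


def rounding_error_stats(bits, samples=100000, seed=0):
    """Empirical mean and variance of the rounding error for random inputs."""
    rng = random.Random(seed)  # fixed seed keeps the experiment repeatable
    errors = []
    for _ in range(samples):
        value = rng.uniform(-1.0, 1.0)
        errors.append(round_to_wordlength(value, bits) - value)
    mean = sum(errors) / samples
    variance = sum((e - mean) ** 2 for e in errors) / samples
    return mean, variance


def model_variance(bits):
    """Variance predicted by the assumed uniform noise model."""
    return 2.0 ** (-2 * bits) / 12.0
```

Each extra bit of wordlength lowers the modeled noise variance by a factor of four (about 6 dB), which is the trade-off behind choosing the shortest wordlength that still meets a given SNR target.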
A. Analysis for the Normal IIR DCT
Using the "statistical error model" [16, ch. 6], the rounding error of the IIR DCT structure in Fig. 2 can be modeled as (39), in terms of the rounding errors caused by the individual multipliers. It can be shown that the injected noise power satisfies (40), where the number of noise sources contributed by the multipliers in the IIR loop is given by (41).
Due to the presence of the rounding error, the actual output of the DCT architecture can be represented as (42), which includes the output error contributed by the injected noise. Consider the transfer function of the system from the node at which the noise is injected to the output, and the corresponding unit-sample response. From Fig. 2 and (40)-(45), we can represent the total noise power at each DCT channel as (46). As we can see, for a given channel wordlength, the rounding error grows linearly with the block size $N$. On the other hand, the noise power is inversely proportional to a channel-dependent factor. That is, the effect of the rounding error in each channel of the IIR DCT greatly depends on the pole locations of the IIR transfer function.
Next, we consider the dynamic range. By examining those nodes in Fig. 2 that may cause overflow, the dynamic range of the overall IIR DCT structure can be found to be (47). Suppose that a one-time scaling scheme is used at the input end to avoid overflow, and that it is done by shifting the data to the right by a fixed number of bits. The scaling factor can then be expressed in terms of the number of shifted bits, using a bound from [17]. To achieve 40 dB in SNR for the $k$th DCT channel, the optimal wordlength can be computed from the resulting SNR expression.
As an example, the optimal wordlengths for the case $N = 8$ under the constraint SNR = 40 dB are listed in Table V, together with the averaged system wordlength. As we can see, the averaged wordlength in Table V is sufficient to meet the accuracy criteria. Compared with the DCT implementations in [18] and [19], in which the wordlength was chosen based on experimental simulation results, our analytical approach provides more insight for determining the architectural specification than the experimental approach.

TABLE V: OPTIMAL WORDLENGTH ASSIGNMENT UNDER THE CONSTRAINT SNR = 40 dB, WHERE N = 8
B. Analysis for the Low-Power IIR DCT with $M = 2$
In Fig. 3 , the power of the injected rounding error can be modeled as (51) Note that , and the total iteration is reduced to . The total noise power at the output becomes (52) From (52), we observe the following.
1) Although the total number of noise sources increases, the total noise power is compensated by the halved number of iterations.
2) Compared with the factor in (46), the factor in (52) has a similar effect on the SNR of each DCT channel but with a halved period.
Next, we apply the technique of superposition to analyze the dynamic range of the multirate DCT architecture. Namely, we first set one decimated input sequence to zero while analyzing the dynamic range contributed by the other; then we repeat the analysis with the roles of the two sequences exchanged. The overall dynamic range can be found from the summation of the two dynamic ranges, which is given by (53).
Using the analytical results in (52) and (53), we can also find the optimal wordlengths for the multirate design under the 40-dB SNR constraint, as shown in Table V. It is interesting to note that the multirate DCT architecture not only achieves low power consumption but also has better numerical properties under fixed-point implementation than the normal DCT architecture. To verify our analytical results, computer simulations are carried out. As we can see in Fig. 10, there is close agreement between the theoretical and experimental results. Also, the SNR distribution is affected by the factor in (54). Fig. 11 shows the averaged SNR. Compared with the simulation results in [17], the proposed IIR DCT architectures give SNR performance comparable to that of the DCT architectures by Hou [7] and Lee [8] under fixed-point arithmetic. It is worth noting that the multirate DCT architectures have better SNR results than the normal IIR DCT architectures.
V. CONCLUSIONS
In this paper, we presented the algorithm-based low-power design of the transform coding kernels using the multirate approach. Extension of our designs to low-power 2-D transforms can be achieved by employing the time-recursive 2-D DCT architecture proposed by Chiu and Liu [20]. Another attractive application of our design is in very high-speed data processing. Suppose that we do not lower the supply voltage for low power consumption. The multirate parallel architectures are then in fact high-speed VLSI architectures with a speedup of $M$. For example, if we want to perform the DCT for serial data at 200 MHz, we may use the parallel architecture in Fig. 4, in which only 50 MHz adders and multipliers are required. Therefore, we can perform very high-speed DCT by using only low-cost and low-speed processing elements.
