Abstract: The dependence of modified discrete cosine transform (MDCT) performance on hardware architecture is investigated. The oddly stacked architecture is found to be superior to direct computation in terms of accuracy, power consumption, and circuit area.
I. INTRODUCTION
Improvements in integrated circuit technology have enabled electronic systems to be built on single chips. Optimal implementation of such a system on a chip (SOC) requires detailed study of the tradeoffs between system performance and hardware parameters. In this study, we consider the tradeoffs involved in the implementation of the modified discrete cosine transform (MDCT).
The MDCT is employed in subband coding schemes as the analysis filter bank based on time-domain aliasing cancellation (TDAC). It is a critical block of the MP3 audio coding standard, the first international standard [1, 2] for digital compression of high-fidelity audio. The high coding gain of the standard arises from its time-to-frequency mapping by a polyphase filterbank. The outputs from the subband filter are then passed through an MDCT to obtain higher frequency resolution. This process is illustrated in Fig. 1. In the second part of the MDCT calculation, a discrete cosine transform is performed on the windowed subband data $z_k$.
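For reference, the MDCT of MPEG-1 Layer III is commonly written in the form below. The symbols are assumptions chosen to match the surrounding text: $n$ is the block length (36 or 12), $z_k$ is the windowed input, and $X_i$ is the $i$-th output.

```latex
X_i = \sum_{k=0}^{n-1} z_k \cos\!\left[ \frac{\pi}{2n} \left( 2k + 1 + \frac{n}{2} \right) (2i + 1) \right],
\qquad i = 0, 1, \ldots, \frac{n}{2} - 1
```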
Because the windowing operation is essentially a pointwise multiplication, the first part is straightforward. We focus on the second part, the MDCT calculation, in the following discussion.
The biggest impediment to efficient implementation of the MDCT is the length of the windowed data. With lengths of 36 (for normal, start, and stop blocks) and 12 (for short blocks), the transformation cannot be implemented with traditional fast algorithms, which operate on data lengths of $2^n$. Special circuit architectures are needed. The tradeoffs involved in the implementation of these MDCT architectures are discussed in the next section.
II. ANALYSIS
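The input data sequence $z_k$ is multiplied by the transform matrix $D = [d_{ik}]$, where $d_{ik}$ is the cosine coefficient that weights input $k$ in output $i$. For a data length of 36, $X_i = \sum_{k=0}^{35} d_{ik} z_k$, $i = 0, 1, \ldots, 17$, is the final result.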
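Direct computation amounts to a matrix-vector product. The following is a minimal floating-point sketch, assuming the MPEG-1 Layer III cosine kernel for the entries of D. The function names `mdct_matrix` and `mdct_direct` and the test signal are illustrative and do not come from the original design.

```python
import math


def mdct_matrix(n=36):
    """Build the (n/2) x n coefficient matrix D = [d_ik] (assumed Layer III kernel)."""
    return [
        [math.cos(math.pi / (2 * n) * (2 * k + 1 + n / 2) * (2 * i + 1))
         for k in range(n)]
        for i in range(n // 2)
    ]


def mdct_direct(z):
    """Direct computation: X_i = sum over k of d_ik * z_k."""
    D = mdct_matrix(len(z))
    return [sum(d * zk for d, zk in zip(row, z)) for row in D]


# Arbitrary test signal of length 36 (a normal, start, or stop block).
z = [math.sin(0.3 * k) for k in range(36)]
X = mdct_direct(z)
print(len(X))  # 18 outputs from 36 inputs
```

Every output needs its own row of 36 coefficients, which is why a direct hardware mapping is fast but costly in multipliers and area.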
The recursive implementation is suitable for parallel VLSI computing. For a data length of 36, computing 18 output points requires 2 × 36 × 18 = 1296 multiplications and 2 × 36 × 18 + 18 = 1314 additions. In hardware, it can be realized with 3 multipliers and 3 adders. The recursive nature of this architecture means that the hardware cost is low. In practice, however, run time and error are its weaknesses. It requires 36 cycles to calculate one output and 36 × 18 = 648 cycles to produce one MDCT. In addition, fixed-point computation in the chip induces round-off error, and this error accumulates in each iteration. The resulting error is therefore much larger in this architecture than in the previous implementations when the same number of bits is used to represent the numbers. In other words, more bits are required to give the same accuracy. Synthesis results indicate that the round-off error is unacceptable, so we do not consider this architecture in the next section.
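The operation and cycle counts quoted for the recursive architecture follow from simple arithmetic, sketched below. The function name `recursive_cost` is illustrative. Applying the same formulas to the short block (12 inputs, 6 outputs) is an assumption, not a figure from the text.

```python
def recursive_cost(n=36, outputs=18):
    """Return (multiplications, additions, cycles) for one MDCT.

    Uses the counts stated in the text: 2 multiplications and 2 additions
    per input sample per output, one extra addition per output, and one
    cycle per input sample per output.
    """
    mults = 2 * n * outputs
    adds = 2 * n * outputs + outputs
    cycles = n * outputs
    return mults, adds, cycles


print(recursive_cost())                 # (1296, 1314, 648)
print(recursive_cost(n=12, outputs=6))  # (144, 150, 72), assuming the same formulas
```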
Bitwidth
In direct hardware implementation of the MDCT, not only the architecture requires examination; the number of bits used to represent the numbers also deserves attention. In computation of the MDCT with equation (2), the input signal and the transform coefficients are represented by a finite number of bits. The number of bits used to represent the input data (iw) is driven by the dynamic range of the input, so there is little latitude for optimization. On the other hand, the number of bits used to represent the coefficients C (cw) can be optimized. The key is to find a cw just long enough that the accuracy of the algorithm is adequate while delay, power consumption, and area are kept to a minimum. It turns out that the optimal cw is architecture dependent, as shown in the next section.
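The effect of the coefficient width can be illustrated by rounding a coefficient to cw bits. The sketch below assumes a signed two's-complement format with one sign bit and cw - 1 fractional bits, with saturation at the range limits. The source does not specify the number format, so both the format and the name `quantize` are assumptions.

```python
import math


def quantize(value, cw):
    """Round a value in [-1, 1] to a signed fixed-point number with cw bits.

    Assumed format: 1 sign bit and cw - 1 fractional bits, saturating at
    the most positive and most negative representable codes.
    """
    scale = 1 << (cw - 1)
    code = round(value * scale)
    code = max(-scale, min(scale - 1, code))  # saturate instead of overflowing
    return code / scale


c = math.cos(math.pi / 4)   # 0.7071...
print(quantize(c, 8))       # 0.7109375
print(quantize(c, 12))      # 0.70703125
```

Away from saturation, the rounding error of a single coefficient is at most half a least significant bit, that is 2^-cw, so each extra bit halves the worst-case coefficient error.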
III. SYNTHESIS
In this section, we examine the dependence of MDCT performance on the number of bits used to represent the coefficients (cw). The performance criteria are accuracy (error), power, delay, and area. The error of each implementation is determined by comparing the $X_i$, the MDCT outputs of the hardware, with the exact results.

For direct computation, increasing cw to 12 decreases the error from 35.899 (100%) to 0.708 (1.9%). However, the area increases from 3.07 mm² to 7.68 mm², the power consumption increases from 26.2 to 63.5 normalized units, and the delay increases from 9.34 ns to 12.47 ns. There is not much improvement in accuracy beyond a cw of 12, so the optimum cw is 12.
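A software sweep over cw shows the same qualitative trend. The sketch below makes several assumptions: the error metric is the sum of absolute differences between quantized and exact outputs, the coefficient format is the signed fixed-point format with cw - 1 fractional bits, the kernel is the MPEG-1 Layer III one, and the test signal is arbitrary. It models coefficient quantization only, not round-off in the datapath, so its numbers are not comparable to the synthesis results.

```python
import math


def quantize(value, cw):
    """Round to cw-bit signed fixed point (assumed format, saturating)."""
    scale = 1 << (cw - 1)
    code = max(-scale, min(scale - 1, round(value * scale)))
    return code / scale


def mdct(z, cw=None):
    """Direct MDCT; if cw is given, coefficients are quantized to cw bits."""
    n = len(z)
    out = []
    for i in range(n // 2):
        acc = 0.0
        for k in range(n):
            d = math.cos(math.pi / (2 * n) * (2 * k + 1 + n / 2) * (2 * i + 1))
            if cw is not None:
                d = quantize(d, cw)
            acc += d * z[k]
        out.append(acc)
    return out


def total_abs_error(z, cw):
    """Assumed metric: sum of |hardware-like output - exact output|."""
    exact = mdct(z)
    approx = mdct(z, cw)
    return sum(abs(a - e) for a, e in zip(approx, exact))


z = [math.sin(0.3 * k) for k in range(36)]
for cw in (4, 8, 12, 16):
    print(cw, total_abs_error(z, cw))
```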
For the oddly stacked implementation, the behavior is similar. For cw's less than 8, the error decreases sharply with increasing coefficient width. For cw's larger than 8, the error remains relatively constant, but the area, delay, and power keep rising as cw increases. A cw of 8 in the oddly stacked implementation achieves the same accuracy as a cw of 12 in direct computation. The reason is that fixed-point arithmetic introduces round-off error, and increasing the number of arithmetic operations in general increases the error. The oddly stacked MDCT requires fewer computations than direct computation, so its error is smaller. Conversely, the direct computation implementation requires more bits to achieve the same accuracy; in this case, 4 extra bits are needed.

Table 1 compares the MDCT implementations by direct computation and by the oddly stacked method. For the same cw of 12, the area of the oddly stacked MDCT is one eighth that of direct computation, its power consumption is one seventh, and its accuracy is also better. Nevertheless, the direct computation architecture is about 4.5 times faster.
Comparing the two implementations with approximately the same amount of error, we choose cw=12 for direct computation and cw=8 for the oddly stacked method. In this situation, the advantages of the oddly stacked architecture in area and power are further enhanced. Delay of the circuit is also shortened, although it is still 3.5 times slower than direct computation.
The slower performance of the oddly stacked architecture can be improved by parallel hardware implementation. If 4 oddly stacked circuits are used in parallel, the speed improves by a factor of 4, while the area and power consumption increase by approximately a factor of 4. This situation is shown in the last two rows of Table 1. The area of the parallel oddly stacked implementation is a third that of direct computation, and its power is only a half. The oddly stacked architecture is therefore superior to direct computation.
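The parallel tradeoff can be checked with a short calculation. The sketch below assumes ideal linear scaling and expresses everything relative to direct computation (taken as 1.0). The single-unit delay ratio of 3.5 is taken from the text. The single-unit area (1/12) and power (1/8) are back-computed from the rounded statements "a third" and "a half" for 4 units; they are not values read from Table 1.

```python
def parallel_scaling(delay, area, power, units):
    """Ideal scaling: delay divides by units, area and power multiply by units."""
    return delay / units, area * units, power * units


# Single oddly stacked unit at cw = 8, relative to direct computation at cw = 12.
# Area and power here are back-computed assumptions, not Table 1 entries.
delay, area, power = parallel_scaling(delay=3.5, area=1 / 12, power=1 / 8, units=4)
print(delay, area, power)
```

Under these assumptions the four-unit design has a relative delay of 0.875, slightly below that of direct computation, at about a third of the area and half the power.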
IV. CONCLUSION
In this paper, we compared three different hardware implementations of the MDCT. The recursive implementation suffers from inaccuracies, while direct computation is inefficient in terms of area and power. The oddly stacked architecture provides the best performance in terms of accuracy, delay, circuit area, and power consumption.
