Abstract-In this paper, we develop a novel 8 × 8 two-dimensional (2-D) discrete cosine transform/inverse discrete cosine transform (DCT/IDCT) architecture based on the direct 2-D approach and the rotation technique. The computational complexity is reduced by taking advantage of the special attributes of complex numbers. Both parallel and folded architectures are proposed. Unlike other direct approaches, the proposed architecture is regular and economical enough for VLSI implementation. Compared to the row-column method, a shorter internal wordlength is needed to meet the error requirements of the IDCT, and the throughput of the proposed architecture is twice that of the row-column method at the cost of about 30% more hardware.
Manuscript received February 6, 1996; revised September 9, 1996. This paper was recommended by Associate Editor N. Demassieux.
I. INTRODUCTION
Among the various transform techniques for image compression, the discrete cosine transform (DCT) [1] is the most popular and effective one in practical image and video coding applications, such as high-definition television (HDTV), because it gives almost optimal performance and can be implemented at an acceptable cost. There are three ways to realize the two-dimensional (2-D) DCT: 1) the indirect method based on row-column decomposition [2]-[5]; 2) the direct method [6]-[9], for example using the polynomial transform [6]; and 3) methods based on other transforms, such as the discrete Fourier transform (DFT) and the discrete Hartley transform (DHT) [10]. The indirect method has the advantage of regularity for very large scale integration (VLSI) implementation. Therefore, most chips for the 2-D DCT have been implemented by the indirect method [2]-[5]. In [2], an 8 × 8 2-D DCT/inverse DCT (IDCT) processor chip is presented that can be used for high-data-rate image and video coding at rates above 55 MHz while using low-cost 2-μm CMOS technology. To be applicable to the real-time processing of HDTV signals, [3] develops a 100-MHz 2-D DCT core processor that introduces a fast DCT algorithm and multiplier accumulators based on distributed arithmetic to reduce the hardware amount and enhance the speed. Reference [5] also proposes a 100-MHz 2-D 8 × 8 DCT/IDCT processor whose architecture is entirely multiplierless and highly parallel, with a chip area only half that of [3]. However, the indirect method requires more computation than the direct method. The direct method requires fewer computations, but it suffers from irregularity. Reference [6] provides a direct DCT algorithm based on polynomial transform techniques with the lowest computation count known so far.
To improve the regularity, [8] presents a fast 2-D DCT algorithm which requires only N one-dimensional (1-D) DCT's and additions, instead of the 2N 1-D DCT's used in the conventional row-column approach. In practice, however, those direct methods are still not suitable for 8 × 8 DCT VLSI implementation. Nevertheless, their low computational complexity remains attractive, which has motivated recent research into low-computation, regular 2-D DCT structures.
In this paper, we propose a cost-effective architecture for the 8 × 8 2-D DCT that combines high regularity with a low computation count. First, the real-valued input is mapped into complex numbers in the 2-D DCT [6]. The computational complexity is then reduced by rotation techniques in the complex number system. For the 8 × 8 2-D DCT, a further modification makes the architecture more regular, so that it can be folded to an economical size for VLSI implementation. The finite wordlength analysis for the IDCT demonstrates that the proposed architecture requires fewer internal bits than other methods. In the following section, we show that an N × N 2-D DCT/IDCT can be realized by only N N-point 1-D DCT/IDCT's and some additional summations. With some modifications in Section III, we obtain a more regular architecture for 8 × 8 2-D DCT/IDCT's. Finally, we analyze the internal wordlength problem and compare the hardware complexity of the row-column method and the proposed design.
II. METHODOLOGY

A. The Mapping of Input Data
The 2-D DCT [1] of an N × N real signal is defined as (1).
For convenience, we introduce by neglecting the kernel factor so that (2a) and (2b)
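The double cosine sum that underlies the 2-D DCT, and the row-column alternative discussed in the Introduction, can be compared numerically. The following is a minimal sketch, assuming the unscaled DCT-II kernel cos((2n+1)kπ/2N) in each dimension, that is, with the kernel factor neglected. The exact form of the paper's (2a) and (2b) is not reproduced here, and the function names dct2_direct, dct1, and dct2_row_column are hypothetical.

```python
import math


def dct2_direct(x):
    """Unscaled 2-D DCT-II evaluated straight from the double cosine sum."""
    n = len(x)
    out = [[0.0] * n for _ in range(n)]
    for k1 in range(n):
        for k2 in range(n):
            acc = 0.0
            for n1 in range(n):
                c1 = math.cos((2 * n1 + 1) * k1 * math.pi / (2 * n))
                for n2 in range(n):
                    c2 = math.cos((2 * n2 + 1) * k2 * math.pi / (2 * n))
                    acc += x[n1][n2] * c1 * c2
            out[k1][k2] = acc
    return out


def dct1(v):
    """Unscaled N-point 1-D DCT-II of a list v."""
    n = len(v)
    return [
        sum(v[i] * math.cos((2 * i + 1) * k * math.pi / (2 * n)) for i in range(n))
        for k in range(n)
    ]


def dct2_row_column(x):
    """Row-column method: N row transforms followed by N column transforms."""
    n = len(x)
    # rows[n1][k2]: 1-D DCT along the second index of each row
    rows = [dct1(row) for row in x]
    # cols[k2][k1]: 1-D DCT along the first index of each intermediate column
    cols = [dct1([rows[n1][k2] for n1 in range(n)]) for k2 in range(n)]
    # Transpose back so that the result is indexed as [k1][k2]
    return [[cols[k2][k1] for k2 in range(n)] for k1 in range(n)]
```

For N = 8 the row-column route calls the 1-D transform 2N = 16 times (eight rows plus eight columns), which is the count that the direct approach aims to reduce.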
In the following, we will assume N to be a power of two. Using the permutation of [6], the signal can be permuted as shown at the bottom of the page. Equation (2a) can be rewritten as
Note that (5) requires in (4) to be computed for all and only a sufficient subset of such that cover all possible values of [6] .
B. The Proposed 2-D DCT Algorithm
As shown in (4), the exponential term can be treated as a rotation. In the row-column method, the term is rotated twice, once for the row computation and once for the column computation. However, with a suitable relation between the indices, the term W can be realized by a single rotation, which reduces the computation.
Fig. 1. The mapping from x n ;n to y n ;t when N = 4.
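The saving from rotating once instead of twice can be illustrated with ordinary complex arithmetic. The sketch below assumes the usual twiddle factor W_n = exp(-j2π/n); the helper names and the sign convention are assumptions and are not taken from (4) or (6). It only shows the general principle that two successive rotations collapse into one whose exponent is the sum of the two, reduced modulo the period n.

```python
import cmath
import math


def w(n, k):
    """Twiddle factor W_n^k = exp(-j*2*pi*k/n), i.e. a rotation by -2*pi*k/n."""
    return cmath.exp(-2j * math.pi * k / n)


def rotate_twice(z, n, a, b):
    """Two separate rotations: two complex multiplications."""
    return (z * w(n, a)) * w(n, b)


def rotate_once(z, n, a, b):
    """One merged rotation: a single complex multiplication.

    The exponent can be reduced modulo n because W_n has period n.
    """
    return z * w(n, (a + b) % n)
```

For instance, rotate_twice(3 + 4j, 32, 5, 30) and rotate_once(3 + 4j, 32, 5, 30) agree to within floating-point rounding, while the second form uses half as many complex multiplications.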
By the following relation where (6) the signal is mapped as . If is fixed, the mapping from to is one-to-one. However, with different , the mapping order is not the same. For example, the case of when is 0, maps to ; when is 1, maps to
. Fig. 1 shows the mapping of inputs from to when . The modulo operation in (6) disappears because the period of W is just . By substituting (6) into (4), (4) can be rewritten as
In (7b), is no longer in the range from 0 to , which is a common attribute of the ordinary transform. Consider the following relation:
where integer and By substituting the above relation into (7b), we can obtain
Let the computation of 's summation be represented by . Then we can find W Although is a complex number, its real part is indeed an -point 1-D DCT, and its imaginary part can be obtained by the relation Im
Im Re
This reveals that it can be obtained by calculating an N-point 1-D DCT. Since the remaining factor requires no actual multiplication and only affects the additions, an N × N 2-D DCT can therefore be realized by N N-point 1-D DCT's with some additions. In contrast, the row-column method needs 2N N-point 1-D DCT's. Similar results were derived in [8] with a different approach, but that structure is not regular and so is unsuitable for VLSI implementation. To overcome this problem, the proposed algorithm leads to a regular architecture, as illustrated in the following section.
III. ARCHITECTURE OF AN 8 × 8 2-D DCT
When realizing the 2-D DCT, the additions are always irregular, especially when N is large. However, most video compression standards, such as H.261, JPEG [12], MPEG-1, and MPEG-2 [13], need only the 8 × 8 2-D DCT/IDCT. In this section, we present a regular structure for the 8 × 8 2-D DCT/IDCT and show that the architecture can be folded to one fourth of its original size. It should be noted that the IDCT architecture can be derived in the same way.
A. The Parallel 8 × 8 2-D DCT Architecture
As mentioned in [6], when N = 8, it is only necessary to compute for all but and . Then the summation of in (7a) with different can be expanded as shown in (9a)-(9e) at the bottom of the page. For implementation, we partition the computation of (9a)-(9e), together with (5), into three stages.
Stage 1) Pre-addition: Computing the values of , and according to (9a)-(9e), respectively. The architecture for realizing the pre-addition is described in Fig. 2 . This architecture includes butterfly computations and two constant multiplications (multiplied by ) for each . For the case of the term in (9d) and (9e) when computing and , both substructures for even and odd are different, as shown in Fig. 2(a) , and is illustrated in Table I . Therefore, the computation of can be achieved by the data flow in Fig. 3 which consists of the following three steps.
Step a) Calculate the summation of in (11) for the values of from zero to seven.
Step b) The output from a) is multiplied by .
Step c) According to , the output from b) is rotated so that the outputs appear in increasing order of .
Second, both inputs of and , and , are real numbers. Therefore, the real part of and is an even function, and the imaginary part of and is an odd function. Let W
Thus, we can easily obtain and from by (14), where (˜) denotes complex conjugation, and the index implies that the index in (12) is replaced by . Equation (12) can be realized by step a). Based on the same concept as (11), the relation between and is concluded as when when
Stage 3) Postaddition: Computing and in (5). This is simply a butterfly operation, as shown in Fig. 4.
In Stages 1 and 2, the interconnection is not local, but it is still regular. This regularity can be exploited to fold the architecture, as described in the next subsection. Stage 2 requires mainly four eight-point complex DCT's, but the computation of with 0, 4 needs an additional butterfly stage. Nevertheless, this additional butterfly, which computes (14), is the same as that in Fig. 4. Stage 3 is composed of butterfly adders, as shown in Fig. 4. With this implementation strategy, the proposed architecture is more regular than that of [8].
, and fed to the folded architecture in a sequential way. Thus, an additional re-ordering circuit is needed to rearrange the input from its arrival order into the desired order. In the re-ordering circuit (Fig. 7), there are three register files (RF0, RF1, RF2), each of which can store eight input data. For example, the number "0" in Fig. 7(a) means that the data is temporarily stored in the register file. The desired order is then derived by the appropriate control signal. Similarly, the output port needs another re-ordering circuit to rearrange the output sequence. To reduce the cost of this post re-ordering, the circuit can be merged into the zigzag circuit when the DCT is applied to video standards.
In needed in Fig. 6 . The folded size is now reasonable for chip implementation.
Since a pipeline scheme is introduced into the proposed architecture, the critical path lies in the 1-D DCT. That is, the clock rate of the proposed architecture can be as fast as that of an existing 1-D DCT architecture.
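The symmetry argument used in Stage 2, where a transform of real inputs has an even real part and an odd imaginary part, can be checked on a small example. The sketch below uses a plain DFT as a stand-in for the paper's sequences in (12)-(14), which are not reproduced here; dft and complete_by_symmetry are hypothetical helper names. It shows that only about half of the outputs need to be computed, and the rest follow by complex conjugation.

```python
import cmath
import math


def dft(x):
    """Direct N-point DFT, used here only as a reference transform."""
    n = len(x)
    return [
        sum(x[i] * cmath.exp(-2j * math.pi * i * k / n) for i in range(n))
        for k in range(n)
    ]


def complete_by_symmetry(half, n):
    """Rebuild all n outputs of a real-input DFT from outputs 0..n//2.

    For real input, X[n - k] is the complex conjugate of X[k], so the real
    part is even in k and the imaginary part is odd in k.
    """
    full = list(half)
    for k in range(n // 2 + 1, n):
        full.append(half[n - k].conjugate())
    return full
```

With an eight-point real input, complete_by_symmetry(dft(x)[:5], 8) reproduces all eight outputs of dft(x), so three of the eight outputs cost no arithmetic beyond a sign change of the imaginary part.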
IV. FINITE WORDLENGTH ANALYSIS
By reversing the data flow of the above DCT, we can obtain the IDCT architecture. When implementing an IDCT architecture, there are two inherent error sources that reduce the accuracy of the output: the quantization error of the coefficients and the finite internal wordlength. Therefore, the Joint CCITT/ISO committee has established a specification to evaluate the errors caused by finite wordlength in the IDCT [12]. According to this specification, the proposed IDCT architecture is tested with 10 000 8 × 8 blocks of random numbers in the range from −256 to 255 as the input.
The plots of overall mean square error versus internal wordlength for different coefficient wordlengths are shown in Fig. 8(a). The horizontal dashed line is the upper bound on the overall mean square error. Fig. 8(b), (c), and (d) show the corresponding analyses of peak mean square error, overall mean error, and peak mean error, respectively. The analysis implies that an 11-bit coefficient wordlength and a 17-bit internal wordlength are enough to satisfy all four error requirements. Compared to the row-column approach of [2], in which the required internal wordlength is 22 b, the proposed approach needs fewer bits. This is because the proposed architecture requires fewer multiplications. Moreover, the input of the row-column method must be multiplied by coefficients twice in cascade, whereas most inputs of the proposed method need to be multiplied only once.
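The kind of experiment behind Fig. 8 can be sketched in software. The following is a simplified illustration, not the proposed architecture: it applies fixed-point rounding to a plain row-column IDCT and measures the four error statistics named above on random blocks in the range from -256 to 255. The orthonormal scaling, the clipping of coefficients to [-2048, 2047], the use of fractional bits rather than total wordlength, and every function name are assumptions made for this sketch.

```python
import math
import random

N = 8
# Orthonormal DCT-II basis, C[k][n] = a(k) * cos((2n+1) k pi / 2N)  (assumed scaling)
C = [[(math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N))
      * math.cos((2 * n + 1) * k * math.pi / (2 * N)) for n in range(N)]
     for k in range(N)]


def quantize(v, frac_bits):
    """Round v to a fixed-point grid with frac_bits fractional bits."""
    scale = 1 << frac_bits
    return round(v * scale) / scale


def clip(v, lo, hi):
    return max(lo, min(hi, v))


def fdct2(x):
    """Floating-point forward 2-D DCT, Y = C X C^T."""
    tmp = [[sum(C[k][n] * x[n][m] for n in range(N)) for m in range(N)]
           for k in range(N)]
    return [[sum(tmp[k][m] * C[l][m] for m in range(N)) for l in range(N)]
            for k in range(N)]


def idct2(y, basis, frac_bits=None):
    """Row-column 2-D IDCT, X = B^T Y B, optionally rounding the intermediate."""
    tmp = [[sum(basis[k][n] * y[k][l] for k in range(N)) for l in range(N)]
           for n in range(N)]
    if frac_bits is not None:
        tmp = [[quantize(v, frac_bits) for v in row] for row in tmp]
    return [[sum(tmp[n][l] * basis[l][m] for l in range(N)) for m in range(N)]
            for n in range(N)]


def run_test(coef_frac_bits, internal_frac_bits, n_blocks=10000, seed=1):
    """Compare a fixed-point IDCT against a floating-point reference IDCT."""
    rng = random.Random(seed)
    qbasis = [[quantize(c, coef_frac_bits) for c in row] for row in C]
    err_sum = [[0.0] * N for _ in range(N)]
    sq_sum = [[0.0] * N for _ in range(N)]
    peak = 0
    for _ in range(n_blocks):
        x = [[rng.randint(-256, 255) for _ in range(N)] for _ in range(N)]
        y = [[clip(round(v), -2048, 2047) for v in row] for row in fdct2(x)]
        ref = idct2(y, C)
        out = idct2(y, qbasis, internal_frac_bits)
        for n in range(N):
            for m in range(N):
                e = (clip(round(out[n][m]), -256, 255)
                     - clip(round(ref[n][m]), -256, 255))
                err_sum[n][m] += e
                sq_sum[n][m] += e * e
                peak = max(peak, abs(e))
    mean = [err_sum[n][m] / n_blocks for n in range(N) for m in range(N)]
    mse = [sq_sum[n][m] / n_blocks for n in range(N) for m in range(N)]
    return {
        "peak_error": peak,
        "peak_mse": max(mse),                      # worst pixel position
        "overall_mse": sum(mse) / (N * N),         # averaged over all positions
        "peak_mean_error": max(abs(v) for v in mean),
        "overall_mean_error": abs(sum(mean) / (N * N)),
    }
```

Calling run_test for a range of coef_frac_bits and internal_frac_bits values and plotting overall_mse against the internal precision gives curves of the same kind as Fig. 8(a), although the numbers will differ from those of the proposed architecture. The pure-Python loops are slow for 10 000 blocks, so a smaller n_blocks is convenient for a quick check.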
Based on the same analysis, Table II
Fig. 6, the proposed design can be implemented mainly by two 1-D DCT's and four 4 × 4 transpose memories, which is just what the row-column design method requires (with one 8 × 8 transpose memory in place of the four 4 × 4 ones). In addition, the proposed folded architecture requires 76 extra adders and four extra constant multipliers. To calculate the transistor count for both the row-column method and the proposed method, we adopt the architecture of the FCT [11] for the 1-D DCT block. The transistor count for each element (FA, DFF, etc.) is taken from [14]. According to Section IV, to satisfy all the error requirements of the IDCT, the internal and coefficient wordlengths are 18 b and 12 b, respectively, for the proposed method, whereas the internal wordlength of the row-column method is 22 b. The circular shifter is implemented by multiplexers. The post re-ordering is merged into the zigzag hardware and hence can be ignored. The evaluation shows that the proposed method needs about 30% more transistors than the row-column method, as illustrated in Table III. However, the proposed design can be fed with 16 inputs at a time, while the row-column method can be fed with only eight. This implies that if the proposed architecture and the row-column architecture have the same internal clock rate, the throughput of the proposed architecture is twice that of the row-column architecture. To double the pixel rate, either two input ports or a doubled external clock for pixel input can be used. Nowadays, a 1-D DCT architecture can run at 100 MHz or more. Therefore, the proposed architecture can sustain a pixel rate of at least 200 MHz.
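The throughput and cost figures quoted above can be combined in a short calculation. This is only a back-of-the-envelope sketch: it assumes that the row-column design sustains one pixel per internal clock cycle, so that the pixel rate scales with the number of inputs accepted at a time, and the function names are hypothetical. The throughput-per-transistor figure is derived here from the quoted numbers and is not stated in the paper.

```python
def throughput_ratio(inputs_proposed=16, inputs_row_column=8):
    """Throughput gain at equal internal clock rate."""
    return inputs_proposed / inputs_row_column


def pixel_rate_mhz(clock_mhz, ratio):
    """Pixel rate, assuming the row-column baseline delivers one pixel per clock."""
    return clock_mhz * ratio


def throughput_per_transistor(ratio, transistor_overhead=0.30):
    """Relative throughput per transistor versus the row-column design."""
    return ratio / (1.0 + transistor_overhead)


ratio = throughput_ratio()            # 16 inputs versus 8 inputs at a time
rate = pixel_rate_mhz(100.0, ratio)   # 100-MHz 1-D DCT clock
gain = throughput_per_transistor(ratio)
```

Under these assumptions, the ratio is 2, the pixel rate is 200 MHz, and the throughput per transistor is about 1.5 times that of the row-column design.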
It should be noted that if a bit-serial word-parallel scheme is adopted for implementation, the total transistor count will be far smaller than the figures in Table III.
VI. CONCLUSIONS
This research analyzes the 2-D DCT/IDCT algorithm using the direct method. The required multiplications are about half those of the row-column algorithm. To overcome the irregularity problem, a regular architecture derived from the proposed algorithm is developed for the 8 × 8 DCT/IDCT. After folding, the chip size becomes economical for VLSI implementation. Traditionally, the direct method has lower computational complexity but is irregular, whereas the row-column method is more regular at the cost of more computation. The proposed architecture combines low computational complexity with high regularity. According to the specification of the Joint CCITT/ISO committee on the IDCT, the proposed design needs only a coefficient wordlength of 12 b and an internal wordlength of 18 b. Compared with the row-column method, the estimated chip size of the proposed architecture is about 30% larger, but the throughput is twice that of the row-column method. These results show that the proposed 2-D DCT/IDCT architecture is more attractive than other methods.
Yung-Pin Lee (S'91) was born in Taipei, Taiwan, in 1969. He received the B.S. degree in electrical engineering from National Taiwan University, Taipei, in 1991. He is currently a Ph.D. candidate in electrical engineering at National Taiwan University and will graduate in 1997.
His current research interests include video and audio coding systems, DSP architecture design, video signal processor design, and VLSI signal processing.
Dr. Chen is a member of the honor society Phi Tau Phi. In 1993, she received the Long-Term Paper Award and Xerox Paper Award.
Thou-Ho Chen
Chung-Wei Ku (S'91) was born in Taipei, Taiwan, in 1968. He received the B.S. degree in electrical engineering from National Taiwan University, Taipei, in 1991. He is currently a Ph.D. candidate in electrical engineering at National Taiwan University.
His current research interests include visual signal representation, very low bit-rate video coding, multimedia telephony, and VLSI design for DSP.
