The Design and implementation of DCT/IDCT Chip with Novel Architecture by 鄭國興
ISCAS 2000 - IEEE International Symposium on Circuits and Systems, May 28-31, 2000, Geneva, Switzerland 
The Design and implementation of DCT/IDCT Chip 
with Novel Architecture 
Kuo-Hsing Cheng*, Chih-Sheng Huang# and Chun-Pin Lin 
Department of Electrical Engineering, Tamkang University, Taipei Hsien, Taiwan, R.O.C. 
E-mail: cheng@ee.tku.edu.tw* cshuang@ee.tku.edu.tw# cplin@ee.tku.edu.tw 
TEL: 886-2-26215656 Ext. 2731  FAX: 886-2-26221565
ABSTRACT 
In this paper, an efficient VLSI architecture for an 8×8
two-dimensional discrete cosine transform and inverse discrete cosine
transform (2-D DCT/IDCT) with a new 1-D DCT/IDCT algorithm is
presented. The proposed algorithm makes all coefficients positive,
which simplifies the design of the multipliers, and the coefficients
have less round-off error than those of Lee's algorithm [1].
For computing the 2-D DCT/IDCT, the row-column decomposition
method is used, and the design of the 1-D DCT/IDCT requires only 9
multipliers and 21 adders/subtractors. The chip is synthesized with a
0.6 μm standard cell library in 1P3M CMOS technology, and it
can operate at up to 100 MHz.
1. INTRODUCTION 
The DCT is widely used in video coding and image compression
applications such as videoconferencing and HDTV [2][3]. The fast
algorithms for computing the 2-D DCT/IDCT can be divided into two
classes: (1) The row-column decomposition methods [4][5]. These
methods separate the 2-D DCT/IDCT into two 1-D DCT/IDCTs with a
transpose memory. They use a 1-D fast DCT/IDCT algorithm to do
the row processing, send the results into a transpose memory to
do the row-column exchange, and then use the 1-D fast DCT
algorithm again to do the column processing; (2) The non-row-column
decomposition methods [6]. These methods directly use a 2-D
DCT/IDCT algorithm to compute the 2-D DCT/IDCT. They need
fewer computing stages but cost much more hardware. Therefore,
they are more suitable for software implementation than for hardware
implementation.
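The row-column decomposition in class (1) can be illustrated with a short Python sketch. It is not the chip's datapath: the 1-D transform here is the plain unnormalized cosine sum rather than a fast algorithm, the scaling factors are omitted, and the function names (`dct_1d`, `transpose`, `dct_2d_row_column`, `dct_2d_direct`) are hypothetical. The `transpose` helper stands in for the transpose memory.

```python
import math


def dct_1d(x):
    """Unnormalized N-point DCT: Z[u] = sum_i x[i] * cos((2i+1) * u * pi / (2N))."""
    n = len(x)
    return [
        sum(x[i] * math.cos((2 * i + 1) * u * math.pi / (2 * n)) for i in range(n))
        for u in range(n)
    ]


def transpose(block):
    """Stand-in for the transpose memory: exchange rows and columns."""
    return [list(column) for column in zip(*block)]


def dct_2d_row_column(block):
    """Row pass, row-column exchange, column pass, exchange back."""
    row_pass = [dct_1d(row) for row in block]
    column_pass = [dct_1d(row) for row in transpose(row_pass)]
    return transpose(column_pass)


def dct_2d_direct(block):
    """Reference: evaluate the separable 2-D cosine sum without decomposition."""
    n = len(block)
    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            out[u][v] = sum(
                block[i][j]
                * math.cos((2 * i + 1) * u * math.pi / (2 * n))
                * math.cos((2 * j + 1) * v * math.pi / (2 * n))
                for i in range(n)
                for j in range(n)
            )
    return out
```

For an 8×8 block the two functions should agree to within floating-point error. The row-column version needs 16 eight-point 1-D transforms (8 rows and 8 columns), which is why a 1-D unit combined with a transpose memory is sufficient.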
Our implementation of an 8×8 2-D DCT/IDCT chip uses the
row-column decomposition method. The most efficient algorithms for
computing the 1-D DCT/IDCT are Lee's [1] and Hou's [7]. Hou's
algorithm has less round-off error than Lee's algorithm [8][9], but
some of its coefficients are positive and some are negative. Generally,
if A and B are positive, the design of A×B is easier than that of
A×(-B). Thus, we present a new algorithm in which all the coefficients
are positive cosine terms, which simplifies the multiplier design, and
which has less round-off error than Lee's.
2. THE FAST DCT ALGORITHM 
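The normalized 1-D N-point DCT is defined as follows:
$$Z_u = \frac{2}{N} C_u \sum_{i=0}^{N-1} x_i \cos\frac{(2i+1)u\pi}{2N}, \quad u = 0 \sim N-1 \qquad (1)$$
where
$$C_u = \frac{1}{\sqrt{2}} \ \text{for } u = 0, \qquad C_u = 1 \ \text{for } u = 1 \sim N-1$$
For simplicity, we neglect the scaling factor $\frac{2}{N} C_u$; then
Equation (1) can be written as
$$Z'_u = \sum_{i=0}^{N-1} x_i \cos\frac{(2i+1)u\pi}{2N}, \quad u = 0 \sim N-1 \qquad (2)$$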
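In the following, it is assumed that N is a power of 2. Substituting
$2u$ and $2u+1$ for the index $u$ separates Equation (2) into even and
odd index forms. The even index form is
$$Z'_{2u} = \sum_{i=0}^{N-1} x_i \cos\frac{(2i+1)u\pi}{2(N/2)}, \quad u = 0 \sim \frac{N}{2}-1 \qquad (3)$$
and the odd index forms are
$$Z'_{2u+1} = \sum_{i=0}^{N-1} x_i \cos\frac{(2i+1)(2u+1)\pi}{2N} \qquad (4)$$
$$Z'_{2u-1} = \sum_{i=0}^{N-1} x_i \cos\frac{(2i+1)(2u-1)\pi}{2N} \qquad (5)$$
In Equations (4) and (5), $u = 0$ implies $Z'_1 = Z'_{-1}$. If we
add Equations (4) and (5) and apply the trigonometric formula
$$\cos(\alpha+\beta) + \cos(\alpha-\beta) = 2\cos\alpha\cos\beta$$
we get
$$Z''_{2u+1} = Z'_{2u+1} + Z'_{2u-1} \qquad (6)$$
$$Z''_{2u+1} = \sum_{i=0}^{N-1} x_i \cdot 2\cos\frac{(2i+1)\pi}{2N} \cdot \cos\frac{(2i+1)u\pi}{2(N/2)} \qquad (7)$$
Because
$$\cos\frac{[2(N-1-i)+1]u\pi}{2(N/2)} = \cos\frac{(2i+1)u\pi}{2(N/2)} \qquad (8)$$
Equation (3) can be written as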
0-7803-5482-6/99/$10.00 ©2000 IEEE
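$$Z'_{2u} = \sum_{i=0}^{N/2-1} [x_i + x_{N-1-i}] \cos\frac{(2i+1)u\pi}{2(N/2)} \qquad (9)$$
From
$$\cos\frac{[2(N-1-i)+1]\pi}{2N} = \cos\left(\pi - \frac{(2i+1)\pi}{2N}\right) = -\cos\frac{(2i+1)\pi}{2N} \qquad (10)$$
and Equation (8), Equation (7) becomes
$$Z''_{2u+1} = \sum_{i=0}^{N/2-1} [x_i - x_{N-1-i}] \cdot 2\cos\frac{(2i+1)\pi}{2N} \cdot \cos\frac{(2i+1)u\pi}{2(N/2)} \qquad (11)$$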
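We define
$$g_i = x_i + x_{N-1-i}, \quad i = 0 \sim \frac{N}{2}-1 \qquad (12)$$
$$h_i = [x_i - x_{N-1-i}] \cdot 2\cos\frac{(2i+1)\pi}{2N}, \quad i = 0 \sim \frac{N}{2}-1 \qquad (13)$$
then Equations (9) and (11) become
$$G_u \equiv Z'_{2u} = \sum_{i=0}^{N/2-1} g_i \cos\frac{(2i+1)u\pi}{2(N/2)}, \quad u = 0 \sim \frac{N}{2}-1 \qquad (14)$$
$$H_u \equiv Z''_{2u+1} = \sum_{i=0}^{N/2-1} h_i \cos\frac{(2i+1)u\pi}{2(N/2)}, \quad u = 0 \sim \frac{N}{2}-1 \qquad (15)$$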
Equations (14) and (15) are both (N/2)-point DCTs; therefore, based
on the formulation derived above, Equations (14) and (15) can
be computed recursively until N=2. The signal flow graph for an 8×8
DCT/IDCT with the scaling factor is shown in Figure 1. Because
the DCT is an orthogonal transform, the signal flow graph for the
IDCT is just the inverse of that for the DCT.
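The recursion described by Equations (12) to (15) can be checked numerically with a minimal Python sketch. It works on the unscaled form of Equation (2) in floating point and is not the pipelined flow graph of Figure 1. The function names are hypothetical, and the recovery step for the odd outputs is an inference from Equation (6) together with $Z'_1 = Z'_{-1}$: $Z'_1 = H_0/2$ and $Z'_{2u+1} = H_u - Z'_{2u-1}$.

```python
import math


def dct_direct(x):
    """Equation (2): Z'[u] = sum_i x[i] * cos((2i+1) * u * pi / (2N))."""
    n = len(x)
    return [
        sum(x[i] * math.cos((2 * i + 1) * u * math.pi / (2 * n)) for i in range(n))
        for u in range(n)
    ]


def dct_fast(x):
    """Recursive N-point DCT for N a power of 2, following Equations (12)-(15)."""
    n = len(x)
    if n == 1:
        return [x[0]]  # a 1-point DCT is the sample itself
    half = n // 2
    # Equation (12): sums of mirrored samples feed the even outputs.
    g = [x[i] + x[n - 1 - i] for i in range(half)]
    # Equation (13): differences scaled by the positive factor 2cos((2i+1)pi/(2N)).
    h = [
        (x[i] - x[n - 1 - i]) * 2 * math.cos((2 * i + 1) * math.pi / (2 * n))
        for i in range(half)
    ]
    big_g = dct_fast(g)  # Equation (14): G[u] = Z'[2u]
    big_h = dct_fast(h)  # Equation (15): H[u] = Z'[2u+1] + Z'[2u-1]
    z = [0.0] * n
    for u in range(half):
        z[2 * u] = big_g[u]
    z[1] = big_h[0] / 2  # Z'[-1] = Z'[1], so H[0] = 2 * Z'[1]
    for u in range(1, half):
        z[2 * u + 1] = big_h[u] - z[2 * u - 1]
    return z
```

For any power-of-2 length, `dct_fast(x)` should match `dct_direct(x)` to within floating-point error. Every multiplier constant in the sketch has the form $2\cos((2i+1)\pi/(2N))$ with $(2i+1)\pi/(2N) < \pi/2$, so all of them are positive.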
[Signal flow graph, DCT and IDCT directions, outputs $Z_0$ to $Z_7$]
where $C_k = \cos(k\pi/16)$
Figure 1. The DCT/IDCT signal flow graph 
3. SIMPLIFICATION OF THE SIGNAL FLOW GRAPH
In Figure 1, the DCT/IDCT signal flow graph requires 13
multipliers and 29 adders/subtractors. In the following, we
simplify the DCT/IDCT signal flow graph. The simplification flow is
as follows:
• Multiplying each row of the signal flow graph in Figure 1 by
$(2\cos(\pi/4))^{-1}$ saves one multiplier, as shown in
Figure 2.
• As shown in Figure 2, the parts enclosed by the dashed line and
by the solid line are the same. If we take away one of them and
pipeline the signal flow graph appropriately, the hardware
saves eight adders/subtractors and three multipliers.
Finally, the hardware comparison is shown in Table 1.

                      before simplify   after simplify
multipliers                 13                 9
adders/subtractors          29                21
Table 1. The hardware comparison
Figure 3. The timing flow of DCT 
Figure 4. The timing flow of IDCT 
4. THE VLSI IMPLEMENTATION 
Because the 2-D DCT/IDCT is a separable transform, it can be
implemented by a series of 1-D DCT/IDCTs with a transpose
memory. Figure 5 shows that the 2-D DCT/IDCT is implemented with
only one 1-D DCT/IDCT unit and a transpose memory.
Figure 5. The implementation of 2-D DCT/IDCT
Because the directions of signal flow of the DCT and the IDCT are
different, each pipelined stage must include many multiplexers. To
solve the complicated routing, we place the DCT/IDCT
in a sandwich form, as shown in Figure 6. Therefore, whether
the DCT or the IDCT is processed, the wires are routed only
through the control unit.
Figure 6. The placement of DCT/IDCT 
Figure 7 shows the data format of the DCT/IDCT chip. The
DCT/IDCT chip requires seven kinds of multiplier coefficients.
We use Booth coding [10] to reduce the number of nonzero bits
of the multiplier coefficients, as shown in Table 2.
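The effect of signed-digit recoding on the coefficient bits can be sketched as follows. The sketch uses the non-adjacent form (NAF), a standard signed-digit recoding, as a stand-in; the redundant binary Booth recoding of [10] and the actual entries of Table 2 may differ. The coefficient set $2\cos(k\pi/16)$ for k = 1 to 7 and the 12 fractional bits are assumptions chosen for illustration only.

```python
import math

FRACTION_BITS = 12  # assumed word length, not taken from the paper


def naf_digits(value):
    """Non-adjacent form of a non-negative integer.

    Returns digits in {-1, 0, 1}, least significant first, such that
    sum(d * 2**i) == value and no two adjacent digits are both nonzero.
    """
    digits = []
    while value > 0:
        if value & 1:
            digit = 2 - (value & 3)  # +1 if value % 4 == 1, -1 if value % 4 == 3
            value -= digit
        else:
            digit = 0
        digits.append(digit)
        value >>= 1
    return digits


def nonzero_digit_counts():
    """Compare nonzero digits before and after recoding for each assumed coefficient."""
    rows = []
    for k in range(1, 8):
        quantized = round(2 * math.cos(k * math.pi / 16) * (1 << FRACTION_BITS))
        plain = bin(quantized).count("1")
        recoded = sum(1 for d in naf_digits(quantized) if d != 0)
        rows.append((k, quantized, plain, recoded))
    return rows


if __name__ == "__main__":
    for k, quantized, plain, recoded in nonzero_digit_counts():
        print(f"k={k}: value={quantized}, binary nonzeros={plain}, NAF nonzeros={recoded}")
```

Each nonzero digit corresponds to one partial-product row in a constant-coefficient multiplier, so a recoding with fewer nonzero digits leaves fewer rows to add. A digit of -1 turns the corresponding row into a subtraction.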
Figure 7. The data format of DCT/IDCT 
Table 2. The coefficients of DCT/IDCT 
A 6-bit × 6-bit multiplier implementation that uses the Wallace tree
architecture [11] for simplification is shown in Figure 8. For example,
Figure 9 shows a multiplier whose coefficient is equal to
$2\cos(\pi/16)$. It is important to simplify the partial products and
the sign-bit extension of the multiplier. Our simplification flow is as
follows:
• In Figure 10, let "dddd" be the sign-bit extension; "dddd"
can be represented as "$\bar{d}$" added to an "all-ones compensation
vector". Using the method shown in Figure 10, the sign-bit
extensions in Figure 9 are replaced by "all-ones compensation
vectors", which can be summed beforehand.
• Using the combining technique for sign-bit extension shown in
Figure 11, the partial products of the multiplier in Figure 12 can be
reduced by one row.
• In Figure 13, the Wallace tree architecture is used to
reduce the partial products to two rows. The two rows then
require only a fast adder to generate the final product.
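The first and third steps above can be sketched in Python under stated assumptions: an 8-bit two's-complement data word, a positive coefficient given as a plain binary integer (the Booth-recoded digits and the row-merging step of Figure 11 are not modelled), and hypothetical function names. The sketch relies on the identity that a sign bit d extended to the left equals the inverted bit at the sign position plus a constant all-ones vector, so the constants of all rows can be summed in advance.

```python
def carry_save_add(a, b, c, mask):
    """3:2 compressor on whole rows: a + b + c == s + carry (mod mask + 1)."""
    s = a ^ b ^ c
    carry = (((a & b) | (a & c) | (b & c)) << 1) & mask
    return s, carry


def wallace_reduce(rows, mask):
    """Compress rows three at a time, level by level, until two rows remain."""
    rows = list(rows)
    while len(rows) > 2:
        next_rows = []
        for i in range(0, len(rows) - 2, 3):
            next_rows.extend(carry_save_add(rows[i], rows[i + 1], rows[i + 2], mask))
        next_rows.extend(rows[len(rows) - len(rows) % 3:])  # leftover rows pass through
        rows = next_rows
    return rows


def multiply(x, coeff, data_bits=8):
    """Multiply a signed data_bits-wide x by a positive integer coefficient."""
    width = data_bits + coeff.bit_length()
    mask = (1 << width) - 1
    sign_pos = data_bits - 1
    low = x & ((1 << sign_pos) - 1)       # bits below the sign bit
    d_bar = 1 - ((x >> sign_pos) & 1)     # inverted sign bit
    shifts = [j for j in range(coeff.bit_length()) if (coeff >> j) & 1]
    # All-ones compensation vectors depend only on the coefficient,
    # so their sum is a constant that can be collected beforehand.
    compensation = sum(-(1 << (sign_pos + j)) for j in shifts) & mask
    # Each partial product carries d_bar instead of a run of sign-extension bits.
    rows = [((((d_bar << sign_pos) | low) << j) & mask) for j in shifts]
    rows.append(compensation)
    two_rows = wallace_reduce(rows, mask)
    total = sum(two_rows) & mask          # the final fast adder
    return total - (1 << width) if total >> (width - 1) else total
```

With these assumptions, `multiply(x, coeff)` should equal `x * coeff` for every x from -128 to 127. The carry-save levels never propagate a carry along a row; only the last addition of the two remaining rows needs a carry-propagate adder.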
Figure 9. The multiplier coefficient $2\cos(\pi/16)$
Figure 10. The elimination of sign-bit extension
Figure 11. A combining technique for sign-bit extension
Figure 12. The first simplification of the multiplier
5. THE SIMULATION RESULT 
The circuit modules of the chip are designed in Verilog HDL. The
Verilog HDL programs are synthesized by the Synopsys tool with
the 0.6 μm Compass standard cell library. Figure 14 shows the
layout, for which the Cadence Silicon Ensemble tool performs
automatic placement and routing in 0.6 μm 1P3M technology. The
core area is about 3.916 × 3.916 mm², and the chip can operate at
up to 100 MHz. The features of our DCT/IDCT chip are shown in
Table 3, and the comparisons with other chips are shown in Table
4.
Figure 14. The layout of the 2-D DCT/IDCT
Table 3. The features of the 8×8 2-D DCT/IDCT chip

Chip   Size   Function   Technology (μm)   Transistor count   Speed (MHz)
[2]    8×8    DCT/IDCT   1.2               320,000            50
[3]    8×8    DCT/IDCT   0.8               180,000            50
[5]    8×8    DCT        0.8               147,839            50
[6]    8×8    IDCT       0.6               402,048            71
ours   8×8    DCT/IDCT   0.6               155,895            100
Table 4. The comparison of DCT/IDCT chips
6. CONCLUSION 
In this paper, we propose an efficient architecture to implement a
2-D DCT/IDCT with a new algorithm. The proposed
algorithm makes all coefficients positive, which simplifies the design
of the multipliers. The efficient architecture for the proposed
algorithm requires only 9 multipliers and 21 adders/subtractors.
The transistor count of the designed circuit is less than 160,000. As
Table 4 shows, the simulated performance of our chip is
better than that of the other chips, and the chip is suitable for
high-speed applications such as HDTV.
REFERENCES
[1] B. G. Lee, "A new algorithm to compute the discrete cosine
transform," IEEE Trans. Acoust., Speech, and Signal
Processing, vol. ASSP-32, pp. 1243-1245, Dec. 1984.
[2] V. Srinivasan and K. J. Ray Liu, "VLSI Design of
High-Speed Time-Recursive 2-D DCT/IDCT Processor for
Video Applications," IEEE Transactions on Circuits and
Systems for Video Technology, vol. 6, no. 1, February 1996.
[3] T. Miyazaki, T. Nishitani, M. Edahiro, I. Ono,
and K. Mitsuhashi, "DCT/IDCT processor for HDTV
developed with DSP silicon compiler," J. VLSI Signal
Processing, no. 5, pp. 151-158, 1993.
[4] C. C. Ju, "A High-Throughput DCT/IDCT Architecture and
Design Methodology with Application to Real-Time Digital
Video Codec System and Associated CAD Design," Master
Thesis, National Chiao Tung Univ., Taiwan, June 1997.
[5] J. S. Chiang and H. C. Huang, "Novel architecture for
two-dimensional high throughput rate real-time discrete cosine
transform and the VLSI design," Int. J. Electronics, vol. 83,
no. 4, pp. 519-527, 1997.
[6] Y. P. Lee, T. H. Chen and L. G. Chen, "A Cost-Effective
Architecture for 8×8 Two-Dimensional DCT/IDCT Using
Direct Method," IEEE Transactions on Circuits and Systems
for Video Technology, vol. 7, no. 3, pp. 459-467, June 1997.
[7] H. S. Hou, "A fast recursive algorithm for computing the
discrete cosine transform," Transactions on Computers, vol.
C-31, pp. 899-906, Sept. 1982.
[8] K. T. Lo and W. K. Cham, "Analysis of Pruning in Fast
Cosine Transform," IEEE Transactions on Signal Processing,
vol. 44, no. 3, March 1996.
[9] H. R. Wu and F. J. Paoloni, "A Two-Dimension Fast Cosine
Transform Algorithm Based on Hou's Approach," IEEE
Transactions on Signal Processing, vol. 39, no. 2, February
1991.
[10] C. N. Lyu and D. W. Matula, "Redundant Binary Booth
Recoding," Symposium on Computer Arithmetic, pp. 50-57,
July 1995.
[11] William J. Stenzel, William J. Kubitz and Gilles H. Garcia,
"A Compact High-Speed Parallel Multiplication Scheme,"
IEEE Transactions on Computers, vol. C-26, no. 10,
pp. 948-957, 1977.
