New architecture for high throughput-rate real-time 2-D DCT and the VLSI design by Chiang, Jen-shiun
New Architecture for High Throughput-Rate Real-Time 2-D DCT and
the VLSI Design
Jen-Shiun Chiang and Hsiang-Chou Huang
Department of Electrical Engineering
Tamkang University
Tamsui, Taipei, Taiwan
Abstract -- The discrete cosine transform (DCT) has been
widely used as the core of digital image and video signal
compression. However, its computation is so intensive that
high-speed hardware is necessary to meet real-time requirements.
In this paper, a new architecture for the VLSI design of the 2-D
DCT is developed. This architecture has the following
features: (1) programmable logic arrays (PLAs) replace the
multipliers, (2) overlapped row-column operations and a
pipeline structure reduce the total computation time, and
(3) a highly modular and regular structure allows efficient
VLSI implementation. The architecture is applied to an
8 x 8 2-D DCT. The circuit is designed in UMC's 0.8 µm SPDM
CMOS process, and the cell library is provided by ITRI CCL.
Simulation shows that the data processing speed of this DCT is
more than 50 MHz, which is equivalent to 800 million
multiplications and accumulations per second.
I. INTRODUCTION 
The discrete cosine transform (DCT) [1] is an
orthogonal transform which consists of a set of sampled
cosine functions in vector form. The practical
application of the DCT is the 2-D DCT. The 2-D DCT can
be represented as $Z = C^T X C$, where $X$ is a 2-D input
data matrix, and $C^T$ is the transpose of the
transform coefficient matrix $C$. An N-th order DCT
coefficient matrix $C$ is defined as follows
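For reference, one common orthonormal form of the N-th order DCT coefficient matrix that is consistent with $Z = C^T X C$ is sketched below. The index names $m$ and $k$ and the scaling factor $\alpha(k)$ are assumptions made for illustration and may differ from the notation of equation (1).

$$
C(m, k) = \alpha(k) \cos\!\left(\frac{(2m+1)\,k\,\pi}{2N}\right), \qquad
\alpha(k) = \begin{cases} \sqrt{1/N}, & k = 0 \\ \sqrt{2/N}, & k = 1, \ldots, N-1 \end{cases}
\qquad m, k = 0, \ldots, N-1
$$

With this form the columns of $C$ are the sampled cosine basis vectors, so $C^T C = I$.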
A direct implementation of the 2-D DCT requires intensive
matrix computations. Because of this dense computation
requirement, a practical DCT implementation should be a
high performance chip. Among recent designs [2-6], some
use the Distributed Architecture (DA) with memory
look-up tables to implement the DCT chip [2-4]. Others
apply row-column overlapped operations with multipliers
[5], or matrix re-permutation with memory look-up
tables [6].
The DA architecture [2-4] saves area
compared to other designs, but due to the input buffer
requirement it cannot achieve fully pipelined
performance, and the total computation time is very long.
The two designs of [5] and [6] contain multipliers and
transposition RAM in the architecture, which
reduces the computation speed and causes long latency in
total computation cycles.
A new architecture is proposed to combine the merits
of the above two designs: matrix re-permutation with
memory look-up tables and row-column overlapped
operations. The matrix re-permutation simplifies the
connections between the accumulator and the PLA
look-up tables. The elimination of multipliers reduces the
computation delay significantly, and the row-column
overlapped operation not only saves total
computation time, but also simplifies the hardware.
II. THE NEW ARCHITECTURE
The 2-D DCT can be expressed as two cascaded 1-D
DCTs, as shown in Fig. 1. It is implemented by the
row-column decomposition technique. We compute the
8 × 1 DCT of each column of the input data matrix $X$ to
yield $C^T X$. The 8 × 1 DCT of each row of $C^T X$ is then
computed to find the desired 8 × 8 DCT. From (1) the
two stages of the DCT coefficient matrix can be re-written as
follows.
By the following two steps, the 2-D DCT is therefore
obtained.
0-7803-3302-0/96 $5.00 © 1996 IEEE
Authorized licensed use limited to: Tamkang University. Downloaded on March 24,2010 at 02:30:37 EDT from IEEE Xplore.  Restrictions apply. 
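Step 1:
$$Y(l_1, m_2) = \sum_{m_1=1}^{N} X(m_1, m_2)\, C_1(l_1, m_1), \quad \text{for } l_1 = 1 \ldots N \tag{4}$$
Step 2:
$$Z(l_1, l_2) = \sum_{m_2=1}^{N} Y(l_1, m_2)\, C_2(l_2, m_2), \quad \text{for } l_2 = 1 \ldots N \tag{5}$$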
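The two steps of the row-column decomposition, equations (4) and (5), can be sketched in Python as follows. This is a minimal floating-point model, not the hardware: it assumes the orthonormal DCT matrix with $C_1 = C_2 = C^T$, uses 0-based indices instead of $1 \ldots N$, and the function names are hypothetical.

```python
import math

N = 8

def dct_matrix(n=N):
    """Assumed orthonormal DCT matrix: C[m][k] = alpha(k) * cos((2m+1) k pi / (2n))."""
    c = []
    for m in range(n):
        row = []
        for k in range(n):
            alpha = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            row.append(alpha * math.cos((2 * m + 1) * k * math.pi / (2 * n)))
        c.append(row)
    return c

def step1(x, c):
    """Eq. (4): Y(l1, m2) = sum over m1 of X(m1, m2) * C1(l1, m1), with C1 = C^T."""
    n = len(x)
    return [[sum(x[m1][m2] * c[m1][l1] for m1 in range(n))
             for m2 in range(n)] for l1 in range(n)]

def step2(y, c):
    """Eq. (5): Z(l1, l2) = sum over m2 of Y(l1, m2) * C2(l2, m2), with C2 = C^T."""
    n = len(y)
    return [[sum(y[l1][m2] * c[m2][l2] for m2 in range(n))
             for l2 in range(n)] for l1 in range(n)]

def dct2(x):
    """2-D DCT as two cascaded 1-D stages."""
    c = dct_matrix(len(x))
    return step2(step1(x, c), c)

def dct2_direct(x):
    """Reference: Z = C^T X C evaluated as one double sum per output."""
    n = len(x)
    c = dct_matrix(n)
    return [[sum(c[m1][l1] * x[m1][m2] * c[m2][l2]
                 for m1 in range(n) for m2 in range(n))
             for l2 in range(n)] for l1 in range(n)]
```

Each output of `step1` depends on a single value of `l1`, and each output of `step2` on a single value of `l2`, which is the independence that allows all outputs to be accumulated in parallel.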
Notice that (4) and (5) have independent indices;
therefore, they can be done in parallel for all their own
indices. The major difference between the two equations
(4) and (5) is the sequence order of the input data. In (4), the
input data are in the sequence order of $m_1$, and every
$Y(l_1, m_2)$ will be computed. Thus, regular
multiplication and accumulation can be implemented in
(4). In (5), the data are supplied in the sequence
order of $m_2$, and all $Z(l_1, l_2)$ will be computed in the
same way as in (4). Finally, the operations of transposition
and overlapped row-column computation are realized. For
the 8 × 8 2-D DCT, there are only seven kinds of
coefficients in the DCT coefficient matrix. The matrix
form can be expressed as follows:
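$$
C = \begin{bmatrix}
1 & c & a & d & 1 & e & b & f \\
1 & d & b & -f & -1 & -c & -a & -e \\
1 & e & -b & -c & -1 & f & a & d \\
1 & f & -a & -e & 1 & d & -b & -c \\
1 & -f & -a & e & 1 & -d & -b & c \\
1 & -e & -b & c & -1 & -f & a & -d \\
1 & -d & b & f & -1 & c & -a & e \\
1 & -c & a & -d & 1 & -e & b & -f
\end{bmatrix}
$$
In order to find a more regular and modular structure, a
permutation matrix $P$ [6] can be used to re-permute the
column order of matrix $C$ as follows.
$$
C \cdot P = \begin{bmatrix}
1 & 1 & a & b & c & e & d & f \\
1 & -1 & b & -a & d & -c & -f & -e \\
1 & -1 & -b & a & e & f & -c & d \\
1 & 1 & -a & -b & f & d & -e & -c \\
1 & 1 & -a & -b & -f & -d & e & c \\
1 & -1 & -b & a & -e & -f & c & -d \\
1 & -1 & b & -a & -d & c & f & e \\
1 & 1 & a & b & -c & -e & -d & -f
\end{bmatrix}
$$
where
$$
P = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}
$$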
After the re-permutation, all rows contain the same
eight coefficients, which can be multiplied by every input
datum. According to the above matrix, the new architecture
of the DCT is shown in Fig. 3.
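The effect of the column re-permutation can be checked numerically with a short Python sketch. It assumes the orthonormal DCT matrix $C(m,k) = \alpha(k)\cos((2m+1)k\pi/16)$, and the helper names are hypothetical. The sketch confirms that in every row of $C \cdot P$ the first two columns, the next two columns, and the last four columns each hold the same set of coefficient magnitudes, so every row can be served by the same look-up tables.

```python
import math

N = 8

def dct_matrix(n=N):
    """Assumed orthonormal DCT matrix: C[m][k] = alpha(k) * cos((2m+1) k pi / (2n))."""
    c = []
    for m in range(n):
        row = []
        for k in range(n):
            alpha = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            row.append(alpha * math.cos((2 * m + 1) * k * math.pi / (2 * n)))
        c.append(row)
    return c

# Row r of P has its single 1 in column PERM[r], read off the matrix P.
PERM = [0, 4, 2, 6, 1, 5, 3, 7]

def permutation_matrix(perm=PERM):
    n = len(perm)
    return [[1 if perm[r] == col else 0 for col in range(n)] for r in range(n)]

def matmul(a, b):
    """Plain matrix product of two square lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def magnitude_groups(row):
    """Sorted coefficient magnitudes for column groups 0-1, 2-3 and 4-7."""
    mags = [abs(v) for v in row]
    return sorted(mags[0:2]) + sorted(mags[2:4]) + sorted(mags[4:8])

C = dct_matrix()
CP = matmul(C, permutation_matrix())
GROUPS = [magnitude_groups(row) for row in CP]
```

Only the signs and the order inside each column group change from row to row, which is what the multiplexers in front of the accumulators have to resolve.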
III. THE OVERALL OPERATIONS 
A. The PLA Look-up Tables 
Because all the DCT coefficients are fixed numbers,
the product patterns of all the input data and coefficients
can be pre-stored in a PLA table. Here we multiply the
input data by a constant to find the product. The input
data are 8 bits wide, so the PLA size for each
coefficient is 256 entries. In practice the PLA size can be
reduced by partitioning the input data into
two parts, the upper four bits and the lower four bits. The upper
four bits are multiplied by the constant coefficient to find
one partial product, which is stored in one PLA table; the
lower four bits are multiplied by the same constant
coefficient to find the other partial product, which is stored in
another PLA table. The total product is equal to the sum
of the two partial products. By this method almost half of
the hardware of the original approach can be saved. The
block diagram of the PLA look-up table is shown in Fig. 2.
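The partitioned look-up can be illustrated with a small Python sketch. It is a software model under stated assumptions, not the PLA circuit: the input is treated as an unsigned 8-bit value, the coefficient is quantized to a hypothetical 10 fractional bits (`FRAC_BITS`), and the function names are invented for illustration.

```python
import math

FRAC_BITS = 10  # hypothetical fixed-point scaling of the coefficient

def quantize(coef, frac_bits=FRAC_BITS):
    """Scale a real coefficient to an integer with frac_bits fractional bits."""
    return round(coef * (1 << frac_bits))

def build_tables(coef_q):
    """Two 16-entry tables: partial products for the upper and lower nibble."""
    upper = [coef_q * (h << 4) for h in range(16)]  # upper four bits, weight 16
    lower = [coef_q * low for low in range(16)]     # lower four bits, weight 1
    return upper, lower

def lookup_product(x, upper, lower):
    """Product of an unsigned 8-bit input and the coefficient via two look-ups."""
    if not 0 <= x <= 255:
        raise ValueError("input must be an unsigned 8-bit value")
    return upper[x >> 4] + lower[x & 0x0F]

COEF_Q = quantize(math.cos(math.pi / 16))
UPPER, LOWER = build_tables(COEF_Q)
```

Because $x = 16 \cdot x_{hi} + x_{lo}$, the sum of the two partial products equals the full product exactly, while the two tables together hold 32 entries instead of 256.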
B. 1-D DCT of the 1st Stage
The block diagram of the DCT architecture is
shown in Fig. 3. It contains multiplexers,
accumulation adders, and auxiliary registers, which are
connected together to implement the summation and
transposition with overlapped row-column operations.
The inputs of each multiplexer come from the outputs of
different PLA look-up tables. The connections between the
multiplexers and the PLA look-up tables follow the permuted
coefficient matrix. Through the multiplexer, the data
from the PLA look-up table are loaded into the accumulation
adder and added to the output of the auxiliary register.
Thereafter, the sum from the accumulation adder is
latched by the auxiliary register, which waits for
the next data from the PLA look-up tables. The summation of
the matrix computation is recursive; the result of the
accumulator is put into the output register (as shown
in Fig. 3) in every cycle of recursion, and the auxiliary
register is reset every N cycles of recursion.
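The recursive accumulation can be modeled behaviorally in Python. This is a simplified sketch under assumptions: one input sample arrives per clock, the multiplexer-selected PLA outputs are replaced by floating-point products with an assumed orthonormal DCT coefficient, and the results are transferred to the output registers once per group of N samples. The class and function names are hypothetical.

```python
import math

N = 8

def dct_coefficient(m, k, n=N):
    """Assumed orthonormal DCT coefficient alpha(k) * cos((2m+1) k pi / (2n))."""
    alpha = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
    return alpha * math.cos((2 * m + 1) * k * math.pi / (2 * n))

class Stage1Model:
    """Behavioral model: N accumulators fed with one input sample per clock."""

    def __init__(self, n=N):
        self.n = n
        self.aux = [0.0] * n   # auxiliary registers holding the running sums
        self.out = [0.0] * n   # output registers
        self.count = 0         # position of the current sample within the group

    def clock(self, x):
        """Accumulate one sample; return True when a 1-D DCT has completed."""
        m = self.count
        for k in range(self.n):
            # Stands in for the product selected by the multiplexer from the PLAs.
            self.aux[k] += dct_coefficient(m, k, self.n) * x
        self.count += 1
        if self.count == self.n:
            self.out = list(self.aux)    # transfer sums to the output registers
            self.aux = [0.0] * self.n    # reset auxiliary registers after N cycles
            self.count = 0
            return True
        return False

def run_column(samples):
    """Feed one column of samples and return the output registers."""
    stage = Stage1Model(len(samples))
    for x in samples:
        stage.clock(x)
    return stage.out
```

Calling `run_column` on N samples returns the same values as evaluating the 1-D DCT sum directly, since each accumulator simply adds one product per clock.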
C. 1-D DCT of the 2nd Stage
The 1-D DCT of the 2nd stage differs from the 1st
stage in the number of registers. The 2nd stage has
seven times more registers than the 1st stage. In
this stage, the products $Y(l_1, m_2)\,C_2(l_2, m_2)$ are
loaded into the accumulation adders one by one in the order
of $m_2$ (column-wise) in each row. A row of
$Y(l_1, m_2)\,C_2(l_2, m_2)$ is added to the N auxiliary
registers. Each column of the resultant matrix $Z(l_1, l_2)$
is stored in the N auxiliary registers respectively.
Then each column in the N auxiliary registers is
downloaded to each of the N output registers. Afterwards,
each column of the N output registers is shifted out
serially in the order of $l_1$ (row-wise), which completes the
transposition and overlapped row-column operations.
Consider the total number of clock cycles for computing
the N × N 2-D DCT: there are (N × 1 + 3) cycles for the
1-D DCT of the 1st stage and (N × N + 3) cycles for the
2nd stage. For the N × N 2-D DCT, the total computation
time is therefore $N^2 + N + 6$ cycles. The methods shown in [2-6]
require $N^2 + 2N$ to $2N^2$ clock cycles. The
total computation time is thus effectively reduced
compared to any of the former and conventional
approaches.
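The cycle count can be worked through for the 8 × 8 case with a few lines of Python. The function names are hypothetical, and the latency helper simply divides the cycle count by a clock rate given in MHz.

```python
def stage1_cycles(n):
    """Cycles for the 1-D DCT of the 1st stage: N x 1 + 3."""
    return n * 1 + 3

def stage2_cycles(n):
    """Cycles for the 1-D DCT of the 2nd stage: N x N + 3."""
    return n * n + 3

def total_cycles(n):
    """Total cycles for the N x N 2-D DCT: N^2 + N + 6."""
    return stage1_cycles(n) + stage2_cycles(n)

def latency_us(n, clock_mhz):
    """Latency in microseconds at the given clock rate in MHz."""
    return total_cycles(n) / clock_mhz

CYCLES_8X8 = total_cycles(8)  # 11 + 67 = 78
```

For N = 8 this gives 78 clock cycles, which at a 50 MHz clock corresponds to 1.56 µs.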
IV. THE DESIGN OF AN 8 X 8 2-D DCT AND THE
SIMULATION RESULT
Because the structure is modular and regular, the
chip design is straightforward. The hierarchical modules
are designed in Verilog HDL, and the logic functions
are simulated with Cadence's Verilog-XL simulator. The
Verilog HDL description is synthesized with Synopsys tools
and the CCL cell library, and the circuit layout is
produced by automatic place and route with Cadence's
Cell Ensemble tool. The look-up tables are
built as single-clock dynamic CMOS PLAs to reduce
power dissipation and data output latency. The precision
of the internal arithmetic is also carefully considered. The
input data are eight bits wide, the intermediate results (after the
1-D DCT) have 12-bit precision, and the final result of the
2-D DCT has 14-bit precision.
The design rule of this chip is 0.8 µm SPDM CMOS, and
the cell library is provided by CCL. The total transistor
count is over 140,000. The synthesized netlists are
simulated in Verilog with the time view models of the cell
library, and the maximum computation speed is found to be
55.6 MHz. The critical path is in the adder and
register stages; a carry look-ahead adder and a
simple latch circuit are used to overcome the difficulties. Fig. 4
shows the simulation result. Table I summarizes the
design characteristics based on the simulation, and the
VLSI layout of the DCT is shown in Fig. 5.
V. CONCLUSION 
A new architecture for the 8 × 8 2-D DCT is presented
which efficiently combines the merits of replacing
multipliers with PLAs, eliminating the transposition
memory, and overlapping row-column operations. The total
computation time is reduced by the overlapped
row-column operation in the 1-D DCT, and the data rate
is increased by the improved accumulator and PLA look-up
tables. This approach is feasible for VLSI
implementation and is suitable for high speed applications
such as HDTV.
The transistor count of the designed circuit is over
140,000. The design was completed by automatic
synthesis and automatic place and route. If the layout were
done as a full-custom design, we expect that the
area of the chip could be reduced
significantly, and the speed could be even higher.
These aspects are left for further research.
VI. REFERENCES
[1] N. Ahmed, T. Natarajan, and K. R. Rao, "Discrete Cosine Transform",
IEEE Trans. Comput., vol. C-23, pp. 90-93, Jan. 1974.
[2] M. T. Sun, L. Wu, and M. L. Liou, "A Concurrent Architecture for
VLSI Implementation of Discrete Cosine Transform", IEEE Trans.
on Circuits and Systems, vol. CAS-34, no. 8, Aug. 1987.
[3] J. C. Carlach, P. Penard, and J. L. Sicre, "TCAD: a 27 MHz 8 x 8 Discrete
Cosine Transform Chip", Proceedings of ICASSP89, pp. 2429-2432, 1989.
[4] U. Sjostrom, I. Defilippis, M. Ansorge, and F. Pellandini, "Discrete
Cosine Transform Chip for Real-Time Video Application", Proc. ISCAS,
1990, pp. 1620-1623.
[5] S. P. Kim and D. K. Pan, "Highly Modular and Concurrent 2-D
DCT Chip", Proceedings of ISCAS, pp. 1081-1084, 1992.
[6] M. H. Sheu, J. Y. Lee, J. F. Wang, A. N. Suen, and L. Y. Liu, "A High
Throughput-Rate Architecture for 8 x 8 2-D DCT", Proceedings of
ICASSP93, pp. 1587-1590, 1993.
[7] G. M. Blair, "PLA Design for Single-Clock CMOS", IEEE JSSC,
vol. 27, no. 8, Aug. 1992.
Fig. 1 Block diagram of 2-D DCT (input data X, 1-D DCT, 1-D DCT, output data Z)
Fig. 2 Block diagram of PLA look-up table
Fig. 3 The new architecture of 1-D DCT
Fig. 4 The result of simulation with time view models of cell library
Fig. 5 The VLSI layout of the 2-D DCT
Table I: Summary of the designed DCT chip
| Design Rule (CMOS) | 0.8 µm SPDM CMOS |
| Cell Library | ITRI CCL's TSMC08 |
| Clock Rate | 50 MHz (time view model simulated) |
| Latency | 78 clock cycles |
| Pipeline | 100% |
| Transistor Count | 147,839 |
| Layout Area | 6337.8 µm × 7715.9 µm |
