New systolic architecture for fast DCT computation by Chang,Yu-Tai et al.
A NEW SYSTOLIC ARCHITECTURE FOR FAST DCT COMPUTATION 
Yu-Tai Chang, Chin-Liang Wang, and Ching-Hsien Chang 
Department of Electrical Engineering, National Tsing Hua University 
Hsinchu, Taiwan 300, R.O.C. 
Email : clwang@ee.nthu.edu.tw 
ABSTRACT 
This paper presents a fast algorithm along with its 
systolic array implementation for computing the 1-D N-point 
discrete cosine transform (DCT), where N is a power 
of two. The architecture requires $\log_2 N$ multipliers and can 
evaluate one complete N-point DCT every N clock cycles. It 
possesses the features of regularity and modularity, and is 
thus well suited to VLSI implementation. As compared to 
existing systolic DCT architectures reaching the same 
throughput performance, the proposed one involves much 
less hardware complexity. 
1. INTRODUCTION 
The discrete cosine transform (DCT) [1] is one of the most 
widely used transforms in the area of digital signal 
processing, because it approaches the statistically optimum 
Karhunen-Loeve transform for highly correlated signals [2]. 
Since the DCT computation is rather time-consuming, 
hardware architectures are often necessary for it to meet the 
real-time requirements. There have been a great number of 
methods proposed for fast DCT computation. Among them, 
the systolic array approach [3] has received great attention. 
This is due to the fact that a systolic system possesses the 
desirable features of regularity, modularity, and concurrency 
for VLSI implementation. There exist a number of systolic 
designs that can yield one complete N-point DCT per N 
cycles (see, for example, [4]-[7]). To our knowledge, each of 
such architectures involves at least N multipliers, and this 
might make the system unsuitable for single-chip 
implementation when long-length transforms are required. In 
this paper, a new systolic array with $\log_2 N$ multipliers is 
proposed for computing the 1-D N-point DCT at a rate of 
one complete transform per N cycles. As compared to 
previous systolic DCT architectures with the same 
throughput performance, the proposed one gains a 
significant improvement in hardware complexity. 
2. A NEW FAST ALGORITHM FOR THE 1-D DCT 
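0-7803-3073-0/96/$5.00 © 1996 IEEE 
Consider the 1-D N-point DCT defined by 
$$z_k = \sqrt{\frac{2}{N}}\,\varepsilon_k \sum_{n=0}^{N-1} y_n \cos\frac{(2n+1)k\pi}{2N}, \qquad k = 0, 1, \ldots, N-1, \tag{1}$$
where $\varepsilon_k = 1/\sqrt{2}$ if $k = 0$ and $\varepsilon_k = 1$ otherwise. In the following, 
N will be assumed to be a power of two and the scaling 
factor $\sqrt{2/N}\,\varepsilon_k$ will be neglected for convenience. Let $x_n = y_{2n}$ 
and $x_{N-n-1} = y_{2n+1}$, where $n = 0, 1, 2, \ldots, N/2-1$; then the 
1-D DCT computation becomes 
$$z_k = \sum_{n=0}^{N-1} x_n \cos\frac{(4n+1)k\pi}{2N}, \qquad k = 0, 1, \ldots, N-1. \tag{2}$$
Furthermore, with the notation $\phi_n^{(N)} = (4n+1)\pi/2N$, we can 
express (2) in matrix-vector form as follows: 
$$Z = T(N)X, \tag{3}$$
where $Z = [z_0\ z_1\ \cdots\ z_{N-1}]^T$, $X = [x_0\ x_1\ \cdots\ x_{N-1}]^T$, and 
T(N) is the N×N matrix whose (k, n)-th element is $\cos k\phi_n^{(N)}$, 
with $k, n = 0, 1, \ldots, N-1$. 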
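The input reordering that turns (1) into (2) can be checked numerically. The following is a minimal sketch in plain Python, assuming the unscaled form of the transform (scaling factor dropped, as in the text); the function names `dct_direct`, `reorder`, and `dct_reordered` are illustrative and do not come from the paper.

```python
import math

def dct_direct(y):
    """Unscaled DCT of (1): z_k = sum_n y_n * cos((2n + 1) k pi / 2N)."""
    N = len(y)
    return [sum(y[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                for n in range(N))
            for k in range(N)]

def reorder(y):
    """Input permutation: x_n = y_{2n} and x_{N-1-n} = y_{2n+1}."""
    N = len(y)
    x = [0.0] * N
    for n in range(N // 2):
        x[n] = y[2 * n]
        x[N - 1 - n] = y[2 * n + 1]
    return x

def dct_reordered(x):
    """Form (2): z_k = sum_n x_n * cos(k * phi_n), phi_n = (4n + 1) pi / 2N."""
    N = len(x)
    return [sum(x[n] * math.cos(k * (4 * n + 1) * math.pi / (2 * N))
                for n in range(N))
            for k in range(N)]

# Arbitrary test vector (hypothetical data, N = 8).
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x = reorder(y)
max_err = max(abs(a - b) for a, b in zip(dct_direct(y), dct_reordered(x)))
print(x, max_err)
```

Both forms should agree to rounding error, because $\cos[(4N-4m-1)k\pi/2N] = \cos[(4m+1)k\pi/2N]$ for integer k.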
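It is easy to check that $\cos k\phi_m^{(N)} = (-1)^k \cos k\phi_{m+N/2}^{(N)}$, where 
$m = 0, 1, 2, \ldots, N/2-1$. With this property, Hou [8] showed 
that we can partition the transform matrix into four 
quadrants by shifting all the even-numbered rows of T(N) to 
the upper half portion. Mathematically, this can be described 
by the following equation: 
$$Q(N)Z = Q(N)T(N)X = \begin{bmatrix} E(N/2) & E(N/2) \\ D(N/2) & -D(N/2) \end{bmatrix} X, \tag{4}$$
where $Q(N) = [e_0\ e_2\ e_4\ \cdots\ e_{N-2}\ e_1\ e_3\ e_5\ \cdots\ e_{N-1}]^T$ is a permutation 
matrix with $e_n$ being a unit N×1 column vector whose 
(n+1)-th element is 1, and E(N/2) and D(N/2) are given by 
$$E(N/2) = \left[\cos 2k\phi_m^{(N)}\right], \qquad k, m = 0, 1, \ldots, N/2-1, \tag{5}$$
$$D(N/2) = \left[\cos (2k+1)\phi_m^{(N)}\right], \qquad k, m = 0, 1, \ldots, N/2-1. \tag{6}$$
As indicated in [8], the matrices E(N/2) and D(N/2) have the 
following relationship: 
$$D(N/2) = L(N/2)E(N/2)F(N/2), \tag{7}$$
where L(N/2) is a lower triangular matrix of size 
(N/2)×(N/2) given as 
$$L(N/2) = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ -1 & 2 & 0 & \cdots & 0 \\ 1 & -2 & 2 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ -1 & 2 & -2 & \cdots & 2 \end{bmatrix} \tag{8}$$
and F(N/2) is a diagonal matrix given as 
$$F(N/2) = \mathrm{diag}\left(\cos\phi_0^{(N)},\ \cos\phi_1^{(N)},\ \ldots,\ \cos\phi_{N/2-1}^{(N)}\right). \tag{9}$$
Substituting (7) into (4), we have 
$$Q(N)Z = \begin{bmatrix} I_{N/2} & 0 \\ 0 & L(N/2) \end{bmatrix} \begin{bmatrix} E(N/2) & 0 \\ 0 & E(N/2) \end{bmatrix} \begin{bmatrix} I_{N/2} & I_{N/2} \\ F(N/2) & -F(N/2) \end{bmatrix} X = P(N) \begin{bmatrix} E(N/2) & 0 \\ 0 & E(N/2) \end{bmatrix} W(N)X, \tag{10}$$
where P(N) and W(N) denote the first and the third matrix 
factors, respectively. 
Since $\cos 2k\phi_n^{(N)} = \cos k\phi_n^{(N/2)}$, we can see from (3) and (5) 
that T(N/2) = E(N/2). Thus, (10) can be rewritten as 
$$Q(N)Z = P(N) \begin{bmatrix} T(N/2) & 0 \\ 0 & T(N/2) \end{bmatrix} W(N)X. \tag{11}$$
To facilitate derivation of the algorithm, we define 
$T_{N/M}(M)$ as the direct sum [9] of N/M T(M)'s of size 
M×M, i.e., 
$$T_{N/M}(M) = T(M) \oplus T(M) \oplus \cdots \oplus T(M) = \begin{bmatrix} T(M) & 0 & \cdots & 0 \\ 0 & T(M) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & T(M) \end{bmatrix}. \tag{12}$$
With similar direct-sum definitions for matrices $Q_{N/M}(M)$, 
$P_{N/M}(M)$, and $W_{N/M}(M)$, we can rewrite (11) as follows: 
$$Q_1(N)Z = P_1(N)T_2(N/2)W_1(N)X. \tag{13}$$
Multiplying both sides of this equation by $P_1^{-1}(N)$ yields 
$$[P_1^{-1}(N)Q_1(N)Z] = T_2(N/2)[W_1(N)X]. \tag{14}$$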
Note that (14) can be regarded as a transform with transform 
matrix $T_2(N/2)$, input vector $W_1(N)X$, and output vector 
$P_1^{-1}(N)Q_1(N)Z$. Since $T_2(N/2) = T(N/2) \oplus T(N/2)$, the original 
N-point transform can be decomposed into two N/2-point 
transforms with transform matrix T(N/2) each. 
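The one-level decomposition can be verified numerically for a small N. The sketch below builds Q(N), P(N), W(N), and T(N) in plain Python and compares Q(N)T(N) with P(N)[T(N/2) ⊕ T(N/2)]W(N). It assumes P(N) = I ⊕ L(N/2) and W(N) = [I I; F −F], which is what substituting (7) into (4) gives; relation (7) itself rests on the identity cos(2k+1)φ = 2 cos φ cos 2kφ − cos(2k−1)φ. All helper names are illustrative, not taken from the paper.

```python
import math

def phi(n, N):
    """phi_n for transform length N: (4n + 1) * pi / (2N)."""
    return (4 * n + 1) * math.pi / (2 * N)

def T(N):
    """Transform matrix of (3): element (k, n) is cos(k * phi_n)."""
    return [[math.cos(k * phi(n, N)) for n in range(N)] for k in range(N)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def direct_sum(A, B):
    """Block-diagonal matrix with A upper left and B lower right."""
    top = [row + [0.0] * len(B[0]) for row in A]
    bottom = [[0.0] * len(A[0]) + row for row in B]
    return top + bottom

def identity(h):
    return [[1.0 if r == c else 0.0 for c in range(h)] for r in range(h)]

def L(h):
    """Lower triangular matrix of (8): rows 1; -1 2; 1 -2 2; ..."""
    M = [[0.0] * h for _ in range(h)]
    for k in range(h):
        M[k][0] = float((-1) ** k)
        for j in range(1, k + 1):
            M[k][j] = 2.0 * (-1) ** (k - j)
    return M

def P(N):
    """Assumed form: identity on the upper half, L(N/2) on the lower half."""
    return direct_sum(identity(N // 2), L(N // 2))

def W(N):
    """Assumed form: [[I, I], [F, -F]] with F = diag(cos(phi_n))."""
    h = N // 2
    M = [[0.0] * N for _ in range(N)]
    for n in range(h):
        c = math.cos(phi(n, N))
        M[n][n], M[n][n + h] = 1.0, 1.0
        M[n + h][n], M[n + h][n + h] = c, -c
    return M

def Q(N):
    """Permutation that moves even-numbered rows to the upper half."""
    order = list(range(0, N, 2)) + list(range(1, N, 2))
    return [[1.0 if c == r else 0.0 for c in range(N)] for r in order]

def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

N = 8
half = T(N // 2)
lhs = matmul(Q(N), T(N))
rhs = matmul(matmul(P(N), direct_sum(half, half)), W(N))
max_err = max_abs_diff(lhs, rhs)
print(max_err)
```

Under these assumptions the two sides should match to rounding error for any power-of-two N; N = 8 is used only to keep the example small.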
Using the permutation matrix $Q(N/2) = [e_0\ e_2\ e_4\ \cdots\ e_{N/2-2}\ e_1\ e_3\ e_5\ \cdots\ e_{N/2-1}]^T$ 
for each N/2-point transform (or the 
permutation matrix Q2(N/2)=Q(N/2)@Q(N/2) for the whole 
transform given in (14)) and following (4)-(14), we can 
further partition each N/2-point transform into two N/4-point 
transforms with transform matrix T(N/4) each. The results 
are summarized as follows: 
$$Q_2(N/2)[P_1^{-1}(N)Q_1(N)Z] = Q_2(N/2)T_2(N/2)[W_1(N)X] = P_2(N/2)T_4(N/4)W_2(N/2)[W_1(N)X], \tag{15}$$
$$[P_2^{-1}(N/2)Q_2(N/2)P_1^{-1}(N)Q_1(N)Z] = T_4(N/4)[W_2(N/2)W_1(N)X], \tag{16}$$
where $P_2(N/2) = P(N/2) \oplus P(N/2)$ and $W_2(N/2) = W(N/2) \oplus 
W(N/2)$. Repeating such a decomposition process until 
$T_N(1) = I_N$ (an identity matrix) appears, we can derive 
$$Z = Q_1^{-1}(N)P_1(N)Q_2^{-1}(N/2)P_2(N/2) \cdots Q_{N/2}^{-1}(2)P_{N/2}(2)W_{N/2}(2)W_{N/4}(4) \cdots W_2(N/2)W_1(N)X. \tag{17}$$
This equation means that we can compute the 1-D DCT by 
performing a series of matrix-vector multiplications. 
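The factorization in (17) can be read recursively: apply W, transform each half, apply P to the lower half, then undo the permutation Q. The sketch below follows that reading in plain Python and compares the result with a direct evaluation. It is an illustration under stated assumptions, not the authors' implementation: the scaling factor is dropped, P is applied through the recurrence d_0 = g_0, d_k = 2g_k − d_{k−1} (equivalent to multiplying by the lower triangular L), and the names `fast_dct`, `reorder`, and `dct_direct` are hypothetical.

```python
import math

def dct_direct(y):
    """Unscaled reference DCT: z_k = sum_n y_n * cos((2n + 1) k pi / 2N)."""
    N = len(y)
    return [sum(y[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                for n in range(N))
            for k in range(N)]

def reorder(y):
    """Input permutation: x_n = y_{2n} and x_{N-1-n} = y_{2n+1}."""
    N = len(y)
    x = [0.0] * N
    for n in range(N // 2):
        x[n] = y[2 * n]
        x[N - 1 - n] = y[2 * n + 1]
    return x

def fast_dct(x):
    """Return (T(N) x, number of cosine multiplications) for reordered x."""
    N = len(x)
    if N == 1:
        return list(x), 0            # T(1) is the 1x1 identity
    h = N // 2
    # W stage: sums in the upper half, cosine-weighted differences below.
    u = [x[n] + x[n + h] for n in range(h)]
    v = [math.cos((4 * n + 1) * math.pi / (2 * N)) * (x[n] - x[n + h])
         for n in range(h)]
    # Two half-length transforms, T(N/2) applied to each half.
    e, mults_e = fast_dct(u)
    g, mults_g = fast_dct(v)
    # P stage: multiply the lower half by L via d_k = 2 g_k - d_{k-1}.
    d = [g[0]]
    for k in range(1, h):
        d.append(2.0 * g[k] - d[-1])
    # Inverse permutation: even-indexed outputs from e, odd-indexed from d.
    z = [0.0] * N
    z[0::2] = e
    z[1::2] = d
    return z, h + mults_e + mults_g

# Arbitrary test vector (hypothetical data, N = 8).
y = [3.0, -1.0, 4.0, 1.0, -5.0, 9.0, 2.0, -6.0]
z_fast, mults = fast_dct(reorder(y))
max_err = max(abs(a - b) for a, b in zip(z_fast, dct_direct(y)))
print(mults, max_err)
```

In this sketch each of the $\log_2 N$ levels performs N/2 cosine multiplications per transform, i.e. $(N/2)\log_2 N$ in total (12 for N = 8), which fits a design with one multiplier per W stage when one transform is produced every N cycles.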
3. SYSTOLIC IMPLEMENTATION OF THE FAST 
1-D DCT ALGORITHM 
Fig. 1 depicts a block diagram to evaluate the 1-D N-point 
DCT based on (17). It consists of three types of basic blocks 
for matrix-vector multiplications with coefficient matrices 
$W_{N/M}(M)$, $P_{N/M}(M)$, and $Q_{N/M}^{-1}(M)$, respectively, where 
$W_1(N) = W(N)$ and $P_1(N) = P(N)$. All the basic blocks are 
realized in systolic form and the corresponding circuits are 
shown in Figs. 2-4. The circuit of Fig. 2 performs the 
matrix-vector multiplication with $W_{N/M}(M)$; when the input 
sequence is $\{x_0, x_1, \ldots, x_{M/2-1}, x_{M/2}, x_{M/2+1}, \ldots, x_{M-1}\}$, the output 
sequence is $\{x_0 + x_{M/2},\ x_1 + x_{M/2+1},\ \ldots,\ x_{M/2-1} + x_{M-1},\ \cos\phi_0^{(M)}(x_0 - x_{M/2}),$ 
$\cos\phi_1^{(M)}(x_1 - x_{M/2+1}),\ \ldots,\ \cos\phi_{M/2-1}^{(M)}(x_{M/2-1} - x_{M-1})\}$. The 
circuit of Fig. 3 performs the matrix-vector multiplication 
with $P_{N/M}(M)$, and the circuit of Fig. 4 performs the matrix-vector 
multiplication with $Q_{N/M}^{-1}(M)$. With little effort, one 
can check that when the input sequence of the latter circuit is 
$\{z_0, z_2, z_4, \ldots, z_{M-2}, z_1, z_3, z_5, \ldots, z_{M-1}\}$, the corresponding 
output sequence is $\{z_0, z_1, z_2, z_3, \ldots, z_{M-2}, z_{M-1}\}$. We can also 
see from the above that the complete architecture consumes 
$\log_2 N$ multipliers (neglecting those by 2) and its throughput 
is one complete N-point DCT per N clock cycles. 
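The input and output orderings of the three basic blocks can be mimicked with simple stream functions. The following is a behavioral sketch in plain Python, not a cycle-accurate model of the circuits in Figs. 2-4. The function names are hypothetical, and the recurrence used in the P stage (d_0 = g_0, d_k = 2g_k − d_{k−1}) is an assumed way of applying the lower triangular matrix that need not match the internal structure of the actual circuit.

```python
import math
from collections import deque

def w_stage_stream(samples, M):
    """Apply W(M) to consecutive blocks of M samples, in stream order."""
    h = M // 2
    coeffs = [math.cos((4 * n + 1) * math.pi / (2 * M)) for n in range(h)]
    out = []
    for start in range(0, len(samples), M):
        block = samples[start:start + M]
        delay_line = deque(block[:h])     # first half waits M/2 cycles
        products = []
        for n, late in enumerate(block[h:]):
            early = delay_line.popleft()
            out.append(early + late)      # sums leave first
            products.append(coeffs[n] * (early - late))
        out.extend(products)              # weighted differences follow
    return out

def p_stage_stream(samples, M):
    """Apply P(M): pass the upper half, run the recurrence on the lower half."""
    h = M // 2
    out = []
    for start in range(0, len(samples), M):
        block = samples[start:start + M]
        out.extend(block[:h])
        prev = None
        for k, g in enumerate(block[h:]):
            prev = g if k == 0 else 2 * g - prev
            out.append(prev)
    return out

def q_inv_stream(samples, M):
    """Apply the inverse permutation: interleave even and odd halves."""
    h = M // 2
    out = []
    for start in range(0, len(samples), M):
        block = samples[start:start + M]
        for even, odd in zip(block[:h], block[h:]):
            out.extend([even, odd])
    return out

w_out = w_stage_stream([1.0, 2.0, 3.0, 4.0], 4)
p_out = p_stage_stream([0, 0, 0, 0, 1, 2, 3, 4], 8)
q_out = q_inv_stream(["z0", "z2", "z1", "z3"], 4)
print(w_out, p_out, q_out)
```

Each block of M samples needs only M/2 cosine products in the W stage, so a single multiplier per stage can keep pace with one sample per clock cycle in this model.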
Fig. 5 shows a chip layout of the proposed architecture 
for the 128-point DCT, where the data wordlength is 
16 bits. It is designed based on a standard cell library for 0.8 
µm CMOS technology. The chip requires a die size of about 
8.162 × 7.592 mm² (containing 268,000 transistors) and is 
able to operate at a clock rate of up to 20 MHz. Such area-time 
performance suggests that the proposed architecture is 
attractive for use in applications where large transforms are 
required. 
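The figures quoted for the 128-point chip translate into a transform rate by simple arithmetic. The short sketch below assumes one input sample per clock cycle at the maximum quoted clock rate, and takes the multiplier count of $\log_2 N$ from the architecture description; the variable names are illustrative.

```python
N = 128                      # transform length of the chip
clock_hz = 20_000_000        # maximum quoted clock rate (20 MHz)

# One multiplier per W stage, log2(N) stages in total.
multipliers = N.bit_length() - 1

# One complete N-point DCT is produced every N clock cycles.
transforms_per_second = clock_hz / N

print(multipliers, transforms_per_second)
```

Under these assumptions the chip uses 7 multipliers, against at least 128 for the earlier designs with the same throughput.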
4. CONCLUSION 
A new fast algorithm along with its systolic VLSI 
implementation has been presented for the 1-D N-point DCT 
computation. As compared to existing related architectures 
in [4]-[7], the proposed one reaches the same throughput but 
reduces the number of multipliers from O(N) to $O(\log_2 N)$. 
ACKNOWLEDGMENT 
This work was supported by the National Science Council of 
the Republic of China under Grant NSC 84-2215-E007-046. 
REFERENCES 
[1] N. Ahmed, T. Natarajan, and K. R. Rao, “Discrete 
    cosine transform,” IEEE Trans. Comput., vol. C-23, pp. 
    90-93, Jan. 1974. 
[2] R. J. Clarke, “Relation between the Karhunen-Loeve and 
    cosine transform,” IEE Proc., pt. F, vol. 128, pp. 359- 
    360, 1981. 
[3] H. T. Kung, “Why systolic architectures?,” Computer, 
    vol. 15, pp. 35-46, Jan. 1982. 
[4] U. Totzek and F. Matthiesen, “Two-dimensional 
    discrete cosine transform with linear arrays,” in Proc. 
    Int. Conf. Systolic Arrays (Systolic Array Processors), 
    Killarney, Co. Kerry, Ireland, 1989, pp. 388-397. 
[5] N. I. Cho and S. U. Lee, “DCT algorithms for VLSI 
    parallel implementations,” IEEE Trans. Acoust., 
    Speech, Signal Processing, vol. 38, pp. 121-127, Jan. 
    1990. 
[6] L.-W. Chang and M.-C. Wu, “A unified systolic array 
    for discrete cosine and sine transforms,” IEEE Trans. 
    Signal Processing, vol. 39, pp. 192-194, Jan. 1991. 
[7] C.-L. Wang and C.-Y. Chen, “High-throughput VLSI 
    architectures for the 1-D and 2-D discrete cosine 
    transforms,” IEEE Trans. Circuits Systems Video 
    Technol., vol. 5, pp. 31-40, Feb. 1995. 
[8] H. S. Hou, “A fast recursive algorithm for computing 
    the discrete cosine transform,” IEEE Trans. Acoust., 
    Speech, Signal Processing, vol. ASSP-35, pp. 1455- 
    1461, Oct. 1987. 
[9] S. H. Friedberg, A. J. Insel, and L. E. Spence, Linear 
    Algebra, 2nd Edition. Englewood Cliffs, NJ: Prentice- 
    Hall, 1989. 
Fig. 1. A systolic architecture for computing the 1-D N-point DCT. 
Fig. 2. A circuit for the matrix-vector multiplication with 
matrix $W_{N/M}(M)$ in Fig. 1. 
Fig. 3. A circuit for the matrix-vector multiplication 
with matrix $P_{N/M}(M)$ in Fig. 1. 
Fig. 4. A circuit for the matrix-vector multiplication with matrix $Q_{N/M}^{-1}(M)$ in Fig. 1. 
Cell function: if c = 1, then y ← z; else z' ← z and y ← "high impedance state". 
Fig. 5. A CMOS layout of the proposed 1-D 128-point DCT architecture. 
