This paper presents a fast algorithm, along with its systolic array implementation, for computing the 1-D N-point discrete cosine transform (DCT), where N is a power of two. The architecture requires $\log_2 N$ multipliers and can evaluate one complete N-point DCT every N clock cycles. It is regular and modular, and is thus well suited to VLSI implementation. Compared with existing systolic DCT architectures that reach the same throughput, the proposed one has much lower hardware complexity.
INTRODUCTION
The discrete cosine transform (DCT) [1] is one of the most widely used transforms in digital signal processing, because it approaches the statistically optimal Karhunen-Loeve transform for highly correlated signals [2]. Since DCT computation is rather time-consuming, dedicated hardware architectures are often necessary to meet real-time requirements. A great number of methods have been proposed for fast DCT computation. Among them, the systolic array approach [3] has received great attention, because a systolic system possesses the regularity, modularity, and concurrency that are desirable for VLSI implementation. A number of systolic designs can yield one complete N-point DCT every N cycles (see, for example, [4]-[7]). To our knowledge, each such architecture involves at least N multipliers, which may make the system unsuitable for single-chip implementation when long transforms are required. In this paper, a new systolic array with $\log_2 N$ multipliers is proposed for computing the 1-D N-point DCT at a rate of one complete transform per N cycles. Compared with previous systolic DCT architectures of the same throughput, the proposed one achieves a significant reduction in hardware complexity.
A NEW FAST ALGORITHM FOR THE 1-D DCT

Consider
… $z = [z_0\ z_1\ \cdots]^T$. It is easy to check that $\cos k\phi_m^{N} = (-1)^k \cos k\phi_{m+N/2}^{N}$, where m = 0, 1, 2, ..., N/2-1. With this property, Hou [8] showed that the transform matrix can be partitioned into four quadrants by moving all the even-numbered rows of T(N) to the upper half. Mathematically, this can be described by the following equation:
where $Q(N) = [e_0\ e_2\ e_4\ \cdots\ e_{N-2}\ e_1\ e_3\ e_5\ \cdots\ e_{N-1}]^T$ is a permutation matrix, with $e_n$ being the N × 1 unit column vector whose (n+1)-th element is 1, and E(N/2) and D(N/2) are given by
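To facilitate derivation of the algorithm, we define $T_{N,M}(M)$ as the direct sum [9]
$$T_{N,M}(M) = \begin{bmatrix} T(M) & 0 & \cdots & 0 \\ 0 & T(M) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & T(M) \end{bmatrix}$$
As indicated in [8], the matrices E(N/2) and D(N/2) have the following relationship: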
where L(N/2) is a lower triangular matrix of size (N/2) × (N/2) given as
and F(N/2) is a diagonal matrix given as
Substituting (7) into (4), we have
Since $\cos 2k\phi_m^{N} = \cos k\phi_m^{N/2}$, we can see from (3) and (5) that T(N/2) = E(N/2). Thus, (10) can be rewritten as
With similar direct-sum definitions for the matrices $Q_{N,M}(M)$, $P_{N,M}(M)$, and $W_{N,M}(M)$, we can rewrite (11) as follows:
Multiplying both sides of this equation by $P^{-1}(N)$ yields
Note that (14) … one can check that when the input sequence of the latter circuit is $\{z_0, z_2, z_4, \ldots, z_{M-2}, z_1, z_3, z_5, \ldots, z_{M-1}\}$, the corresponding output sequence is $\{z_0, z_1, z_2, z_3, \ldots, z_{M-2}, z_{M-1}\}$.
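The reordering performed by this circuit can be illustrated in software. The following is a minimal sketch of the input-output behavior only; it does not model the circuit's timing or structure, and the function name `to_natural_order` is hypothetical, introduced here for illustration.

```python
def to_natural_order(seq):
    """Reorder [z0, z2, ..., z_{M-2}, z1, z3, ..., z_{M-1}] into [z0, z1, ..., z_{M-1}].

    Behavioral model only: the first half of the input holds the
    even-indexed samples and the second half the odd-indexed ones.
    """
    m = len(seq)
    if m % 2:
        raise ValueError("sequence length M must be even")
    half = m // 2
    out = [None] * m
    out[0::2] = seq[:half]   # even-indexed outputs come from the first half
    out[1::2] = seq[half:]   # odd-indexed outputs come from the second half
    return out


labels = ["z0", "z2", "z4", "z6", "z1", "z3", "z5", "z7"]
print(to_natural_order(labels))
# prints ['z0', 'z1', 'z2', 'z3', 'z4', 'z5', 'z6', 'z7']
```

For M = 8 the sketch interleaves the two halves of the input, which matches the input and output sequences stated above.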
We can also see from the above that the complete architecture uses $\log_2 N$ multipliers (neglecting multiplications by 2) and that its throughput is one complete N-point DCT per N clock cycles.
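The even/odd row partition on which the algorithm rests can be checked numerically. The sketch below assumes that the transform matrix has entries $\cos k\phi_m$ with the angle $\phi_m = (4m+1)\pi/(2N)$, as in Hou's input-reordered formulation; that definition is an assumption here, and the helper names are hypothetical. Under it, moving the even-numbered rows to the upper half should give two equal upper quadrants, both equal to T(N/2), and two lower quadrants that differ only in sign.

```python
import math


def transform_matrix(n):
    """T(n)[k][m] = cos(k * phi_m), with the ASSUMED angle phi_m = (4m + 1) * pi / (2n)."""
    return [[math.cos(k * (4 * m + 1) * math.pi / (2 * n)) for m in range(n)]
            for k in range(n)]


def even_rows_first(t):
    """Move the even-numbered rows to the upper half and the odd-numbered rows to the lower half."""
    return t[0::2] + t[1::2]


def quadrants(t):
    """Split a square matrix into (upper-left, upper-right, lower-left, lower-right)."""
    h = len(t) // 2
    top, bottom = t[:h], t[h:]
    return ([row[:h] for row in top], [row[h:] for row in top],
            [row[:h] for row in bottom], [row[h:] for row in bottom])


def negate(a):
    """Return the element-wise negation of a matrix."""
    return [[-x for x in row] for row in a]


def close(a, b, tol=1e-9):
    """Element-wise comparison of two equally sized matrices within a tolerance."""
    return all(abs(x - y) <= tol for ra, rb in zip(a, b) for x, y in zip(ra, rb))


N = 8
upper_left, upper_right, lower_left, lower_right = quadrants(
    even_rows_first(transform_matrix(N)))

print(close(upper_left, upper_right))               # the two upper quadrants agree
print(close(lower_left, negate(lower_right)))       # the lower quadrants differ in sign
print(close(upper_left, transform_matrix(N // 2)))  # upper quadrant equals T(N/2)
```

The third check is what allows the same partition to be applied again to the half-size transform, which gives the recursion its $\log_2 N$ stages.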
