Abstract-In this paper, efficient memory-based VLSI arrays and the accompanying new design approach for the discrete Fourier transform (DFT) and discrete cosine transform (DCT) are presented. The DFT and DCT are formulated as cyclic convolution forms and mapped into linear arrays which are characterized by small numbers of I/O channels and low I/O bandwidth. Since multipliers consume much hardware area, the presented designs utilize small ROM's and adders to implement the multiplications, based on good data arrangements that exploit the number properties of the transform kernels. Moreover, the ROM size can be reduced effectively by arranging the data in our designs appropriately. Typically, to perform 1-D N-point DFT and DCT, the arrays need only $N \times 2^{L}$ words of ROM, where L is the wordlength of the input data. Compared to the conventional distributed arithmetic architectures, which require $N \times 2^{N}$ words of ROM, much memory can be saved if N is greater than L, which occurs in most DFT applications. To summarize, the presented arrays outperform others in the architectural topology (local and regular connection), computing speeds, hardware complexity, the number of I/O channels, and I/O bandwidth. They take the advantages of both systolic arrays and memory-based architectures.
Manuscript received February 18, 1992; revised July 24, 1992. This work was supported by the National Science Council of Taiwan under Contract NSC81-0404-E009-134, and by the Telecommunication Laboratory of the Ministry of Communication. This paper was recommended by Associate Editor I. Shirakawa.
J. Guo and C. Jen are with the Department of Electronics Engineering and Institute of Electronics, National Chiao Tung University, Hsinchu, Taiwan, ROC.
C. Liu is with the Department of Computer Science and Information Engineering, National Chiao Tung University, Hsinchu, Taiwan, ROC.
IEEE Log Number 9205006.
I. INTRODUCTION
THE DISCRETE Fourier transform (DFT) and discrete cosine transform (DCT) are key functions widely used in many significant image and signal processing applications. Because of their high computational complexity, efficient algorithms suitable for VLSI are indispensable in many real-time applications. In the literature, a variety of algorithms have been proposed for computing the DFT and DCT. Since each algorithm has its own specific properties and application field, not all the algorithms are well suited for VLSI implementation. The efficiency of an algorithm to be implemented in VLSI depends more on the communication complexity required among arithmetic elements than on the number of computations. Hence, as observed by many researchers [1]-[8], fast Fourier transform (FFT) like algorithms, which have been used extensively for their low numbers of multiplications, are not well suited for VLSI implementation. Systolic arrays [1], [2] can meet the increasing requirements of processing speeds and are well suited for VLSI implementation. They attain high processing speeds through parallel and pipeline processing, and make the VLSI implementation feasible through modularity, structural regularity, and local interconnection. The reader is referred to [3] for the motivations for systolic array architectures over others. When systolic arrays are used as the design vehicle, the speed and implementation benefits become pronounced only if a large number of low-cost processing elements (PE's) can be implemented in a VLSI chip. In the existing systolic arrays for DFT and DCT, multipliers are the fundamental computing elements in PE's. Since multipliers consume a large silicon area, the limited chip size severely limits the allowable number of PE's. Passively, such arrays [2]-[6], [8]-[14] must wait for advances in VLSI technology such as wafer-scale integration to make their benefits visible. Constructively, systolic arrays and the encapsulated algorithms should be developed to simplify the structure and complexity of PE's. Based on this point, this paper presents a new approach to design the VLSI arrays for DFT and DCT. Since this approach derives algorithms based on the data permutations introduced in [8], [14], the designed arrays possess better performance in computing parallelism, computational complexity, and I/O cost than the designs in [2]-[5], [9]-[11], as analyzed in [8], [14]. Also, this approach considers the efficiency of hardware implementation and provides an efficient way to replace multipliers by small ROM's such that the designed arrays can attain high computing speeds at the expense of a small silicon area.
Owing to the regular and compact structure of ROM's, methods to replace multipliers by ROM's have been studied by numerous researchers (see the references in [16]). Among them, distributed arithmetic (DA) has been successfully applied to implement a 16 x 16 DCT in a single chip [20] and widely adopted for commercial products [21]-[29]. DA is a technique that computes the multiplications involved in an inner-product by a series of memory access and accumulation operations. If the vector length and the wordlength of input data are assumed to be N and L, respectively, DA typically performs an inner-product by a ROM with size equal to $2^{N}$ words in L steps in a bit-serial manner. When DA is applied to the DCT or DFT, the combination of bit-serial and bit-parallel operations makes the implemented chip require a large number of shift registers or buffers [20]-[29]. These architectures [19]-[29] also suffer from the imbalance between the wordlength L and the vector length N. Based on a total ROM size of $N \times 2^{N}$ words, the number of operation steps required to perform a 1-D N-point DCT or DFT is determined by the larger value of N and L. In this paper, a new technique is presented to efficiently replace the multipliers by ROM's for DFT and DCT. This technique leads to an architecture which attains structural regularity and modularity among PE's like systolic arrays. The operations of PE's are performed simply by ROM's and adders, in the style of DA. The total ROM size and the number of operation steps for the presented architectures to perform 1-D N-point DCT and DFT are about $N \times 2^{L}$ words and N, respectively. If N is greater than L, which occurs in most DFT problems, the presented designs require lower hardware cost in ROM's than the DA architectures do. Moreover, the presented architectures operate in a bit-parallel manner, unlike the DA architectures. Thus, they are free from the large number of shift registers or buffers. To sum up, the new approach presented in this paper can be used to design VLSI architectures for the DFT and DCT which take the advantages of both systolic arrays and memory-based architectures.
1057-7130/92$03.00 © 1992 IEEE
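The ROM figures quoted above can be tabulated with a short Python sketch. It uses only the two approximate word counts stated in the text ($N \times 2^{N}$ for DA and $N \times 2^{L}$ for the presented arrays); the function names are hypothetical and chosen here for illustration.

```python
def da_rom_words(n: int) -> int:
    """ROM words for a DA design, using the N x 2^N figure quoted in the text."""
    return n * 2 ** n


def array_rom_words(n: int, wordlength: int) -> int:
    """ROM words for the presented arrays, using the approximate N x 2^L figure."""
    return n * 2 ** wordlength


if __name__ == "__main__":
    L = 8  # input wordlength in bits (example value)
    for n in (5, 8, 11, 17):
        print(f"N={n:2d}  DA={da_rom_words(n):8d}  array={array_rom_words(n, L):6d}")
```

Under these two formulas the designs break even at N = L, and the DA figure grows exponentially in N beyond that point.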
To show the essentials of this approach concretely, its characteristics are discussed in the following. In the first place, to attain high computing parallelism, low computational complexity, and the attractive feature of linear arrays that the I/O bandwidth as well as the number of I/O channels can be kept independent of the array length, the DFT and DCT are formulated as cyclic convolution forms [8], [14]. That is, the structure of Galois Field is utilized to permute the input and output data such that the DFT and DCT are formulated as cyclic convolution forms and mapped into linear arrays. Hence the designed arrays possess outstanding performance in computing parallelism, computational complexity, the number of I/O channels, and I/O bandwidth. As has been discussed in [15], the high I/O bandwidth required for most systolic arrays would limit the computing speeds. Therefore, reducing the I/O bandwidth also enhances the computing speeds. Secondly, we modify the cyclic convolution forms in order to replace multipliers by ROM's efficiently. Fig. 1 illustrates the motivation of this modification. If a multiplier with two time-variant operands a and b is directly replaced by a ROM as shown in Fig. 1(a), the required ROM size equals $2^{2L}$ words, which is too large to be practical in hardware realizations, where L is the input data wordlength. Based on the modified forms, one of the operands in the multiplier can be fixed. Therefore, one multiplier can be replaced by $2^{L}$ words of ROM as shown in Fig. 1(b). Moreover, as has been discussed in [20], a technique named partial sums can be used to reduce the memory size in DA architectures. This technique can also be applied to the presented designs to reduce the ROM size from $2^{L}$ words to $2^{L/2+1}$ words as shown in Fig. 1(c). After using these two techniques illustrated in Fig. 1(b) and (c), a multiplier can be efficiently replaced by ROM's with size equal to $2^{L/2+1}$ words and an adder. Furthermore, owing to the small ROM size, short ROM access time can be attained to benefit the computing speeds. Considering for example the computation of a 17-point DFT of real inputs, only about 1K words of ROM are needed if the input wordlength is 8 bits. The rest of this paper is organized as follows. Section II presents the new systolic algorithm for 1-D DFT and DCT. Section III illustrates the hardware realizations of the presented algorithm. Section IV gives a conclusion.
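The replacement of a fixed-operand multiplier by two small ROM's and an adder, as described for Fig. 1(b) and (c), can be illustrated by the following Python sketch. It is a minimal model under assumptions made here for illustration: unsigned inputs, an even wordlength L, and an integer coefficient so that the check is exact. The function names are hypothetical and the ROM contents are not taken from the paper.

```python
def build_partitioned_roms(coeff: int, wordlength: int):
    """Build two ROM's of 2^(L/2) words each for a fixed coefficient.

    rom_low holds coeff * (low half of x); rom_high holds
    coeff * (high half of x, shifted back to its weight).
    Assumes an even wordlength and unsigned inputs.
    """
    half = wordlength // 2
    rom_low = [coeff * v for v in range(1 << half)]
    rom_high = [coeff * (v << half) for v in range(1 << half)]
    return rom_low, rom_high


def rom_multiply(x: int, rom_low, rom_high, wordlength: int) -> int:
    """Return coeff * x using two table lookups and one addition."""
    half = wordlength // 2
    mask = (1 << half) - 1
    return rom_high[x >> half] + rom_low[x & mask]


if __name__ == "__main__":
    L = 8
    low, high = build_partitioned_roms(37, L)
    print(len(low) + len(high))           # total words: 2^(L/2+1)
    print(rom_multiply(200, low, high, L), 37 * 200)
```

The split works because x = x_high * 2^(L/2) + x_low, so the product with a fixed coefficient is the sum of two partial products, each addressed by only L/2 bits.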
II. ALGORITHM DERIVATION AND ANALYSIS

A. The Derivation for Cyclic Convolution
The 1-D DFT and DCT of the input sequence {y(i),
..., N - 1} denotes the kernels of transforms. Generalizing from our previous approaches [8], [14], we can formulate (1) as
where
and the sequence {x(i), i = 0, 1, ..., N - 1} is defined as
The value of m is determined by the following equation
where "$\langle g^{i} \rangle$" denotes the result of the "$g^{i}$ modulo N" operation for short and "g" is a primitive element. The details of the derivation from (1) to (2) for the DFT and DCT can be found in Appendix A and Appendix B, respectively. From (2c), we know that the sequence {T(k), k = 1, ..., N - 1} is the cyclic convolution of the sequence {x(i), i = 1, ..., N - 1} and the kernels {$h(\langle g^{i+k} \rangle)$, i, k = 1, ..., N - 1}. To illustrate the difference between (1) and (2) clearly, the matrix representations of the 5-point DFT based on (1) and (2) are individually shown in the following
where "W" and "g" are assumed to be $\exp(-j2\pi/5)$ and 2, respectively. Note that the elements in the matrix of (3b) have the same value along each diagonal line, excluding the 1's in the first row and the first column, while those of (3a) do not have this property. If (3a) is directly realized by using linear arrays similar to that in [9], each PE needs one input channel to receive a kernel value W at each time step. In total, the number of W's to be transmitted to the arrays is $N^{2}$. Such arrays require large numbers of I/O channels and high I/O bandwidth. On the other hand, if the special structure of the W's in the matrix of (3b) is efficiently utilized, the W's can be transmitted to the arrays through only one input channel at the array boundary, and the number of W's to be transmitted to the arrays is only N. Hence, the number of I/O channels and the I/O bandwidth are reduced by a factor of N. Moreover, (3b) induces high computing parallelism and low computational complexity as analyzed in [8]. In the following subsection, we shall further modify the algorithm so that the multipliers in the arrays can be replaced by ROM's efficiently.
Fig. 1 illustrates the basic motivation for the further derivation of (2). As shown in Fig. 1(a), if a multiplier with two time-variant operands a and b is replaced by a ROM, the required ROM size equals $2^{2L}$ words, where L is the input wordlength. If one of the operands in the multiplier is fixed, then the multiplier can be replaced by a ROM with size equal to $2^{L}$ words as shown in Fig. 1(b). Since the ROM is used to perform multiplications, it can be further replaced by two small ROM's and an adder as shown in Fig. 1(c). The size of each ROM equals $2^{L/2}$ words and the total size of the ROM's equals $2^{L/2+1}$ words. This partition scheme is similar to the partial sums technique used in [20]. As can be noted, the partition scheme introduces an additional adder although the memory size is reduced. Further partitions of the ROM's are possible, but the trade-off lies in the relative cost of memory and adders. An analysis based on one implementation technology found that a ROM with 8-bit addresses is best partitioned once [20]. In the rest of this section, (2) is further modified such that the multipliers used in an array can be replaced by ROM's based on the method illustrated in Fig. 1(b). Then, we shall partition the ROM's based on the method illustrated in Fig. 1(c). It is arbitrarily assumed in this paper that the ROM's are partitioned once.
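The cyclic structure obtained by permuting a prime-length DFT with a primitive element can be checked numerically with the following Python sketch. It is not a transcription of (2): the zero-based exponent range i, k = 0, ..., N - 2, the use of y(i) directly as the permuted sequence, and all function names are assumptions made here for illustration. The kernel exponent depends only on (i + k) modulo (N - 1), which is the property that gives the permuted kernel matrix its repeated diagonals.

```python
import cmath


def dft_direct(y):
    """Reference N-point DFT computed from the definition."""
    n = len(y)
    w = cmath.exp(-2j * cmath.pi / n)
    return [sum(y[i] * w ** (i * k) for i in range(n)) for k in range(n)]


def kernel_exponents(n, g):
    """Row k, column i holds <g^(i+k)>, the exponent of W for the permuted data."""
    powers = [pow(g, e, n) for e in range(n - 1)]
    return [[powers[(i + k) % (n - 1)] for i in range(n - 1)] for k in range(n - 1)]


def dft_by_cyclic_structure(y, g):
    """N-point DFT (N prime, g a primitive element) via index permutation."""
    n = len(y)
    w = cmath.exp(-2j * cmath.pi / n)
    powers = [pow(g, e, n) for e in range(n - 1)]  # a permutation of 1..N-1
    exps = kernel_exponents(n, g)
    out = [0j] * n
    out[0] = sum(y)  # zero-frequency term is handled separately
    for k in range(n - 1):
        t = sum(y[powers[i]] * w ** exps[k][i] for i in range(n - 1))
        out[powers[k]] = y[0] + t
    return out


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5]
    print(kernel_exponents(5, 2))
    print(dft_by_cyclic_structure(data, 2))
    print(dft_direct(data))
```

Each row of the exponent table is the previous row rotated by one position, so only N - 1 distinct kernel values are ever needed.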
B. The Derivation for Efficient ROM Substitution
Considering (3b), the sequence {x(i), i = 0, 1, ..., N - 1} is time-variant. The W's are transmitted from the PE at the array boundary through the internal PE's for proper computations. Hence, the two operands of the multipliers in the PE's are both time-variant. As illustrated in Fig. 1(a), such multipliers cannot be replaced by ROM's efficiently. In order to efficiently replace multipliers by ROM's, (2c) can be reformulated as
based on the commutative property of cyclic convolution. To illustrate that (4) benefits the efficient ROM substitution, the matrix representation of the 5-point DFT based on (4) can be expressed as
x(3) x(1) x(2) x(4)
where "W" and "g" are assumed to be $\exp(-j2\pi/5)$ and 2, respectively. It is noted from (5) that the input data x(i)'s are transmitted among PE's and are time-variant, but the (N - 1) W's are respectively allocated to (N - 1) PE's and are time-invariant. Hence, one operand in a multiplier is fixed and the multipliers used to implement the multiplications in (5) can be replaced by ROM's and adders based on the methods illustrated in Fig. 1(b) and (c). In the following section, the architectures that realize the DFT and DCT based on this algorithm are presented.
where "W" and "g" are assumed to be $\exp(-j2\pi/5)$ and 2, respectively. Fig. 2 shows the memory-based systolic array for the 5-point DFT where consecutive DFT calculations are assumed. The first input and output data bundles are denoted as x1(i) and Y1(k), the second input and output data bundles are denoted as x2(i) and Y2(k), and so on. The time instants for the input and output data bundles are indicated in the same row as each datum. In the array shown in Fig. 2, the input data are piped in from the left-most PE while the output data are drained out from the right-most PE. Hence, the I/O channels are all located at the boundary PE's, which makes the I/O cost independent of the array length N. The (N - 1) twiddle factors W are stored in the (N - 1) PE's, respectively. As a result, each multiplier in the PE's can be efficiently replaced by two small ROM's and an adder as illustrated in Fig. 1(b) and (c). Fig. 2(b) and (c) illustrate the functions and structures of the PE's in the array. Fig. 2(d) illustrates the permutation stage of the array, which performs the permutations and order arrangements of the input data. A RAM buffer and an address generation unit are used to implement the data permutations. To illustrate the advantages of the ROM substitution, the analysis model presented in [30] is used to analyze the hardware and time complexity. If a complex multiplication is implemented by ROM's with size equal to $2^{5}$ words and two 8-bit adders as shown in Fig. 3(a), the cost function of hardware complexity and the time delay are about 272 and 14t, respectively, where t is the gate delay time. If a complex multiplication is implemented by two multipliers as shown in Fig. 3(b), the cost function of hardware complexity and the time delay are about 1920 and 32t, respectively.
To help analyze the activity of the array shown in Fig. 2
Fig. 4 depicts the activity of the array shown in Fig. 2 at six successive clocks from t = 8 to t = 13, where $y_{k}^{p}$ denotes the iterated result $y_{k}$ in (7) of the pth data bundle. The right-most PE has a 2-bit control link named "Tag2" and all the PE's have 1-bit control links named "Tag1." Link "Tag1" instructs the PE's to select the appropriate input data, and link "Tag2" instructs the right-most PE to perform the correct operations. Based on the control scheme named "Tag control" [18], the data in the local registers of each PE can be controlled from the input channels at the extreme ends of a linear array. The hardware overhead paid for this control scheme in each PE is about a 1-bit link and one multiplexer. The time overhead is $(N - 1)T_{cycle}$, where $T_{cycle}$ is the cycle time of the array. However, the time overhead can be hidden by overlapping the computation time of two consecutive DFT calculations. As depicted in Fig. 2, there is no extra time between the data bundle x1(i)'s and the data bundle x2(i)'s, or between the data bundle Y1(k)'s and the data bundle Y2(k)'s. In other words, this control scheme adds overhead to the latency time but not to the average computation time for a DFT problem. The latency time is defined as the time from the input of the first datum to the output of the final datum for a DFT calculation. The average computation time is defined as the minimum execution time between the first datum of the current data bundle and the first datum of the next data bundle for consecutive DFT calculations. This behavior can also be checked from the array activity shown in Fig. 4. From t = 8 to t = 13, the array calculates the first DFT problem by using x1(i)'s and simultaneously fetches x2(i)'s for the second DFT problem. It is this concurrent computing style that favors the average computation time of the presented DFT array. Due to the modularity of the presented array, it is very easy to extend the 5-point DFT array to a longer one based on the same topology. It is noted from Fig. 2 that the overall hardware cost of the DFT array is linearly proportional to N, and the number of I/O channels is independent of N. If N becomes large enough to induce unacceptable hardware cost, the efficient partition techniques which have been investigated in our previous paper [8] can be used to realize the presented array with a reasonable number of PE's.
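The idea that each PE holds one fixed twiddle factor and obtains its products by table lookup can be sketched behaviorally in Python as follows. This is not a cycle-accurate model of Fig. 2: the systolic timing, the Tag1/Tag2 control, the ROM partitioning, and the permutation stage are all omitted. The index convention (PE j holds $W^{\langle g^{j} \rangle}$ and receives $x(\langle g^{j-k} \rangle)$ when output $Y(\langle g^{k} \rangle)$ is formed), the unsigned integer inputs, and the function names are assumptions made here for illustration.

```python
import cmath


def dft_reference(x):
    """Reference N-point DFT computed from the definition."""
    n = len(x)
    w = cmath.exp(-2j * cmath.pi / n)
    return [sum(x[i] * w ** (i * k) for i in range(n)) for k in range(n)]


def build_pe_roms(n, g, wordlength):
    """One lookup table per PE: rom[j][v] = v * W^(<g^j>) for every L-bit value v."""
    w = cmath.exp(-2j * cmath.pi / n)
    return [
        [v * w ** pow(g, j, n) for v in range(1 << wordlength)]
        for j in range(n - 1)
    ]


def dft_with_fixed_coefficient_pes(x, g, roms):
    """Behavioral model: every product is a ROM lookup in a PE with a fixed W."""
    n = len(x)
    powers = [pow(g, e, n) for e in range(n - 1)]
    out = [0j] * n
    out[0] = sum(x)
    for k in range(n - 1):
        acc = complex(x[0])
        for j in range(n - 1):
            sample = x[powers[(j - k) % (n - 1)]]  # datum seen by PE j
            acc += roms[j][sample]                  # lookup replaces a multiplier
        out[powers[k]] = acc
    return out


if __name__ == "__main__":
    samples = [12, 200, 7, 255, 64]          # unsigned 8-bit inputs
    roms = build_pe_roms(5, 2, 8)
    print(dft_with_fixed_coefficient_pes(samples, 2, roms))
    print(dft_reference(samples))
```

Because the coefficient of each PE never changes, the table for a PE is addressed by the datum alone, which is what keeps the per-PE memory at $2^{L}$ words before partitioning.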
Including the cost of the permutation stage, the overall hardware cost of the designed array for the N-point DFT consists of $(N - 1) \times 2^{L/2+1}$ words of ROM, 2N adders, N + 5 multiplexers, one RAM module with size equal to 2N - 2 words, an address generation unit, and 2N - 2 shift registers, where the ROM's are assumed to be partitioned once. It is seen that the hardware cost of the presented array is only proportional to N. However, the architectures using the DA approach require on the order of $N \times 2^{N/2+1}$ words of ROM for computing an N-point transform, where the ROM's are also assumed to be partitioned once. It is noted that the hardware cost of the architectures using the DA approach increases exponentially as N increases.
Therefore, the presented approach requires less hardware cost than the DA approach does when N > L, which occurs in most DFT applications. In the following, the analysis model presented in [30] is used to evaluate the cost of the designed arrays. A cost function derived in [30] is adopted for objective cost evaluation. According to this model, the ratio of the cost function of the permutation stage to that of the whole DFT array is about (37N + 73)/(336N + 135). As N becomes large, the cost of the permutation stage approaches about 11% of the overall cost of the N-point DFT array, where the wordlength is assumed to be 8 bits. For large N, this percentage is essentially unaffected by the value of N.
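The cost ratio quoted above can be evaluated directly. The following Python sketch only evaluates the expression (37N + 73)/(336N + 135) given in the text; the function name is hypothetical.

```python
def permutation_stage_fraction(n: int) -> float:
    """Fraction of the DFT array cost taken by the permutation stage (8-bit wordlength)."""
    return (37 * n + 73) / (336 * n + 135)


if __name__ == "__main__":
    for n in (5, 17, 101, 1009):
        print(n, round(permutation_stage_fraction(n), 4))
    print("limit", round(37 / 336, 4))
```

The fraction decreases with N and tends to 37/336, which is the figure of about 11% stated for large N.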
To sum up, the designed array has several distinctive features. In the first place, the input data and the computed results are piped in and drained out from the I/O channels at the extreme ends of a linear array. Hence, low I/O bandwidth and a small number of I/O channels can be achieved. Secondly, all the multiplications are efficiently realized by ROM's and adders to attain the benefits in hardware realization and computing speeds. Thirdly, the presented architecture takes the advantages of systolic arrays such as locality, modularity, pipelinability, and parallelism among PE's. Also, it utilizes the memory-based implementation to attain low hardware cost and high computing speeds inside PE's.
B. The Hardware Architecture for 1-D 7-Point DCT
Based on the presented algorithm, DCT can be formulated as 
The value of m used above is determined by the following equation
$$\langle 3^{i} \rangle + m \times 7 = \langle 3^{i-k} \rangle \times \langle 3^{k} \rangle, \qquad i, k = 1, 2, \ldots, 6 \tag{9}$$
where "$\langle 3^{i} \rangle$" denotes the result of the "$3^{i}$ modulo 7" operation for short. For the purpose of showing the presented
[PE functions of the DCT array figure: Tag2' <- Tag2; x1' <- x1; x3' <- x3; x4' <- x4; x5' <- x5; x6' <- x6; y1' <- y1 + z'C]
Similar to the DFT array, the cost of the preprocessing stages as well as the overall cost of the DCT array are both linearly proportional to N. This fact reveals that the percentage of the overall cost of the DCT array occupied by the preprocessing stages is bounded and, for large N, not affected by the value of N. Moreover, all the features of the DFT array are also possessed by the DCT array, including high computing speeds, low hardware cost, low hardware complexity of PE's, a small number of I/O channels, and low I/O bandwidth. As a whole, the presented DFT and DCT arrays not only provide good performance in the hardware complexity of PE's, I/O cost, and throughput rate, but also possess feasible VLSI structures inside and among PE's.
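The integer m appearing in (9) can be tabulated with the following Python sketch. It assumes that (9) reads $\langle 3^{i} \rangle + 7m = \langle 3^{i-k} \rangle \times \langle 3^{k} \rangle$, which is a reading of a damaged equation rather than a confirmed form, and it reduces the exponent i - k modulo N - 1 so that negative exponents are well defined. The function name is hypothetical.

```python
def m_table(n: int, g: int) -> dict:
    """Return {(i, k): m} with <g^i> + m*n == <g^(i-k)> * <g^k> for i, k = 1..n-1.

    Assumes n is prime and g is a primitive element modulo n, so that
    g^(n-1) = 1 (mod n) and exponents may be reduced modulo n-1.
    """
    table = {}
    for i in range(1, n):
        for k in range(1, n):
            residue = pow(g, i, n)
            product = pow(g, (i - k) % (n - 1), n) * pow(g, k, n)
            m, remainder = divmod(product - residue, n)
            if remainder != 0:
                raise ValueError("g is not compatible with the assumed relation")
            table[(i, k)] = m
    return table


if __name__ == "__main__":
    table = m_table(7, 3)
    for i in range(1, 7):
        print([table[(i, k)] for k in range(1, 7)])
```

Under this reading m is always a nonnegative integer, since the product of the two residues is congruent to $\langle 3^{i} \rangle$ modulo 7 and is never smaller than it.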
IV. CONCLUSION
The efficient memory-based VLSI array designs for the DFT and DCT have been presented. A new design approach for the designed arrays has also been presented in this paper. This approach has been shown to provide a method to derive systolic algorithms for linear arrays and an efficient technique to replace multipliers by ROM's. Two linear systolic arrays have been designed for the DFT and DCT individually. The designed arrays have been shown to have good performance in the architectural topology (local and regular connection), computing speeds, hardware complexity, the number of I/O channels, and I/O bandwidth.
In a few words, the presented approach formulates the DFT and DCT as cyclic convolutions, maps the convolutions into VLSI arrays, and considers the issue of efficient hardware implementation such as using small ROM's for multipliers. This approach can also be applied to other applications. For example, the transform kernels of the discrete sine transform (DST) have similar properties to those of the DCT [10]. Besides, the kernels of the discrete Hartley transform (DHT) and the DFT have similar forms to each other. Therefore, based on the presented approach, the DST and DHT can also be formulated as cyclic convolutions, mapped into VLSI arrays, and implemented by using small ROM's instead of multipliers. If long length DFT and DHT are considered, the efficient partition techniques which have been discussed in our previous paper [8] can be used to effectively reduce the hardware cost in the designed arrays.
A restriction of the presented approach is that the transform length N should be prime. This restriction should not severely limit the DFT because a non-prime length input sequence can be appended with zeros to attain prime length. The appending operation affects the energy of the output sequence but does not influence its shape. The restriction of prime length gives a more severe limitation to the DCT. However, we may utilize some application properties to avoid this restriction. For example, the DCT is most widely used in image coding. One problem facing DCT coding is the blocking effect [31]. The overlap method is one of the remedies for this problem [31]. We may utilize the overlap method to append pixels to a non-prime length to attain prime length, thereby avoiding the restriction of prime length while solving the blocking effect.
where "W" is assumed to be $\exp(-j2\pi/N)$. To formulate (A1) as a cyclic convolution [17], the periodic property "$W^{N} = 1$" and the I/O data permutations based on the structure of Galois Field are utilized.
If N is a prime number, there exists some number "g," not necessarily unique, such that there is a one-to-one mapping
$$j = g^{i} \text{ modulo } N. \tag{A2}$$
In the following, "$\langle g^{i} \rangle$" denotes the result of the "$g^{i}$ modulo N" operation for short. The DFT in (A1) will be rewritten with i and k as the powers of a primitive element "g." Because i and k take on the value zero, which is not a
To replace i, k by "$\langle g^{i} \rangle$", "$\langle g^{k} \rangle$" and introduce a sequence
Similarly, if N is a prime number, the mapping relationship defined in (A2) can also be applied to reformulate (B2) and (B3). Therefore, (B2) and (B3) can be written with i and k as powers of the primitive element "g." Because i and k take on the value zero, which is not a power of "g," the zero frequency component must be treated specially, i.e.,
To replace i and k by "$\langle g^{i} \rangle$" and "$\langle g^{k} \rangle$", (B7) can be finally
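The existence, and non-uniqueness, of the primitive element "g" used in the mapping (A2) can be checked by brute force. The following Python sketch lists every g whose powers modulo a prime N visit each of 1, ..., N - 1 exactly once; the function names are hypothetical and the exhaustive search is chosen for clarity, not efficiency.

```python
def primitive_elements(n: int) -> list:
    """Return all g in 2..n-1 whose powers g^1..g^(n-1) modulo n cover 1..n-1.

    Intended for a prime n >= 3; uses exhaustive search.
    """
    full_set = set(range(1, n))
    return [
        g for g in range(2, n)
        if {pow(g, i, n) for i in range(1, n)} == full_set
    ]


def index_mapping(n: int, g: int) -> list:
    """Return [<g^1>, <g^2>, ..., <g^(n-1)>], the permuted index order."""
    return [pow(g, i, n) for i in range(1, n)]


if __name__ == "__main__":
    print(primitive_elements(5), index_mapping(5, 2))
    print(primitive_elements(7), index_mapping(7, 3))
```

For N = 5 and N = 7 the search returns the elements 2 and 3, respectively, which are the values of g used in the 5-point DFT and 7-point DCT examples of the paper, along with one further primitive element in each case.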
