A new VLSI algorithm and its associated systolic array architecture for a prime-length type IV discrete cosine transform (DCT-IV) are presented. Together they form the basis of an efficient design approach for deriving a linear systolic array architecture for the type IV DCT. The proposed algorithm uses a regular computational structure, called the pseudo-band correlation structure, that is well suited to VLSI implementation. The algorithm is then mapped onto a linear systolic array with a small number of I/O channels and low I/O bandwidth. Because the kernel is similar, the proposed architecture can be unified with the one obtained for the type IV DST. A highly efficient VLSI chip can thus be obtained, with architectural topology, computing parallelism, processing speed, hardware complexity and I/O costs similar to those obtained for circular correlation and cyclic convolution computational structures.
Introduction
Type IV DCT and DST (DCT-IV and DST-IV) were first introduced by Jain as members of the sinusoidal family of unitary transforms [1]. The DCT-IV and DST-IV have found important applications in signal processing, the most recent being the fast implementation of lapped orthogonal transforms for signal and image coding, perfect-reconstruction cosine-modulated filter banks, and some filter banks used in digital audio compression [2] [3] [4]. Since their introduction, several software implementations of the type IV DCT have been presented, but only a few efficient hardware solutions have been reported. For example, [5] and [6] propose new algorithms for the type IV DCT that are suited to software implementation. DCT-IV and DST-IV are computationally intensive, and programmable general-purpose architectures do not meet the speed requirements of many real-time applications. It is therefore useful to develop efficient hardware implementations using VLSI technology. To do so, it is necessary to modify the existing algorithms appropriately or, more efficiently, to derive new VLSI algorithms that meet the requirements of real-time applications. It is well known that the efficiency of a VLSI algorithm depends more on its communication complexity than on its computational complexity, because the data flow plays a central role in the design of a VLSI architecture. Thus, regular and modular computational structures such as cyclic convolution and circular correlation [7] [8] [9] [10] have been shown to offer good implementation solutions for discrete transforms using systolic arrays [11] and distributed arithmetic [12]. They lead to efficient VLSI implementations with low I/O cost, reduced hardware complexity, high speed, and a regular and modular hardware structure.
In this paper we show that another regular computational structure, called pseudo-band correlation, can also be used efficiently with the systolic array architectural paradigm to obtain similarly good performance in the VLSI implementation of the type IV DCT. We present a new VLSI algorithm for a prime-length type IV discrete cosine transform that can be efficiently implemented on a linear systolic array, with a control structure based on the tag-control scheme. The proposed approach uses two auxiliary input and output sequences and an appropriate reordering of these sequences based on the properties of the Galois Field of the indexes. It is shown that it is possible to obtain advantages similar to those of systolic array implementations based on circular correlation, such as high speed, low I/O cost, and reduced hardware complexity, together with high regularity and modularity. The design technique is the data-dependence-graph-based procedure of [13]. Thus, we obtain a linear systolic array that can be efficiently controlled using a control technique specifically designed for systolic arrays [14]. The pre-processing stage converts the input sequence, using some multiplications, a recursive computation, and data reordering operations, into an auxiliary sequence that can be processed by the pseudo-band correlation structure. The post-processing stage converts the auxiliary output sequence into the final output sequence using some recursive computations and multiplications together with data reordering operations.
The computational complexity of the operations implemented in the pre- and post-processing stages is O(N), as opposed to O(N^2) for the hardware kernel that implements the pseudo-band correlation operation. The tag-control scheme is used to control the loading and draining of the data sequences into and from the internal registers of the systolic array using only I/O channels placed at the two ends of the linear array. The same control mechanism is used to select the operations and the sign of the operands in each processing element. Thus, using an appropriate reformulation technique and choosing a linear systolic array as the VLSI architecture paradigm, we can obtain high computing speed with low I/O cost and hardware complexity for a prime-length type IV DCT, together with the other advantages of linear systolic array implementations of circular correlation structures, such as regularity, modularity, and local connectivity. Moreover, because the computational kernel is similar to that of the type IV DST, an efficient unified chip for the type IV DCT and DST can be obtained using this approach. The paper is organized as follows: Section 2 presents a new algorithm for the type IV DCT, followed by an example in Section 3. Section 4 describes the hardware realization using the systolic array architectural paradigm. Section 5 presents the conclusions.
Systolic Algorithm for 1-D Type IV DCT
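For the real input sequence {x(i) : i = 0, 1, ..., N − 1}, the 1-D type IV DCT (DCT-IV) is defined as:

$$X(k) = \sqrt{\frac{2}{N}} \sum_{i=0}^{N-1} x(i) \cos\left[\frac{(2i+1)(2k+1)\pi}{4N}\right], \quad k = 0, 1, \ldots, N-1 \qquad (1)$$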
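As a numerical reference for the definition, the following minimal sketch evaluates the standard DCT-IV sum X(k) = sqrt(2/N) Σ x(i) cos[(2i + 1)(2k + 1)π/(4N)] directly, at O(N^2) cost. It is a plain software check, not the proposed VLSI algorithm; the function name `dct4_direct` and the `scaled` flag are illustrative assumptions.

```python
import math


def dct4_direct(x, scaled=True):
    """Directly evaluate the DCT-IV of a real sequence x in O(N^2) operations.

    With scaled=True the factor sqrt(2/N) is applied; with scaled=False it is
    omitted, which corresponds to dropping the constant coefficient.
    """
    n = len(x)
    scale = math.sqrt(2.0 / n) if scaled else 1.0
    out = []
    for k in range(n):
        acc = 0.0
        for i in range(n):
            angle = (2 * i + 1) * (2 * k + 1) * math.pi / (4 * n)
            acc += x[i] * math.cos(angle)
        out.append(scale * acc)
    return out
```

With the sqrt(2/N) factor the DCT-IV matrix is symmetric and orthogonal, so applying `dct4_direct` twice should return the original sequence up to rounding, which gives a simple self-check for any faster implementation.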
In the following, to simplify the presentation, we drop the constant coefficient $\sqrt{2/N}$ from the definition of the DCT-IV; a multiplier added at the end of the VLSI array scales the output sequence by this constant. In order to reformulate relation (1) in a band-correlation form, we introduce some auxiliary sequences and use the properties of the Galois Field of indexes to appropriately permute the input and output sequences. The output sequence {X(k) : k = 1, 2, ..., N − 1} can be computed as follows:
The auxiliary input sequence, indexed from 0 to N − 1, is defined as follows:
If the transform length N is a prime number, the new auxiliary output sequence {T(k) : k = 1, 2, ..., N − 1} can be computed as a band-correlation as follows:
where the subscript N denotes reduction modulo N.
We have used the properties of the Galois Field of indexes to express the computation of the auxiliary output sequence {T(k) : k = 1, 2, ..., N − 1} in correlation form.
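The index reordering rests on a standard property of prime moduli: the powers of a primitive root visit every nonzero index exactly once. The sketch below only illustrates that property; the function names are assumptions, and the exact way these indices enter the permuted sequences of the algorithm is not reproduced here.

```python
def is_primitive_root(g, n):
    """Return True if the powers of g cover all nonzero residues modulo n."""
    seen = set()
    value = 1
    for _ in range(n - 1):
        value = (value * g) % n
        seen.add(value)
    return len(seen) == n - 1


def index_permutation(g, n):
    """Return [g^1 mod n, g^2 mod n, ..., g^(n-1) mod n] for a primitive root g."""
    if not is_primitive_root(g, n):
        raise ValueError(f"{g} is not a primitive root modulo {n}")
    return [pow(g, k, n) for k in range(1, n)]


print(index_permutation(2, 11))  # -> [2, 4, 8, 5, 10, 9, 7, 3, 6, 1]
```

For N = 11 and the root 2, the result is a permutation of the indices 1 to 10, which is what allows the input and output sequences to be reordered without losing or repeating any element.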
An Example
To illustrate our approach, we consider an example of a 1-D type IV DCT of length N = 11 with the primitive root 2. First, we compute the two auxiliary input sequences:
Then, we recursively compute the following auxiliary input sequence, indexed from 0 to N − 1:
We can write (9) in matrix-vector form. The rows of the kernel matrix that remain legible carry the arguments

(5) (10) (9) (7) (3)
(10) (9) (7) (3) (6)
(9) (7) (3) (6) (1)

each row being the previous one shifted by one position,
where (k) is used as shorthand for 2 sin(2kα) and the signs of the terms in relation (9) are given by the following matrix:
01 01 11 01 11
01 11 01 11 11
11 01 11 11 11
00 10 10 00 00
11 11 11 01 11
10 10 00 00 00
10 00 10 10 00
00 10 00 10 00
11 01 01 01 11
00 00 00 00 00
where, in each two-bit entry,
• the first bit denotes the sign before the brackets,
• the second bit denotes the sign inside the brackets,
and a "1" bit indicates the minus sign (first bit) or the subtraction operation (second bit).
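The two-bit encoding described above can be decoded as in the following minimal sketch. The operands `a` and `b` stand for the two values inside the brackets and are hypothetical placeholders, as is the function name `apply_sign_code`.

```python
def apply_sign_code(code, a, b):
    """Evaluate a bracketed term according to a two-bit sign code.

    code[0] == "1": minus sign in front of the brackets.
    code[1] == "1": subtraction inside the brackets (otherwise addition).
    """
    if len(code) != 2 or any(bit not in "01" for bit in code):
        raise ValueError("code must be a two-character string of 0s and 1s")
    inner = a - b if code[1] == "1" else a + b
    return -inner if code[0] == "1" else inner


for code in ("00", "01", "10", "11"):
    print(code, apply_sign_code(code, 3, 1))
# 00 ->  (3 + 1) =  4
# 01 ->  (3 - 1) =  2
# 10 -> -(3 + 1) = -4
# 11 -> -(3 - 1) = -2
```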
From equation (14) we can see that all the elements along each secondary diagonal of the matrix are the same. We call this regular computational structure band-correlation. Because the signs differ, as can be seen from the SIGN matrix, we call the resulting structure a pseudo-band correlation structure. Note that the computational structure given by equation (14) is similar to that obtained for the type IV DST; thus, an efficient unified VLSI architecture can be obtained. Then, we compute the auxiliary output sequence as follows:
Finally, the output sequence {X(k) : k = 1, 2, ..., N − 1} can be computed recursively as follows:
for k = 1, ..., 10.
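The band structure noted in this example, where entries along each secondary diagonal coincide, means the whole matrix is determined by a single sequence. The sketch below builds such a matrix and evaluates the matrix-vector product, with an optional matrix of +1/-1 signs to mimic the pseudo-band variant. It uses plain numbers in place of the trigonometric kernel values and operand combinations of the actual algorithm, and all names are illustrative assumptions.

```python
def band_matrix(seq, cols):
    """Build a matrix whose entry (r, c) is seq[r + c].

    Entries are constant along each secondary diagonal, so every row is the
    previous row shifted by one position.
    """
    rows = len(seq) - cols + 1
    if rows < 1:
        raise ValueError("sequence too short for the requested width")
    return [[seq[r + c] for c in range(cols)] for r in range(rows)]


def pseudo_band_correlation(seq, operand, signs=None):
    """Multiply the band matrix built from seq by operand.

    signs, if given, is a matrix of +1/-1 values applied entry by entry.
    """
    matrix = band_matrix(seq, len(operand))
    out = []
    for r, row in enumerate(matrix):
        acc = 0
        for c, coeff in enumerate(row):
            sign = 1 if signs is None else signs[r][c]
            acc += sign * coeff * operand[c]
        out.append(acc)
    return out


# Kernel arguments in the order given by successive powers of 2 modulo 11.
for row in band_matrix([5, 10, 9, 7, 3, 6, 1], 5):
    print(row)
```

Because each row reuses the previous row's values one position over, a linear array can pass the coefficients from one processing element to its neighbour instead of reloading them, which is the property that makes this structure attractive for a systolic implementation.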
Hardware Realization
In order to use the method presented in [13], we need the recursive form of equation (9). From it we obtain the data dependence graph of the proposed algorithm, which clearly shows the data dependencies, data operations, and control signals involved. Using the proposed VLSI algorithm and the data-dependence-graph-based procedure presented in [13], we map the algorithm onto the systolic array shown in Figure 1. The processing elements (PEs) have the function shown in Figure 1b. The PEs of the kernel module, which implements the pseudo-band correlation structure and is the hardware core of the VLSI architecture, execute the operations from relation (9). In order to obtain a linear structure with I/O channels only at the boundary PEs, we use a control scheme known in the literature as the tag-control scheme [14]. Two control signals are used: "tc" selects the correct operand in the operations executed by the PEs, and "sign" selects the right operand and the right sign in those operations. The role of these signals can be seen in Figures 1a and 1b. Apart from the hardware kernel, the overall architecture has a pre-processing stage and a post-processing stage. The pre-processing stage produces the auxiliary input sequences, in the appropriate form, that are fed into the hardware kernel; it implements equations (6), (7) and (8). It has two multiplication units that implement equation (8) and a subtraction module that implements equations (6) and (7), followed by a permutation module. Thus, the input sequence is permuted and used to generate the required combinations of data operands. The role of the post-processing stage is to obtain the final output sequence in natural order from the auxiliary output sequence, which is produced in a permuted order.
It implements equations (3), (4) and (5) and permutes the auxiliary output sequence into the right order. The post-processing stage contains a permutation block, consisting of a multiplexer and some latches, that permutes the auxiliary output sequence in a fully pipelined mode. The average computation time of the proposed VLSI array for an N-point type IV DCT is (N − 1)Tcycles and the number of multipliers is only (N − 1)/2 + 2. Hence, high processing speed with low hardware complexity can be obtained. Compared with the hardware implementation of [15], the hardware complexity is reduced by half at the cost of a larger latency.
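To make the stated cost figures concrete, the following small sketch evaluates them for a given prime length. It reads the stated computation time as N − 1 clock cycles, which is an interpretation, and the function name `array_costs` is an illustrative assumption.

```python
def array_costs(n):
    """Return the stated cycle and multiplier counts for a prime length n.

    cycles      = n - 1            (computation time, read as clock cycles)
    multipliers = (n - 1) / 2 + 2  (kernel multipliers plus two more)
    """
    if n < 3 or any(n % d == 0 for d in range(2, int(n ** 0.5) + 1)):
        raise ValueError("n must be an odd prime")
    return {"cycles": n - 1, "multipliers": (n - 1) // 2 + 2}


print(array_costs(11))  # -> {'cycles': 10, 'multipliers': 7}
```

For the N = 11 example this gives 10 cycles and 7 multipliers, so the multiplier count grows roughly as N/2 rather than N.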
Conclusion
In this paper, a new VLSI algorithm together with its associated systolic-array-based architecture for a prime-length type IV DCT was presented. The proposed algorithm is based on a regular computational structure, called pseudo-band correlation, that was efficiently implemented using the systolic array architecture paradigm, with low I/O costs, a high degree of parallelism, and a good architectural topology with a high degree of regularity and modularity. Thus, a new systolic array with high computing speed and parallelism and low computational and I/O costs was obtained for a prime-length type IV DCT. Moreover, the proposed architecture can be efficiently unified with that obtained for the type IV DST, owing to the similar kernel.
