Abstract-An efficient approach to design very large scale integration (VLSI) architectures and a scheme for the implementation of the discrete sine transform (DST), based on an appropriate decomposition method that uses circular correlations, is presented. The proposed design uses an efficient restructuring of the computation of the DST into two circular correlations, having similar structures and only one half of the length of the original transform; these can be concurrently computed and mapped onto the same systolic array. Significant improvement in the computational speed can be obtained at a reduced input-output (I/O) cost and low hardware complexity, retaining all the other benefits of the VLSI implementations of the discrete transforms, which use circular correlation or cyclic convolution structures. These features are demonstrated by comparing the proposed design with some of the recently reported schemes.
I. INTRODUCTION
THE discrete sine transform (DST), along with the discrete cosine transform (DCT), is one of the key functions used in many signal and image processing applications, especially in transform coding. For images with high correlation, the DCT yields better results; however, for images with a low correlation of coefficients, the DST yields lower bit rates [2]. The DST is signal independent and represents a good approximation of the statistically optimal Karhunen-Loeve transform [1]. The DST constitutes the basis of the recursive block coding technique [2] and is used in a fast implementation of lapped orthogonal transforms [3].
Since the DST is computationally intensive, the derivation of new efficient algorithms for its parallel very large scale integration (VLSI) implementation is highly desirable. Data movement and transfer play an important role in determining the efficiency of a VLSI implementation of hardware algorithms [4]. This explains why the use of cyclic convolution and circular correlation structures provides high computing speed, low computational complexity, and low I/O bandwidth, as has already been shown for the discrete Fourier transform (DFT) [5] and for the DCT [6]. Because these structures have a simple and regular data flow and are easy to implement through modular and regular hardware techniques, such as distributed arithmetic [7] and systolic arrays [8], the conversion of the DST into a cyclic convolution or a circular correlation structure leads to an efficient solution for its VLSI implementation.
In this paper, we propose a new input sequence and appropriate index mappings to arrive at an efficient conversion of a prime-length DST into two parallel circular correlation structures of one half of the original length. Substantial improvement in the processing speed of the VLSI realization is thus obtained. This realization preserves all the advantages reported in [6] for the DCT. The two circular correlation structures have the same structure and length; only the control tags and the input and output sequences are different. Their data-dependence graphs can be mapped into systolic arrays, as shown in [9]. The systolic array implementations can be efficiently unified using the method proposed in [10]. There are some differences in the sign that are efficiently managed using the tag control scheme [11].
We can obtain a significant speed improvement with a slight increase in the hardware complexity compared with that of the schemes in [6], [12], and [13], preserving all the advantages of architectural topology, input-output (I/O) cost, and computational complexity of the VLSI implementations of the discrete transforms that use systolic arrays, based on circular correlation structures.
II. NEW ALGORITHM FOR THE DST
The DST of the input sequence is defined as [1] (1)
where
. If the transform length is a prime number greater than 2, we can introduce a new input sequence, which is defined as (2)
1053-587X/02$17.00 © 2002 IEEE
Using appropriate permutations of the new sequences, we can decompose the computation of the DST into two half-length circular correlations of the same structure as follows: (3) for (4) and (5) where the new output sequences and are defined as if otherwise (6) if otherwise.
We have used the following index mappings to reorder the input and output samples:
(10)
and (12)
where represents the primitive root of the Galois field of the indices, and represents modulo . Now, we can decompose the computation of the DST into two circular correlations, which are defined as (14) for (15) Using the equation (16) the symmetry property of the sine function, and taking into consideration (6) and (7), we finally obtain sgn (17) sgn for (18) and
The sign functions sgn and sgn in (17) and (18) are defined as sgn (21) sgn sgn (22) Equations (17) and (18) represent two half-length circular correlations that have the same length and structure (except for the sign) and can be computed concurrently. The differences in the sign can be easily handled using an appropriate control mechanism known as the tag control scheme [11]. Because the two correlations have the same form and length, we can obtain a significant hardware reduction by using an appropriate hardware sharing technique [10].
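The index-mapping idea behind this decomposition can be illustrated with a minimal sketch. It does not reproduce (17) and (18): it assumes a simplified kernel sin(2πik/N) over the indices 1, ..., N-1, and the helper names `primitive_root`, `direct_sine_sums`, and `correlation_sine_sums` are hypothetical. It only shows how reordering the input and output indices by powers of a primitive root turns a prime-length sine-kernel sum into a circular correlation.

```python
import math

def primitive_root(p):
    """Return the smallest primitive root modulo the odd prime p (brute force)."""
    for g in range(2, p):
        if len({pow(g, e, p) for e in range(p - 1)}) == p - 1:
            return g
    raise ValueError("no primitive root found; is p an odd prime?")

def direct_sine_sums(x, N):
    """Reference: y[k] = sum_{i=1}^{N-1} x[i-1] * sin(2*pi*i*k/N), k = 1..N-1.

    The kernel is an assumed, simplified stand-in for the DST kernel.
    """
    return {k: sum(x[i - 1] * math.sin(2 * math.pi * i * k / N)
                   for i in range(1, N))
            for k in range(1, N)}

def correlation_sine_sums(x, N):
    """Same sums, computed as one circular correlation of length N-1."""
    g = primitive_root(N)
    L = N - 1
    perm = [pow(g, m, N) for m in range(L)]            # index map m -> g^m mod N
    xp = [x[p - 1] for p in perm]                      # permuted input sequence
    h = [math.sin(2 * math.pi * p / N) for p in perm]  # permuted kernel values
    y = {}
    for n in range(L):
        # With i = g^m and k = g^n, i*k = g^(m+n) (mod N), so the kernel
        # index depends only on (m + n) mod L: a circular correlation.
        y[perm[n]] = sum(xp[m] * h[(m + n) % L] for m in range(L))
    return y
```

For example, `correlation_sine_sums([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 7)` agrees with `direct_sine_sums` on the same input up to rounding error.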
III. EXAMPLE
In order to illustrate the proposed approach, we use an example of the DST with length and the primitive root . In this case, we can compute the two circular correlations of length 3 using (17) and (18) where . The sign of the term in (23) and (24) can be analytically computed using the sign functions defined in (21) and (22), respectively, where sgn is used to determine the sign of the sine terms in (23) and sgn for the sign of the sine terms in (24).
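How the symmetry of the sine function halves the correlation length, and why sign corrections then appear, can be sketched numerically. The values N = 7 and g = 3 below are assumptions chosen so that the half length is 3; the kernel sin(2πik/N) is again a simplified stand-in, so this is not a reproduction of (23) and (24), and with this kernel a single half-length correlation suffices rather than two. The wrap-around sign in the inner loop plays a role analogous to the sign functions handled by the tag control scheme.

```python
import math

N, g = 7, 3                    # assumed values, for illustration only
M = (N - 1) // 2               # half length: 3
assert pow(g, M, N) == N - 1   # g^M is congruent to -1 modulo N

def full_sum(x, k):
    """Direct reference: sum_{i=1}^{N-1} x[i-1] * sin(2*pi*i*k/N)."""
    return sum(x[i - 1] * math.sin(2 * math.pi * i * k / N)
               for i in range(1, N))

def half_length_correlation(x):
    """Compute all N-1 outputs from one length-M correlation with signs."""
    perm = [pow(g, m, N) for m in range(M)]            # g^m mod N, m = 0..M-1
    # Fold the input: sin(2*pi*(N-i)*k/N) = -sin(2*pi*i*k/N).
    u = [x[p - 1] - x[N - p - 1] for p in perm]
    h = [math.sin(2 * math.pi * p / N) for p in perm]  # only M kernel values
    y = {}
    for n in range(M):
        acc = 0.0
        for m in range(M):
            # Wrapping past M multiplies the index by g^M = -1, flipping the sign.
            sign = 1.0 if m + n < M else -1.0
            acc += sign * u[m] * h[(m + n) % M]
        y[perm[n]] = acc          # output index k = g^n mod N
        y[N - perm[n]] = -acc     # output index N - k follows by symmetry
    return y
```

Calling `half_length_correlation(x)` on any six-sample input returns the same values as `full_sum(x, k)` for k = 1, ..., 6, up to rounding error, while using only three kernel values.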
The output samples are computed as follows:
and (27)
IV. ALGORITHM ANALYSIS AND IMPLEMENTATION CONSIDERATIONS
The data dependencies, data operations, and the control signals used in the new systolic implementation can be easily obtained from the data-dependence graphs of Fig. 1(a). The functions of the nodes are described in Fig. 1(b).
In order to estimate the speed performance and the parallelism involved in the computation of the proposed algorithm, we can use the critical computing path concept, which represents the longest path necessary for the signal to move from the input to the output in the data-dependence graphs. If we choose a systolic array implementation paradigm, then the time necessary for the signal to pass through the critical computing path equals , which is one half of that required by the schemes given in [6], [12], and [13]. Due to the cyclic property of the input sequence , we can overlap the first two terms of an input data sequence with the last two terms of the previous data sequence and thus reduce the average computing time from to . Using the dependence graph-based synthesis procedure [9] and the tag control scheme [11], we can obtain two linear systolic arrays of the same structure and length, having a reduced number of I/O channels placed at the two ends of the array. Only the number of control tag lines is dependent on the length of the array, but we have only such one-control-bit lines. We can further obtain a significant reduction in the hardware complexity by implementing the two VLSI structures on a single systolic array using the hardware sharing technique presented in [10], thus doubling the speed using almost the same I/O cost as in [6], [12], and [13] but with a reduced hardware complexity.
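The relation between correlation length and computing time can be illustrated with a simplified clocked model. This is a sketch under stated assumptions, not the array of Fig. 2: the function name `clocked_circular_correlation` is hypothetical, the input sample is broadcast to all cells for brevity (a real systolic array would pipeline it between neighbouring cells), and sign tags are omitted. It shows that a circular correlation of length M completes in M clock steps on M multiply-accumulate cells, so halving the length halves the step count.

```python
def clocked_circular_correlation(x, h):
    """Model M multiply-accumulate cells computing
    y[n] = sum_m x[m] * h[(m + n) % M] in M clock steps."""
    M = len(x)
    acc = [0.0] * M      # one accumulator per cell
    kernel = list(h)     # kernel value currently held by each cell
    for t in range(M):   # one clock step per input sample
        for n in range(M):   # in hardware, all cells operate concurrently
            acc[n] += x[t] * kernel[n]
        # Kernel values move to the neighbouring cell, wrapping at the end.
        kernel = kernel[1:] + kernel[:1]
    return acc
```

For instance, `clocked_circular_correlation([1, 2, 3], [4, 5, 6])` yields `[32.0, 29.0, 29.0]` after three steps.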
The systolic array that represents the hardware core of the proposed DST VLSI array is presented in Fig. 2, where we have used the following notations:
, samples of the th auxiliary output sequences; samples of the th auxiliary input sequence; signifies . Except for the control lines, the number of input and output channels in our design is independent of the transform length . This allows us to easily extend our design to large values of that are prime numbers, which cannot be achieved by the solution proposed in [15]. Due to its regularity, modularity, simplicity, and local connections, the proposed systolic array algorithm is well suited for VLSI implementation.
V. COMPARISON AND DISCUSSION
The VLSI algorithms proposed for block-orthogonal transforms can be broadly classified into the following groups:
- algorithms based on recursive computation;
- fast algorithms based on butterfly structures;
- algorithms based on direct computation through matrix decomposition;
- algorithms using cyclic convolution or circular correlation computational structures.
The algorithms from each category have specific advantages and drawbacks, such that the selection of the appropriate algorithm and implementation depends on the specific application, the speed, the cost, the I/O requirements, and the transform length. They are implemented using different implementation styles so that significant improvements can also be made at the implementation level, especially at the arithmetic operator level. Because of the hardware limitations, in practical applications, only small block-transform sizes (typically 8 × 8 or 16 × 16) are used, causing blocking artifacts. With the rapid pace of advances in VLSI technology, we can expect that it will soon be possible to transform a whole image, and not just a small block, in a VLSI structure without blocking artifacts, but most of the VLSI algorithms and architectures are not appropriately designed for such an extension of the transform-length . For large values of , the VLSI implementations of these algorithms tend to be communication bound, and there is a waste of the hardware due to the I/O bottleneck. Hence, special techniques have to be used to recast the VLSI algorithms as stated in [6] and [16] to reduce the I/O bandwidth and the number of I/O channels. In addition, for the same reason, the transform kernels are generated inside the array as in [12], at the price of an increase in the hardware complexity. Thus, the I/O problems [6] can seriously limit the applicability of a VLSI solution for practical applications, but sometimes, these problems are not taken into consideration.
The VLSI implementations based on recursive algorithms [17]-[20] are suitable for low-cost consumer products due to their compact and simple structure, but they are very slow. These structures are difficult to pipeline due to their inherent recursive structure and do not allow two-level pipelining, so that the reduction in hardware complexity is traded off against the speed performance. These structures also suffer from numerical inaccuracy and instability problems that can lead to a bit-width explosion.
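The feedback that makes recursive structures hard to pipeline can be seen in a minimal sketch of one standard recursive formulation, Clenshaw's recurrence for a sine sum. The function name `clenshaw_sine_sum` is hypothetical, and the code is a generic textbook recurrence rather than the architecture of any of the cited schemes. Each step needs the two most recent state values, so the next multiply-add cannot start before the previous one has finished.

```python
import math

def clenshaw_sine_sum(a, theta):
    """Return sum_{k=1}^{n} a[k-1] * sin(k*theta) via Clenshaw's recurrence.

    b_k = a_k + 2*cos(theta)*b_{k+1} - b_{k+2}, with b_{n+1} = b_{n+2} = 0,
    and the result is b_1 * sin(theta).
    """
    b1 = b2 = 0.0                 # hold b_{k+1} and b_{k+2}
    c = 2.0 * math.cos(theta)
    for ak in reversed(a):        # k = n, n-1, ..., 1
        # Feedback: the new state depends on the two previous states.
        b1, b2 = ak + c * b1 - b2, b1
    return b1 * math.sin(theta)
```

A single multiplier and two state registers suffice, which reflects the compact structure noted above, but the loop-carried dependency limits the achievable throughput.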
Although efforts have been made recently to improve the regularity of implementations that use fast algorithms based on butterfly structures [4], [21], [22], they are characterized by a data flow with a low degree of regularity and modularity that has a direct impact on their VLSI implementations, resulting in complex data routing or address computation, complex layout, timing, and reliability problems that can severely limit the speed performance and their expandability. They are restricted to transform lengths that are only powers of and may not be fully operational due to the I/O communication restrictions, thus causing an inefficient use of the hardware. Even though they are characterized by a small number of multipliers due to their low arithmetic complexity, they tend to be replaced by time-recursive structures in low-cost products [19], whereas for high-speed applications, they suffer from the I/O bandwidth bottleneck.
The VLSI architectures that use algorithms based on direct computation through matrix decomposition [15], [23] make use of a different approach to split the matrix-vector formulation of the trigonometric transforms into two half-length matrix-vector products. This approach does not have the same regularity feature as provided by the proposed algorithm, and the number of transform kernels is instead of , which leads to high I/O costs and/or complicated data flow. For example, in [15], the I/O problems are left unsolved, and thus, the resulting VLSI array cannot be easily extended to higher values of .
The design approaches that are based on appropriate reformulation of the transforms into cyclic convolution or circular correlation [5], [6], [13], [24], due to their regular and simple data flow, lead to simple and efficient hardware implementations with low I/O complexity, lower computational complexity, good architectural topology, and a high degree of embedded parallelism. In this paper, we have tried to further develop the advantages offered by this design technique. The implementation styles could be different, and for a chosen style, further optimizations can be made at the implementation level or even at the layout level. However, we feel that the most significant improvements can be achieved at the algorithmic level. In this paper, the systolic array paradigm has been used to illustrate the advantages offered by our approach.
In the proposed design, as compared with [14], the number of multiplications required is the same, whereas the number of processing elements has been reduced to from . The implementation style used in [14] is the distributed-arithmetic (DA) style, which cannot be easily extended for large values of because the sizes of the read-only memories (ROMs) increase exponentially with . These structures are difficult to pipeline due to the feedback used in the ROM and accumulator (RAC) structures. The proposed method doubles the speed using almost the same hardware complexity.
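The exponential ROM growth of the DA style can be made concrete with a minimal sketch of a generic distributed-arithmetic inner product. It is not the design of [14]: the function names `build_da_rom` and `da_inner_product` and the symbol K (the number of coefficients) are assumptions, and only unsigned integer inputs are handled. The lookup table needs 2^K entries, and the accumulator is updated once per bit plane, which is the feedback loop of the ROM and accumulator structure.

```python
def build_da_rom(coeffs):
    """Build a ROM with 2**K entries; entry `addr` holds the sum of the
    coefficients whose position bit is set in `addr`."""
    K = len(coeffs)
    return [sum(c for j, c in enumerate(coeffs) if (addr >> j) & 1)
            for addr in range(1 << K)]

def da_inner_product(rom, xs, bits):
    """Return sum_j coeffs[j] * xs[j] for unsigned `bits`-bit integers xs,
    using one ROM lookup and one accumulation per bit plane."""
    acc = 0
    for b in range(bits):
        addr = 0
        for j, x in enumerate(xs):
            addr |= ((x >> b) & 1) << j   # bit b of every input forms the address
        acc += rom[addr] * (1 << b)       # weight the lookup by 2**b and accumulate
    return acc
```

With three coefficients the ROM has 8 entries; with 16 coefficients it would already need 65536, which illustrates why this style is hard to extend to long transforms without partitioning the table.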
Comparing the proposed design with the one presented in [15], we see that the number of multipliers is reduced from to , where is the transform length and the number of adders from to . However, the most significant improvement is in the I/O cost. Comparing the matrix-vector product formulations of the two algorithms, it can be seen that the number of transform kernels has been significantly reduced from to . In addition, the number of the input channels is reduced from to and the number of output channels from to , where and are, respectively, the number of bits used to represent the input and output data samples. In [15], the latency time is better, but the I/O problems were left unsolved, and hence, additional latency and hardware complexity have to be taken into consideration. In addition, the throughput is more important than latency, and the initial delay can be neglected in the case of continuous streams. On the other hand, the product , where is the number and the bandwidth of the input channels, directly depends on the volume of data to be loaded into the VLSI structure, which in this case is instead of , and this is a feature of the algorithm and not of its implementation. Thus, if we try to reduce the number of the input channels by some implementation techniques, it is necessary to increase their bandwidth. Consequently, due to the pin number and bandwidth limitations, and the necessity of re-evaluating the parameters involved for different data lengths, this structure is difficult to extend for larger values of .
Compared with [6] and [13] , the speed is doubled in the proposed design, and the arithmetic complexity is reduced while maintaining the circular correlation structure with all its advantages in terms of the architectural topology, the I/O cost, and the simplicity in hardware implementation.
The time-recursive structure [20] and the systolic arrays based on Clenshaw's recurrence formula [25], [26] have one half of the throughput and do not allow two-level pipelining due to the feedback, with a comparable hardware complexity for DCT or DST. Tables I and II give comparisons of the proposed VLSI systolic array design with some of the recently reported schemes with regard to the hardware complexity, speed, and other features. In these tables, values only for DCT and/or DST are included for the case of the unified structures.
VI. CONCLUSIONS
Efficient schemes for the conversion of the discrete transforms into cyclic correlation or convolution structures are now available and have been found to be very efficient for hardware implementation using VLSI technology. In this paper, a new design approach for a systolic array implementation of the discrete sine transform, based on an efficient way to convert the DST into two circular correlation structures, has been presented. Using two parallel circular correlation structures with the same structure and length and efficiently unifying them, a substantial improvement in the processing speed can be obtained with reduced hardware complexity and low I/O cost. The improvement in the processing speed as well as the low hardware complexity has been demonstrated by comparing these features with those of the recently reported schemes. The proposed design preserves all the other advantages related to architectural topology, computational complexity, and I/O cost that are specific to systolic array implementations using circular correlation and cyclic convolution structures.
