Using a specific input-restructuring sequence, a new VLSI algorithm and architecture have been derived for a high throughput memory-based systolic array VLSI implementation of a discrete cosine transform. The proposed restructuring technique transforms the DCT algorithm into a cycle-convolution and a pseudo-cycle convolution structure as basic computational forms. The proposed solution has been specially designed to have good fixed-point error performances that have been exploited to further reduce the hardware complexity and power consumption. It leads to a ROM based VLSI kernel with good quantization properties. A parallel VLSI algorithm and architecture with a good fixed point implementation appropriate for a memory-based implementation have been obtained. The proposed algorithm can be mapped onto two linear systolic arrays with similar length and form. They can be further efficiently merged into a single array using an appropriate hardware sharing technique. A highly efficient VLSI chip can be thus obtained with appealing features as good architectural topology, processing speed, hardware complexity and I/O costs. Moreover, the proposed solution substantially reduces the hardware overhead involved by the pre-processing stage that for short length DCT consumes an important percentage of the chip area.
Introduction
The discrete cosine transform (DCT) and discrete sine transform (DST) [1] [2] [3] are key elements in many digital signal processing applications being good approximations to the statistically optimal Karhunen-Loeve transform [2, 3] . They are used especially in speech and image transform coding [3, 4] , DCT-based subband decomposition in speech and image compression [5] , or video transcoding [6] . Other important applications are: block filtering, feature extraction [7] , digital signal interpolation [8] , image resizing [9] , transform-adaptive filtering [10, 11] , and filter banks [12] .
The choice of DCT or DST depends on the statistical properties of the input signal. For low correlated input signals, DST offers a lower bit rate [3] . For high correlated input signals, DCT is a better choice.
Prime-length DCTs are critical for a prime-factor algorithm (PFA) technique to implement DCT or DST. PFA technique can be used to significantly reduce the overall complexity [13] that results in higher-speed processing and reduced hardware complexity. PFA can split a N = N 1 × N 2 point DCT, where N1 and N2 are mutually primes, into a two-dimensional N 1 × N 2 DCT. DCT is then applied for each dimension, and the results are combined through input and output permutations.
The DCT and DST are computationally intensive. The general-purpose computers usually do not meet the speed requirements for various real-time applications and also the size and power constraints for many portable systems 2 EURASIP Journal on Advances in Signal Processing although many efforts have been made to reduce the computational complexity [14, 15] . Thus, it is necessary to reformulate or to find new VLSI algorithms for these transforms. These hardware algorithms have to be designed appropriately to meet the system requirements. To meet the speed and power requirements, it is necessary to appropriately exploit the concurrency involved in such algorithms. Moreover, the VLSI algorithm and architecture have to be derived in a synergetic manner [16] .
The data movement into the VLSI algorithm plays an important role in determining the efficiency of a hardware algorithm. It is well known that FFT, which plays in important role in the software implementation of DFT, is not well suited for VLSI implementation. This is one explanation why regular computational structures such as cyclic convolution and circular correlation have been used to obtain efficient VLSI implementations [17] [18] [19] using modular and regular architectural paradigm as distributed arithmetic [20] or systolic arrays [21] . This approach leads to low I/O cost and reduced hardware complexity, high speed and a regular and modular hardware structure.
Systolic arrays are an appropriate architectural paradigm that leads to efficient VLSI implementations that meet the increased speed requirements of real-time applications through an efficient exploitation of concurrency involved in the algorithms. This paradigm is well suited to the characteristics of the VLSI technology through its modular and regular structure with local and regular interconnections.
Owing to regular and modular structure of ROMs, memory-based techniques are more and more used in the recent years. They offer also a low hardware complexity and high speed for relatively small sizes compared with the multiplier-based architectures. Multipliers in such architectures [22] [23] [24] [25] consume a lot of silicon area, and the advantages of the systolic array paradigm are not evident for such architectures. Thus, memory-based techniques lead to cost-effective and efficient VLSI implementations.
One memory-based technique is the distributed arithmetic (DA). This technique is largely used in commercial products due to its efficient computation of an innerproduct. It is faster then multiplier-based architecture due to the fact that it uses precomputed partial results [26, 27] . However, the ROM size increases exponentially with the transform length, even if a lot of work have been done to reduce its complexity as in [28] , rendering this technique impractical for large sizes. Moreover, this structure is difficult to pipeline.
The other main technique is the direct-ROM implementation of multipliers. In this case, multipliers are replaced with ROlM-based look-up table (LUT) of size 2 L where L is the word length. The 2 L pre-computed values of the product for all possible values of the input are stored in the ROM.
In [16, 19] , another memory-based technique that combines the advantages of the systolic arrays with those of memory-based designs is used. This technique will be called memory-based systolic arrays. When the systolic array paradigm is used as a design instrument, the advantages are evident only when a large number of processing elements can be integrated on the same chip. Thus, it is very important to reduce the PE chip area using small ROMs as it is proposed in this technique.
In this paper, a new parallel VLSI algorithm for a primelength Discrete Cosine Transform (DCT) is proposed. It significantly simplifies the conversion of the DCT algorithm into a cycle and a pseudo-cycle convolution using only a new single input restructuring sequence. It can be used to obtain an efficient VLSI architecture using a linear memory-based systolic array. The proposed algorithm and its associated VLSI architecture have good numerical properties that can be efficiently exploited to lead to a low-complexity hardware implementation with low power consumption. It uses a cycle and a pseudo-cycle convolution structure that can be efficiently mapped on two linear systolic arrays having the same form and length and using a small number of I/O channels placed at the two extreme ends of the array. The proposed systolic algorithm uses an appropriate parallelization method that efficiently decomposes DCT into a cycle and a pseudo-cycle convolution structure, obtaining thus a high throughput. It can be efficiently computed using two linear systolic arrays and an appropriate control structure based on the tag-control scheme. The proposed approach is based on an efficient parallelization scheme and an appropriate reordering of these sequences using the properties of the Galois Field of the indexes and the symmetry property of the cosine transform kernel. Thus, using the proposed algorithm, it is possible to obtain a VLSI architecture with advantages similar to those of systolic array and memory-based implementations using the benefit of the cycle convolution. Thus a high speed, low I/O cost, and reduced hardware complexity with a high regularity and modularity can be obtained. Moreover, the hardware overhead involved by the preprocessing stage can be substantially reduced as compared to the solution proposed in [16] .
The paper is organized as follows. In Section 2, the new computing algorithm for DCT encapsulated into the memory-based systolic array is presented. In Section 3, two examples for the particular lengths N = 7 and N = 11 are used to illustrate the proposed algorithm. In Section 4, aspects of the VLSI design using the memory-based systolic array paradigm and a discussion on the effects of a finite precision implementation are presented. In Section 5, the quantization properties of the proposed algorithm and architecture are analyzed analytically and by computer simulation. In Section 6, comparisons with similar VLSI designs are presented together with a brief discussion on the efficiency of the proposed solution. Section 7 contains the conclusions.
Systolic Algorithm for 1D DCT
For the real input sequence x(i) :
EURASIP Journal on Advances in Signal Processing
To simplify the presentation, the constant coefficient √ 2/N will be dropped from (1) that represents the definition of DCT; a multiplier will be added at the end of the VLSI array to scale the output sequence with this constant.
We will reformulate relation (1) as a parallel decomposition based on a cycle and a pseudo-cycle convolution forms using a single new input restructuring sequence as opposed to [16] where two such auxiliary input sequences were used. Further, we will use the proprieties of DCT kernel and those of the Galois Field of indexes to appropriately permute the auxiliary input and output sequences.
Thus, we will introduce an auxiliary input sequence {x a (i) : i = 0, 1, . . . , N − 1}. It can be recursively defined as follows:
Using this restructuring input sequence, we can reformulate (1) as follows:
The new auxiliary output sequence {T(k) : k = 1, 2, . . . , N −1} can be computed in parallel as a cycle and pseudocycle convolutions if the transform length N is a prime number as follows:
where
with
where x N denotes the result of x modulo N and g is the primitive root of indexes.
We have also used the properties of the Galois Field of indexes to convert the computation of DCT as a convolution form.
Examples
To illustrate our approach, we will consider two examples for 1D DCT, one with the length N = 7 and the primitive root g = 3 and the other with the length N = 11 and the primitive root g = 2.
3.1. DCT Algorithm with Length N = 7. We recursively compute the following input auxiliary sequence {x a (i) : i = 0, . . . , N − 1} as follows:
Using the auxiliary input sequence {x a (i) :
we can write (6) in a matrix-vector product form as
c (1) c(3)
EURASIP Journal on Advances in Signal Processing
where we have noted c(k) for 2 · cos(2kα). The index mappings δ(i) and γ(i) realize a partition into two groups of the permutation of indexes {1, 2, 3, 4, 5, 6}. They are defined as follows:
The functions ξ(k, i) and ζ(i) define the sign of terms in (10), respectively, as follows: Finally, the output sequence {X(k) : k = 1, 2, . . . , N − 1} can be computed as
DCT Algorithm with
Length N = 11. We recursively compute the following input auxiliary sequence {x a (i) : i = 0, . . . , N − 1} as follows:
T (4) T (8) T (6) T(10) (4) c (3) c (5) c (1) c(2)
T (7) T (3) T (5) T(1) 
The functions ξ(k, i) and ζ(i) define the sign of terms in (10) and (11) Finally, the output sequence {X(k) : k = 1, 2, . . . , N − 1} can be computed as follows:
Hardware Realization of the VLSI Algorithm
In order to obtain the VLSI architecture for the proposed algorithm, we can use the data-dependence graph-method (DDG). Using the recursive form of (6), we have obtained the data dependence graph of the proposed algorithm. The data dependence graph represents the main instrument in our design procedure that clearly puts into evidence the main elements involved in the proposed algorithm. Using this method, we can map the proposed VLSI algorithm into two linear systolic arrays. Then, a hardware sharing method to unify the two systolic arrays into a single one with a reduced complexity can be obtained as shown in Figure 1 . Using a linear systolic array, it is possible to keep all I/O channels at the boundary PEs. In order to do this, we can use a tag based control scheme. The control signals are also used to select the correct sign in the operations executed by PEs. The PEs from the cycle and pseudo-cycle convolution modules, that represent the hardware core of the VLSI architecture, execute the operations from relation (6) .
The structure of the processing elements PEs is presented in Figures 2(a) and 2(b) .
Due to the fact that (6) have the same form and the multiplications can be done with the same constant in each processing element, we can implement these multiplications with only one biport ROM having a dimension of 2 L/2 words as can be seen in Figures 2(a) and 2(b) .
The function of the processing element is shown in Figures 3(a) and 3(b) .
The bi-port ROM serves as a look-up table for all possible values of the product between the specific constant and a half shuffle number formed from the bits of the input value. The two partial values are added together to form the result of the multiplication. One of the partial sums is hardware shifted with one position before to be added as shown in Figures 2(a) and 2(b) . This operation is hardwired in a very simple manner and introduces no delay in the computation of the product. These bi-port ROMs are used to significantly reduce the hardware complexity of the proposed solution at about a half. The two results of the product are one after the other added with y 1i and y 2i to form the two results of each processing element. The control tag tc appropriately select the input values that have to be apply to the biport ROM. The sign of the input values are selected using the control tags tc1 as shown in Figures 2(a) and 2(b) . Excepting the adder at the end of bi-port ROM, all the other adders are carry-ripple adders that are slow and consume less area. As can be seen in Figure 2 , the processing element is implemented as four pipeline stages. The clock cycle T is determined by max(T Mem , T A ). The actual value of the cycle period is determined by the value of the word length L and the implementation style for ROM and adders.
In order to implement (13) and to obtain the output sequence in the natural order a postprocessing stage has to be included as shown in Figure 4 . The postprocessing stage contains also a permutation block. It consists of a multiplexer and some latches and can permute the auxiliary output sequence in a fully pipelined mode. Thus, the I/O data permutations are realized in such a manner that there is no time delay between the current and the next block of data.
The preprocessing stage is used to obtain the appropriate form and order for the auxiliary input sequences. It has been introduced to implement (3) and (4) and to appropriately permute the auxiliary input sequence. The preprocessing stage contains an addition module that implements (4) followed by a permutation one. Thus, the input sequence is processed and permuted in order to generate the required combination of data operands.
Discussion on Quantization Errors of a Fixed-Point Implementation
The proposed algorithm and its associated VLSI implementation have good numeric properties as it results from Figures  10 and 11 . In our analysis, we have compared the numerical properties of our solution for a DCT VLSI implementation 6 EURASIP Journal on Advances in Signal Processing 
x a 1 (6.1) x a 1 (2, 5)
x a 1 (6 1)
x a 1 (3, 4)
x a 3 (3, 4)
x a 2 (3, 4)
x a 2 (6, 1) x a 2 (2, 5) 
The structure of the first processing element PE in Figure 1 . The structure of the other processing elements PE in Figure 1 . 
EURASIP Journal on Advances in Signal Processing
with lengths N = 7 and N = 11 with those of the algorithm proposed by Hou [29] and with a direct-form implementation [30] .
Fixed-Point Quantization Error Analysis.
We will analyze the fixed-point error for the kernel of our architecture represented by the VLSI implementation of (6). This part contributes decisively to the hardware complexity and the power consumption of the VLSI implementation of the DCT. We will show analytically and by computer simulation that it has good quantization properties that can be exploited to further reduce the hardware complexity and the power consumption of our implementation. We can write (6) in a generic form
Let
where u(i − k) is the fixed-point representation of the input data and Δu(i − k) is the error between the actual value and its fixed-point representation. Thus
We suppose that the process that governs the errors is linear and, thus, we can utilize the superposition property. Thus, (20) become
We can write
The sums (22) for all combinations of {u 0 , u 1 , . . . , u L−1 } are computed using a floating-point representation for coefficients cos(ψ(i) × 2α); then, the result is truncated and stored in a ROM. Thus, we can use the following error model for the constant multiplication (22) , where u is the quantization of the input and Q(·) is the truncation operator We can write
where: Δ u cos(i) is the truncation error. Thus, we can write
We will compute the second-order statistics σ 2 T of the error term. This parameter describes the average behavior of the error and is related to MSE (mean-squared error) and SNR (signal-to-noise ratio).
We assume that the errors are uncorrelated and with zero mean.
We have cos (i) It results
It can be easily seen that
We can assume that
We will verify the relation (26) by computer simulation using SNR parameter [31] . In performing the fixed-point round-off error analysis, we will use the following assumptions:
(i) the input sequence x(i) is quantized using L bits, (ii) the output of each ROM-based multiplier is quantized using M bits, (iii) the errors are uncorrelated with one another and with the input, (iv) the input sequence x(i) is uniformly distributed between (−1, 1) with zero mean, (v) the round-off errors at each multiplier is uniformly distributed with zero mean.
The SNR parameter is computed as
where σ 2 O is the variance of the output sequence and σ
2
T is the variance of the quantization error at the output of the transform.
Using the graphic representation shown in Figures 7-9 , we can see that the computed values agree with those obtained from simulations. Thus, the computed values of SNR have similar values with those computed using relations (26) and (29) represented using snrDfcT plots. In Figure 7 , we have shown the dependence of the SNR in our solution on the transform length N = 7 and N = 11 as a function of the number of bits L and M used in the quantization of the input sequence and the output of each ROM-based multiplier, respectively, when M = L. In Figure 8 we show the dependence of the SNR for our solution with N = 7 as a function of L, when M = 10, 12, and 14, respectively, and in Figure 9 , we have the same dependence for our solution with the transform length N = 11.
Thus, we have obtained analytically and we have verified by simulation the dependence of the variance of the output round-off error with the quantized error of the input sequence σ 2 Δx . This is a significant result especially for our architecture as we can choose appropriately the value of L with L < M. It can be used to significantly reduce the hardware complexity of our implementation as the dimension of the ROM used in the implementation of a multiplier in our architecture is given by M · 2 L and increases exponentially with the number of bits L used to quantize the input u(i) for each multiplier. It can be seen that if L > M the improvement of the SNR is insignificant. Using these dependences, we can easily chose L significantly less than M.
Using the method proposed in [30] , where the analysis is made for the direct-form IDCT, we can also obtain for the direct-form DCT, the round-off error variance (σ 2
As compared with a direct-form method implementation, known to be robust to the fixed-point implementation and thus used by many chip manufactures [30] , the round-off error variance σ 2 T for the kernel of our solution given by relation (26) is significantly better as we will also see by computer simulations.
Note that in [30] the analysis is made for direct-form IDCT but it is similar for the direct-form DCT. In the case we are using hardwired multipliers instead of LUT based multipliers, there is another error term due to the quantization of multiplier coefficients cos(ψ(i) × 2α) of the form
where [Δ cos(ψ(i) × 2α)] is the quantization error of the multiplier coefficient. Thus, the quantization error will be significantly greater for a similar hardwired multiplier based architecture for DCT reported in [32] . It follows that the proposed ROM-based architecture will have better numerical properties as compared with similar hardwired multiplier based architectures for DCT. Let us also observe that instead of the quantization of the result of relation (22), we can store in the ROMs the rounded value of that result. The rounding error will be less than the truncation error. Thus, the numerical properties of the proposed solution can be further increased.
This result shows that the kernel of our implementation based on (6) has good quantization properties, the error due to a fixed-point representation being small, one of the best results for DCT implementations as will be shown also using computer simulations. Thus, our proposed algorithm and architecture is a very robust solution for a fixed point implementation of DCT. This property can be exploited to further reduce the hardware complexity and the power consumption of the main part of our architecture represented by the abovementioned kernel. This kernel has the main contribution to the hardware complexity of our architecture and to its power consumption.
Comparison with Other Relevant Implementations of DCT.
In our computer simulations in order to demonstrate the good quantization properties of the proposed solution in comparison with some relevant other ones, we have used several significant parameters as PSNR (peak-signal-to-noise ratio) and the following measures:
(i) overall MSE defined as
(ii) peak error defined as
The numbers for the root of MSE and the value of PPE for different values of the word length L when L = M are represented in the Figures 10 and 11 .
The values of PSNR are presented in Figure 12 in dB for different values of the word length L when M = L. It can be easily seen that the values for PSNR are better for our algorithm than those reported in Hou [29] and for the direct form. The obtained numerical properties of the proposed algorithm and its associated VLSI architecture can be exploited to significantly decrease the hardware complexity. Thus, in the proposed architecture, the dimension of the ROMs used to implement the multipliers depends exponentially on the word length L used for a fixed-point representation of the input operands. Consequently, we can use small-size ROMs with a short access time to reduce the hardware complexity and to improve the speed. It results that the overall hardware complexity will be significantly reduced due to these good numerical properties and also the power consumption.
This improvement is explained besides the inner properties of the proposed algorithm by two design decisions that will be explained below. First, we can increase the precision used in representing the data in relations (3) and (4) without significantly increasing the hardware complexity.
Then, we can use a high precision for the coefficients c(i) in the computing of the partial results stored in the ROMs that implement the multiplication with this coefficients.
Comparison
The throughput of the VLSI architecture that can be obtained using the proposed algorithm is 2/(N − 1) similar to [16, 33] , but doubled compared to [34] . The pipeline period (average computation time) is (N − 1)T/2 as in [16, 33] but the clock period T is significant different in these three cases. In the proposed design T = max(T Mem , T a ) is significantly less than in [16] , where T = T Mem + T a and in [33] , where T = T Mul + 3T a and [34] , where
We have used the following notations: T Mem -multiplication time, T a -adder time, and T Mem -access time. Thus, the proposed design is significantly faster.
If we compare with [24] , we can see that (10) can be computed in parallel and thus the throughput can be doubled with respect to [24] , where we do not have such a parallelization. Moreover, due to the fact that the two equations have a similar form and the same length, they can be mapped on the same linear systolic array with the ROMs used to implement the shared multipliers. Thus, a significant increase of the throughput can be obtained without doubling the hardware as compared with [24] . The number of control bits is significantly reduced for a double computation (10) as compared with [24] .
In order to illustrate the advantages of our proposal, we will consider a numerical example with the length N = 37 and the number of bits for the input of multipliers L = 10. In this case, the number of multipliers for our proposed solution [16, 34] is very small being 2 and 1. Instead, our proposed solution uses 576 words of ROM, [16] uses 1152 words, [28] uses 22 528 words, and [34] have 9 473 000 words. The number of adders is 57 for our solution, 77 for [16] , 54 for [33] , 107 for [34] , and only 20 for [24] . But [24, 34] have only a half of the throughput of [16, 33] and our solution. The number of bits for I/O channels is 88 for our proposed solution, 107 for [24] , and 71 for [16] . In [33, 34] , we have only 41, respectively, 20 bits for I/O channels.
The comparison between our solution and other VLSI implementations for DCT is presented in Table 1 .
Comparing the hardware complexity, we can see from Table 1 that the number of ROM words is half in our design as compared with [16] and significantly less than in [34] [24] (N − 1)/2 + 2 (N − 1)/2 + 2 2(N + 1) T mul + T a 1/(N − 1) (N − 1)T 7L + N for long length N. In Figure 13 , we have illustrated the variation of the number of ROM words with respect to the transform length N when L = 10. Also, in Figure 14 , we have represented the number of bits used by I/O channels with respect to the transform length N when also L = 12. Note that L for other design could be larger as our solution has better quantization properties. It can be seen that the number of bits for the I/O channels in our solution increases slightly with the transform length N. It is also another important aspect to note. The hardware complexity of the preprocessing stage is significantly reduced at about one half in our design as compared with [16] as shown in Figure 5 . For relatively, short-length transforms the chip area used by the preprocessing stage represents an important percentage. Also, the hardware complexity of our design of 2 multipliers, 3(N + 1)/2 adders and (N − 1)/2 · 2 L ROM words is significantly reduced as compared with (N − 1)/2 multipliers, 3(N − 1)/2 adders in [33] .
Conclusions
In this paper, a new memory-based design approach that leads to a reduced hardware complexity and a highthroughput VLSI implementation based on a new reformulation of DCT having good quantization properties is presented. It uses a parallel VLSI algorithm using parallel cycle and pseudo-cycle convolutions for a memory-based VLSI systolic array implementation. This approach using a new input-restructuring sequence leads to an efficient VLSI implementation with a substantially reduction of the hardware overhead involved by the preprocessing stage of the VLSI array. Moreover, the proposed VLSI algorithm and its associated architecture have good numerical properties that can be efficiently exploited to further reduce the hardware complexity and to obtain a low power implementation. We have shown analytically and by computer simulations that the proposed solution has one of the best quantization property for DCT that is better than that of the direct-method implementation known to be robust and frequently used in commercial products. The convolution structures can be efficiently implemented using a memory-based systolic array architecture paradigm. The differences in the sign can be efficiently managed using a tag-control scheme. Also, the proposed ROM-based implementation has better numerical properties compared to similar hardwired multiplier-based DCT implementations. It can thus be obtained a new VLSI implementation with a high degree of parallelism and good architectural topology with a high degree of regularity and modularity and an efficient fixed-point implementation that is well adapted for a VLSI realization. Thus, a new memorybased VLSI systolic array with a high-throughput and a substantially reduced hardware complexity can be obtained.
