Abstract-The problems of unified efficient computation of the discrete cosine transform (DCT), discrete sine transform (DST), discrete Hartley transform (DHT), and their inverse transforms are considered. In particular, a new scheme employing the time-recursive approach to compute these transforms is presented. Using this approach, unified parallel lattice structures that can dually generate the DCT and DST simultaneously, as well as the DHT, are developed. These structures can obtain the transformed data for sequential input time-recursively with a throughput rate of one per clock cycle, and the total number of multipliers required is a linear function of the transform size N. Furthermore, there is no constraint on N. The resulting architectures are regular, modular, and without global communication, so they are very suitable for VLSI implementation for high-speed applications such as ISDN networks and HDTV systems. It is also shown in this paper that the DCT, DST, DHT, and their inverse transforms share an almost identical lattice structure. The lattice structures can also be formulated into prelattice and postlattice realizations. Two methods, the SISO and double-lattice approaches, are developed to reduce the number of multipliers in the parallel lattice structure by 2N and N, respectively. The tradeoff between time and area for block data processing is also considered. The filter bank interpretation of the time-recursive sinusoidal transforms is also discussed.
I. INTRODUCTION
TRANSFORM coding has found many applications in image, speech, and digital signal transmission and processing. Due to the advances in ISDN networks and high definition television (HDTV) technology, high speed transmission of digital video signals has become very desirable. Among the many transforms, the discrete cosine transform (DCT), discrete sine transform (DST), and discrete Hartley transform (DHT) are very effective in transform coding applications to digital signals, such as speech and image signals. The DCT is the most widely used transform in speech and image processing for data compression. This is due to its better energy compaction property and its near optimal performance, which is closest to that of the Karhunen-Loeve transform (KLT) among many discrete transforms for highly correlated signals, especially for the first-order Markov process [1]-[3]. It was shown by Jain that the performance of the DST approaches that of the KLT for a first-order Markov sequence with given boundary conditions, especially for signals with low correlation coefficients [4], [5]. In 1983, Bracewell introduced the DHT [6], which uses a transform kernel similar to that of the discrete Fourier transform (DFT), except that it is a real-valued transform. Therefore, it is simpler than the DFT with respect to computational complexity [7]. Like the DCT and DST, the DHT has found many applications in signal and image processing [6], [8], [24], [28].
Since the DCT was introduced, many algorithms have been proposed to improve the computation speed and to reduce the hardware complexity. These algorithms can be classified into the following categories: 1) indirect computation, 2) matrix factorization, 3) recursive computation, and 4) systolic structure implementation. The indirect computation [9], [10]-[13] applies the existing fast algorithms for the DFT or the Walsh-Hadamard transform to the DCT. It is not particularly efficient because the inherent properties of the DCT are not exploited. The matrix factorization [14], [15], [25], [26] decomposes the DCT into multiplications of many sparse matrices, so the numbers of multiplications and additions can be substantially reduced. The recursive computations [16], [7] calculate higher order DCT coefficients from lower order ones, but their signal flow architectures need global communication, which is not suitable for VLSI implementation. By using the recursive properties effectively, this kind of DCT algorithm has fewer multipliers and adders, while additional multiplexers are required. As for the systolic structure implementation [17], [18], [27], it uses existing systolic architectures for the DFT or other transforms to implement the DCT in a systolic manner. But some of the methods require that the number of samples of the signal be decomposable into mutually prime numbers. Like the DCT, many fast algorithms have been proposed to improve the performance of the DST and DHT [8], [19], [20], [4], [5]. Basically, they can be classified in the same ways as those of the DCT, and similar advantages and disadvantages can also be found.
In this paper, we propose unified time-recursive lattice structures that can be used for the discrete orthogonal transforms mentioned above, i.e., the DCT, DST, and DHT. We consider the orthogonal transforms from a time-recursive point of view instead of the whole block of data. We do so because in digital signal transmission, data arrive serially. Also, many operations such as filtering and coding are done in a time-recursive way. Based on this approach, the resulting architectures are almost identical for the DCT, DST, and DHT, and their inverses. Our structures decouple the transformed data components; hence, there is no global communication needed. Besides, the number of multipliers in these structures is a linear function of N, so they require fewer multipliers than most other algorithms when N is large. Therefore, our architectures are very suitable for VLSI implementation. One of the important characteristics of these structures is that the transform size N can be any integer, which is not the case for most of the fast algorithms for discrete transforms, which do have certain constraints on N. Another important result is that, based on the time-recursive approach, the dual generation properties of the DCT, DST, and DHT, as well as some related inverse transforms, can be obtained.
1053-587X/93$03.00 © 1993 IEEE. IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 41, NO. 3, MARCH 1993
The rest of the paper is organized as follows. In Section II, the dual generation of lattice structures for the DCT and DST with the time-recursive approach is considered. The inverse discrete cosine transform (IDCT) and the inverse discrete sine transform (IDST) based on the lattice structures are discussed in Section III. In Section IV, the time-recursive lattice structure for the DHT is presented. All the above time-recursive properties are derived by updating the time index by one. With block data processing, the time index is updated by more than one. The detailed effects and results of the block data processing are discussed in Section V. Denormalized methods to reduce the number of multipliers in those lattice structures are considered in Section VI. Then we compare these kinds of lattice structures with other architectures in terms of the number of multipliers and adders in Section VII. The synthesis bank structures based on the time-recursive concept are discussed in Section VIII. Finally, we give the conclusion in Section IX.
II. DUAL GENERATION OF DCT AND DST
We will show an efficient implementation of the DCT from the time-recursive point of view as an alternative to finding fast algorithms through matrix factorizations or converting the DCT to the DFT, which can be implemented on various existing architectures. Focusing on the sequence instead of the block of input data, we can obtain not only the time-recursive relation between the DCT of two successive data sequences, but also the fundamental relation between the DCT and DST. In the following, the time-recursive relation for the DCT will be considered first.
A. Time-Recursive Discrete Cosine Transform
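The one-dimensional (1-D) DCT of a sequential input data starting from x(t) and ending with x(t + N - 1) is defined as

$$X_c(k,t) = \frac{2C(k)}{N}\sum_{n=t}^{t+N-1} x(n)\cos\left[\frac{\pi(2(n-t)+1)k}{2N}\right], \quad k = 0, 1, \ldots, N-1 \tag{1}$$

where

$$C(k) = \begin{cases} 1/\sqrt{2} & \text{if } k = 0 \\ 1 & \text{otherwise.} \end{cases}$$

Here the time index t in $X_c(k,t)$ denotes that the transform starts from x(t). Since the function C(k) has a different value only when k = 0, we can consider those cases that C(k)'s equal one (i.e., k = 1, 2, ..., N - 1) first and reexamine the case for k = 0 later on. In transmission systems data arrive serially; therefore we are interested in the 1-D DCT of the next input data vector [x(t + 1), x(t + 2), ..., x(t + N)]. From the definition, it is given by

$$X_c(k,t+1) = \frac{2}{N}\sum_{n=t+1}^{t+N} x(n)\cos\left[\frac{\pi(2(n-t-1)+1)k}{2N}\right].$$

This can be rewritten as

$$X_c(k,t+1) = \bar{X}_c(k,t+1)\cos\frac{\pi k}{N} + \bar{X}_s(k,t+1)\sin\frac{\pi k}{N}$$

where

$$\bar{X}_c(k,t+1) = \frac{2}{N}\sum_{n=t+1}^{t+N} x(n)\cos\left[\frac{\pi(2(n-t)+1)k}{2N}\right]$$

and

$$\bar{X}_s(k,t+1) = \frac{2}{N}\sum_{n=t+1}^{t+N} x(n)\sin\left[\frac{\pi(2(n-t)+1)k}{2N}\right].$$

The DST of the same input data is defined as

$$X_s(k,t) = \frac{2D(k)}{N}\sum_{n=t}^{t+N-1} x(n)\sin\left[\frac{\pi(2(n-t)+1)k}{2N}\right], \quad k = 1, 2, \ldots, N$$

where D(k) = $1/\sqrt{2}$ if k = N and D(k) = 1 otherwise. Note that the range of k is from 1 to N. Again, we consider those cases that D(k)'s equal one first, i.e., k = 1, 2, ..., N - 1. The DST of the time update sequence [x(t + 1), x(t + 2), ..., x(t + N)] is

$$X_s(k,t+1) = \bar{X}_s(k,t+1)\cos\frac{\pi k}{N} - \bar{X}_c(k,t+1)\sin\frac{\pi k}{N}.$$

Here the terms $\bar{X}_c(k,t+1)$ and $\bar{X}_s(k,t+1)$ appear in both the DCT and the DST updates, which leads to the lattice structure shown in Fig. 1. The next step is to update $\bar{X}_c(k,t+1)$ and $\bar{X}_s(k,t+1)$ from the previous transforms $X_c(k,t)$ and $X_s(k,t)$. We notice that $X_c(k,t)$ and $\bar{X}_c(k,t+1)$ have similar terms except the old datum x(t) and the incoming new datum x(t + N). Therefore $\bar{X}_c(k,t+1)$ and $\bar{X}_s(k,t+1)$ can be obtained by deleting the term associated with the old datum x(t) and adding the term associated with the new datum x(t + N):

$$\bar{X}_c(k,t+1) = X_c(k,t) + \frac{2}{N}\left[-x(t) + (-1)^k x(t+N)\right]\cos\frac{\pi k}{2N}$$

$$\bar{X}_s(k,t+1) = X_s(k,t) + \frac{2}{N}\left[-x(t) + (-1)^k x(t+N)\right]\sin\frac{\pi k}{2N}.$$

From these relations, the new transforms in terms of $X_c(k,t)$ and $X_s(k,t)$ are given by

$$X_c(k,t+1) = \left\{X_c(k,t) + \frac{2}{N}\left[-x(t) + (-1)^k x(t+N)\right]\cos\frac{\pi k}{2N}\right\}\cos\frac{\pi k}{N} + \left\{X_s(k,t) + \frac{2}{N}\left[-x(t) + (-1)^k x(t+N)\right]\sin\frac{\pi k}{2N}\right\}\sin\frac{\pi k}{N}$$

$$X_s(k,t+1) = \left\{X_s(k,t) + \frac{2}{N}\left[-x(t) + (-1)^k x(t+N)\right]\sin\frac{\pi k}{2N}\right\}\cos\frac{\pi k}{N} - \left\{X_c(k,t) + \frac{2}{N}\left[-x(t) + (-1)^k x(t+N)\right]\cos\frac{\pi k}{2N}\right\}\sin\frac{\pi k}{N}.$$

The relation of $X_c(0,t+1)$ with the old transformed data $X_c(0,t)$ is

$$X_c(0,t+1) = X_c(0,t) + \frac{\sqrt{2}}{N}\left[-x(t) + x(t+N)\right].$$

And, the time-recursive relation between the new transform $X_s(N,t+1)$ and the previous transform $X_s(N,t)$ is

$$X_s(N,t+1) = -X_s(N,t) + \frac{\sqrt{2}}{N}\left[x(t) - (-1)^N x(t+N)\right].$$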
The complete time-recursive lattice module for k = 1, 2, ..., N - 1 is shown in Fig. 2. It consists of an (N + 1)-stage shift register and a normalized digital filter performing the plane rotation. The multiplications in the plane rotation reduce to additions and subtractions for k = 0 in the DCT and k = N in the DST; these two cases can be simplified and combined as shown in Fig. 3.
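The one-step update performed by a single lattice module can be checked numerically. The following is a minimal sketch, assuming the DCT and DST normalization 2/N with C(k) = D(k) = 1 for k = 1, ..., N - 1; the function names `dct_dst_direct` and `lattice_step` are hypothetical and are not taken from the paper.

```python
import math

def dct_dst_direct(x, t, k, N):
    """Evaluate X_c(k, t) and X_s(k, t) directly from the definitions.

    Valid for 1 <= k <= N-1, where C(k) = D(k) = 1 (assumed normalization 2/N).
    """
    xc = sum(x[n] * math.cos(math.pi * (2 * (n - t) + 1) * k / (2 * N))
             for n in range(t, t + N))
    xs = sum(x[n] * math.sin(math.pi * (2 * (n - t) + 1) * k / (2 * N))
             for n in range(t, t + N))
    return 2.0 * xc / N, 2.0 * xs / N

def lattice_step(Xc, Xs, x_old, x_new, k, N):
    """One time update of the k-th lattice module (1 <= k <= N-1).

    The input stage removes the old datum and adds the new one; the
    plane rotation by pi*k/N then yields the transforms at time t+1.
    """
    e = (2.0 / N) * (-x_old + (-1) ** k * x_new)
    bar_c = Xc + e * math.cos(math.pi * k / (2 * N))
    bar_s = Xs + e * math.sin(math.pi * k / (2 * N))
    c, s = math.cos(math.pi * k / N), math.sin(math.pi * k / N)
    return bar_c * c + bar_s * s, bar_s * c - bar_c * s

# Usage: N = 7 is deliberately neither a power of 2 nor otherwise special.
N, k = 7, 3
x = [math.sin(0.7 * n) + 0.1 * n for n in range(N + 3)]
Xc0, Xs0 = dct_dst_direct(x, 0, k, N)
Xc1, Xs1 = lattice_step(Xc0, Xs0, x[0], x[N], k, N)
```

Here `Xc1` and `Xs1` should agree with `dct_dst_direct(x, 1, k, N)` up to floating-point roundoff.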
The following illustrates how this dually generated DCT and DST lattice structure works to obtain the DCT and DST with length N of a series of input data [x(t), x(t + 1), ..., x(t + N - 1)]. With the shift register and the transformed signals initialized to zero, the input data are fed into the lattice one sample per clock cycle, and the transformed data of the input vector are available at the outputs after N clock cycles. A parallel lattice array consisting of N lattice modules can be used for parallel computation, and it improves the computational speed drastically, as shown in Fig. 4. Here we have seen that the transform domain data X(k, t) have been decomposed into N disjoint components that have the same lattice modules with different multiplier coefficients in them. In this case the total computational delay time is N clock cycles. It is important to notice that when the next input datum x(t + N) arrives, the transformed data of the input data vector [x(t + 1), x(t + 2), ..., x(t + N)] can be obtained immediately.
Likewise, it takes only one clock cycle to generate the transformed data of subsequent inputs. That is, the latency and throughput of this parallel system are N and 1, respectively.
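The latency-N, throughput-1 behavior described above can be sketched in software. This is an illustrative model only, assuming zero initial state and the 2C(k)/N and 2D(k)/N normalizations; the class `ParallelLattice` and the helper `direct_dct_dst` are hypothetical names introduced for this sketch.

```python
import math
from collections import deque

class ParallelLattice:
    """Software model of N lattice modules fed by one shift register."""

    def __init__(self, N):
        self.N = N
        self.reg = deque([0.0] * N)   # the N most recent input samples
        self.Xc = [0.0] * N           # X_c(k), k = 0..N-1
        self.Xs = [0.0] * (N + 1)     # X_s(k), k = 1..N (index 0 unused)

    def push(self, x_new):
        """Shift in one sample and update all modules (one clock cycle)."""
        N = self.N
        x_old = self.reg.popleft()
        self.reg.append(x_new)
        r = math.sqrt(2.0) / N
        # Simplified modules: k = 0 for the DCT and k = N for the DST.
        self.Xc[0] += r * (x_new - x_old)
        self.Xs[N] = -self.Xs[N] + r * (x_old - (-1) ** N * x_new)
        for k in range(1, N):
            e = (2.0 / N) * (-x_old + (-1) ** k * x_new)
            bar_c = self.Xc[k] + e * math.cos(math.pi * k / (2 * N))
            bar_s = self.Xs[k] + e * math.sin(math.pi * k / (2 * N))
            c, s = math.cos(math.pi * k / N), math.sin(math.pi * k / N)
            self.Xc[k] = bar_c * c + bar_s * s
            self.Xs[k] = bar_s * c - bar_c * s
        return list(self.Xc), self.Xs[1:]

def direct_dct_dst(w):
    """Reference DCT (k = 0..N-1) and DST (k = 1..N) of one window w."""
    N = len(w)
    half = 1.0 / math.sqrt(2.0)
    Xc = [(2.0 * (half if k == 0 else 1.0) / N)
          * sum(w[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * N)) for i in range(N))
          for k in range(N)]
    Xs = [(2.0 * (half if k == N else 1.0) / N)
          * sum(w[i] * math.sin(math.pi * (2 * i + 1) * k / (2 * N)) for i in range(N))
          for k in range(1, N + 1)]
    return Xc, Xs

# Usage: after the first N pushes, every push yields a complete new transform.
lattice = ParallelLattice(6)
samples = [math.cos(0.3 * n * n) for n in range(11)]
outputs = [lattice.push(v) for v in samples]
```

After the first N = 6 pushes, each entry of `outputs` should match `direct_dct_dst` applied to the most recent six samples.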
It is obvious that this lattice structure is quite different from the signal flow graph realization obtained from the fast DCT algorithms [14], [15]. Since there is no global communication and the structure is modular and regular, it is suitable for practical VLSI implementation. The most interesting result is that this architecture can be applied to any value of N. From this point of view, it is more attractive than existing algorithms. In fact, most algorithms [21], [18] are limited to sequence lengths N which either must be a power of 2 or must be decomposable into mutually prime numbers. In addition, this lattice structure reveals some interesting properties of the DCT and DST, i.e., the DCT and DST can be generated simultaneously. The DCT is near optimal with respect to the KLT for highly correlated signals, while the DST approaches the KLT for signals with low correlation coefficients. As we are able to obtain the DCT and DST at the same time, this lattice structure is very useful especially when we do not know the statistics of the incoming signal. Furthermore, we can use a single lattice module with only 6 multipliers and 5 adders to recursively compute any N-point DCT and DST simultaneously. To obtain the transformed data in parallel, we need N lattice modules. As mentioned before, it is suitable for VLSI implementation since all the modules have the same structure except the 0th module, which can be simplified as shown in Fig. 3. This parallel lattice structure requires 6N - 4 multipliers and 5N - 1 adders.
III. INVERSE DISCRETE COSINE TRANSFORM (IDCT) AND INVERSE DISCRETE SINE TRANSFORM (IDST)
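A. Time-Recursive IDCT
According to the definition of the DCT in (1), the IDCT for the transform domain sequence [X(t), X(t + 1), ..., X(t + N - 1)] is

$$x_c(n,t) = \sum_{k=t}^{t+N-1} C(k-t)\,X(k)\cos\left[\frac{\pi(2n+1)(k-t)}{2N}\right], \quad n = 0, 1, \ldots, N-1. \tag{16}$$

The coefficients C(k)'s are given in (1). From the time-recursive point of view, the IDCT of the new sequence [X(t + 1), X(t + 2), ..., X(t + N)] is

$$x_c(n,t+1) = \sum_{k=t+1}^{t+N} C(k-t-1)\,X(k)\cos\left[\frac{\pi(2n+1)(k-t-1)}{2N}\right]. \tag{17}$$

This can be rewritten as

$$x_c(n,t+1) = \bar{x}_c(n,t+1)\cos\frac{\pi(2n+1)}{2N} + \bar{x}_{as}(n,t+1)\sin\frac{\pi(2n+1)}{2N} \tag{18}$$

where

$$\bar{x}_c(n,t+1) = \sum_{k=t+1}^{t+N} C(k-t-1)\,X(k)\cos\left[\frac{\pi(2n+1)(k-t)}{2N}\right] \tag{19}$$

and

$$\bar{x}_{as}(n,t+1) = \sum_{k=t+1}^{t+N} C(k-t-1)\,X(k)\sin\left[\frac{\pi(2n+1)(k-t)}{2N}\right]. \tag{20}$$

In order to have a dually generated pair for the IDCT given in (16), we define the auxiliary inverse discrete sine transform (AIDST) as

$$x_{as}(n,t) = \sum_{k=t}^{t+N-1} C(k-t)\,X(k)\sin\left[\frac{\pi(2n+1)(k-t)}{2N}\right], \quad n = 0, 1, \ldots, N-1. \tag{21}$$

Although this definition utilizes the same sine functions as the transform kernel, it is not the inverse transform of the DST. To differentiate it from the IDST, we call this the AIDST. Compared to the IDST defined in (26), we observe that the AIDST has the special coefficient C(0) = $1/\sqrt{2}$ associated with the first term, while the IDST has it with the last term. The AIDST for the data sequence [X(t + 1), X(t + 2), ..., X(t + N)] can be written as

$$x_{as}(n,t+1) = \sum_{k=t+1}^{t+N} C(k-t-1)\,X(k)\sin\left[\frac{\pi(2n+1)(k-t-1)}{2N}\right]. \tag{22}$$

By using the trigonometric function expansions, $x_{as}(n,t+1)$ becomes

$$x_{as}(n,t+1) = \bar{x}_{as}(n,t+1)\cos\frac{\pi(2n+1)}{2N} - \bar{x}_c(n,t+1)\sin\frac{\pi(2n+1)}{2N}. \tag{23}$$

1) Lattice Structure for IDCT: Combining (18) and (23), we observe that the IDCT and AIDST can be generated in exactly the same way as the dual generation of the DCT and DST. Therefore, the lattice structure in Fig. 1 can be applied here except that the coefficients must be modified. Since the coefficients C(k)'s are inside the expression in the inverse transform, the relation between $x_c(n,t)$ and $\bar{x}_c(n,t+1)$ will be different from what we have in the DCT. Equations (16) and (19) as well as (20) and (21) have the same terms for k ∈ {t + 2, t + 3, ..., t + N - 1}. After adding the effects of the terms for k = t, k = t + 1, and k = t + N, we obtain

$$\bar{x}_c(n,t+1) = x_c(n,t) - \frac{1}{\sqrt{2}}X(t) + \left(\frac{1}{\sqrt{2}} - 1\right)X(t+1)\cos\frac{\pi(2n+1)}{2N}$$

and

$$\bar{x}_{as}(n,t+1) = x_{as}(n,t) + (-1)^n X(t+N) + \left(\frac{1}{\sqrt{2}} - 1\right)X(t+1)\sin\frac{\pi(2n+1)}{2N}.$$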
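The dual generation of the IDCT and AIDST can be verified numerically in the same spirit as the forward case. The sketch below assumes the definitions with C(0) = 1/sqrt(2) applied to the first term of each window; `idct_aidst_direct` and `idct_lattice_step` are hypothetical names used only for illustration.

```python
import math

HALF = 1.0 / math.sqrt(2.0)

def idct_aidst_direct(X, t, n, N):
    """Directly evaluate x_c(n, t) and x_as(n, t) for the window X[t:t+N]."""
    th = math.pi * (2 * n + 1) / (2 * N)
    xc = sum((HALF if k == t else 1.0) * X[k] * math.cos(th * (k - t))
             for k in range(t, t + N))
    xas = sum((HALF if k == t else 1.0) * X[k] * math.sin(th * (k - t))
              for k in range(t, t + N))
    return xc, xas

def idct_lattice_step(xc, xas, X_t, X_t1, X_tN, n, N):
    """Update (x_c, x_as) from time t to t+1 for output index n.

    X_t, X_t1 and X_tN stand for X(t), X(t+1) and X(t+N). The input stage
    needs three multipliers, one more than the forward DCT/DST module.
    """
    th = math.pi * (2 * n + 1) / (2 * N)
    c, s = math.cos(th), math.sin(th)
    bar_c = xc - HALF * X_t + (HALF - 1.0) * X_t1 * c
    bar_as = xas + (-1) ** n * X_tN + (HALF - 1.0) * X_t1 * s
    return bar_c * c + bar_as * s, bar_as * c - bar_c * s

# Usage with an arbitrary transform-domain sequence (N >= 2 is assumed).
N, n = 5, 2
X = [0.5 * math.cos(1.3 * k) - 0.2 * k for k in range(N + 2)]
xc0, xas0 = idct_aidst_direct(X, 0, n, N)
xc1, xas1 = idct_lattice_step(xc0, xas0, X[0], X[1], X[N], n, N)
```

The pair `(xc1, xas1)` should agree with `idct_aidst_direct(X, 1, n, N)` up to roundoff.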
The complete lattice module for the IDCT and AIDST is shown in Fig. 5. This IDCT lattice structure has the same lattice module as that of the DCT except for the input stage, where one more adder and one more multiplier are required. The procedure to calculate the inverse transformed data is the same. Therefore, this IDCT lattice structure has the same advantages as that of the DCT. To obtain the inverse transform in parallel, we need N such IDCT lattice modules, where 7N multipliers and 6N adders are required. Again, we see that the numbers of adders and multipliers are linear functions of N. Here we should notice that to obtain the inverse transform of the original input data sequence, for example, [x(0), x(1), x(2), ...,
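The lattice module of this rearranged IDST and AIDCT is shown in Fig. 7. This structure differs from all the previous lattice modules in that the input signals are added at the end of the lattice. From now on, we call this lattice structure a postlattice module and the previous ones prelattice modules. This postlattice module needs 7 multipliers and 7 adders, fewer than required for the corresponding prelattice module. A parallel postlattice structure, which generates N transformed data simultaneously, requires 7N multipliers and 7N adders. All the forward and inverse transform pairs mentioned above have prelattice and postlattice structures. Not all postlattice structures are superior to their prelattice counterparts in hardware complexity. For example, the IDCT and AIDST postlattice form can be expressed as

$$x_c(n,t+1) = x_c(n,t)\cos\frac{\pi(2n+1)}{2N} + x_{as}(n,t)\sin\frac{\pi(2n+1)}{2N} - \frac{1}{\sqrt{2}}X(t)\cos\frac{\pi(2n+1)}{2N} + \left(\frac{1}{\sqrt{2}} - 1\right)X(t+1) + (-1)^n X(t+N)\sin\frac{\pi(2n+1)}{2N} \tag{38}$$

and

$$x_{as}(n,t+1) = x_{as}(n,t)\cos\frac{\pi(2n+1)}{2N} - x_c(n,t)\sin\frac{\pi(2n+1)}{2N} + \frac{1}{\sqrt{2}}X(t)\sin\frac{\pi(2n+1)}{2N} + (-1)^n X(t+N)\cos\frac{\pi(2n+1)}{2N}. \tag{39}$$

This postlattice module has 9 multipliers and 7 adders, which are more than its prelattice realization requires. As to the DCT and DST, the postlattice form can be expressed as

$$X_c(k,t+1) = X_c(k,t)\cos\frac{\pi k}{N} + X_s(k,t)\sin\frac{\pi k}{N} + \frac{2}{N}\left[-x(t) + (-1)^k x(t+N)\right]\cos\frac{\pi k}{2N}$$

and

$$X_s(k,t+1) = X_s(k,t)\cos\frac{\pi k}{N} - X_c(k,t)\sin\frac{\pi k}{N} - \frac{2}{N}\left[-x(t) + (-1)^k x(t+N)\right]\sin\frac{\pi k}{2N}.$$

In this case, the prelattice and postlattice modules have the same numbers of multipliers and adders.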
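The equivalence of the prelattice and postlattice forms for the DCT and DST follows from the angle-difference identities, and can be confirmed with a short numerical sketch. The function names `prelattice_step` and `postlattice_step` are hypothetical, and `e` denotes the combined input term (2/N)[-x(t) + (-1)^k x(t + N)].

```python
import math

def prelattice_step(Xc, Xs, e, k, N):
    """Inputs are added first, then the plane rotation is applied."""
    bar_c = Xc + e * math.cos(math.pi * k / (2 * N))
    bar_s = Xs + e * math.sin(math.pi * k / (2 * N))
    c, s = math.cos(math.pi * k / N), math.sin(math.pi * k / N)
    return bar_c * c + bar_s * s, bar_s * c - bar_c * s

def postlattice_step(Xc, Xs, e, k, N):
    """The plane rotation is applied first, then the inputs are added."""
    c, s = math.cos(math.pi * k / N), math.sin(math.pi * k / N)
    new_c = Xc * c + Xs * s + e * math.cos(math.pi * k / (2 * N))
    new_s = Xs * c - Xc * s - e * math.sin(math.pi * k / (2 * N))
    return new_c, new_s

# Usage: both forms use four rotation multipliers and two input multipliers.
pre = prelattice_step(0.8, -0.3, 0.25, 2, 9)
post = postlattice_step(0.8, -0.3, 0.25, 2, 9)
```

The tuples `pre` and `post` should coincide up to roundoff for any state, input, k, and N.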
IV. DISCRETE HARTLEY TRANSFORM (DHT)
According to Bracewell's definition of the DHT in [6], the data sequence x(n) and the DHT transformed data X(k) have the following relation:
The DHT uses the real expression cos(2πk(n - t)/N) + sin(2πk(n - t)/N) as the transform kernel, while the discrete Fourier transform (DFT) uses the complex exponential expression exp(i2πk(n - t)/N) as the transform kernel. Because the kernel of the DHT is a summation of cosine and sine terms, we can separate them into a combination of a DCT-like and a DST-like transform as follows:
The lattice module for the DHT is shown in Fig. 8. For the case of k = 0, the lattice structure can be simplified as shown in Fig. 9. From Fig. 8, we can see that the numbers of multipliers and adders are less than those of
V. BLOCK PROCESSING
All the time-recursive discrete transforms derived above are based on the block-size-one update, which means the time index is updated by one. That is, at each iteration only the effect of one old datum is removed and the information of one new datum is added. We are interested in the relation between the area-time complexity (AT) and block size. This motivates us to discuss the effect on the lattice structure when the block size is increased.
A. Block Processing of Time-Recursive DCT and DST
We begin the discussion of block processing with the block-size-two update. Here we assume the time index t in (1) is zero for simplicity, and we will use this in the following discussions. As before, the transformed data $X_c(k,2)$ and $X_s(k,2)$ are those of the input vector [x(2), x(3), ..., x(N + 1)]. To obtain $X_c(k,2)$ from $X_c(k,0)$ and $X_s(k,2)$ from $X_s(k,0)$ directly, we can rewrite $X_c(k,2)$ and $X_s(k,2)$ as

$$X_c(k,2) = \bar{X}_c(k,2)\cos\frac{2\pi k}{N} + \bar{X}_s(k,2)\sin\frac{2\pi k}{N}$$

and

$$X_s(k,2) = \bar{X}_s(k,2)\cos\frac{2\pi k}{N} - \bar{X}_c(k,2)\sin\frac{2\pi k}{N}$$

where

$$\bar{X}_c(k,2) = X_c(k,0) + \frac{2}{N}\left[-x(0) + (-1)^k x(N)\right]\cos\frac{\pi k}{2N} + \frac{2}{N}\left[-x(1) + (-1)^k x(N+1)\right]\cos\frac{3\pi k}{2N}$$

and

$$\bar{X}_s(k,2) = X_s(k,0) + \frac{2}{N}\left[-x(0) + (-1)^k x(N)\right]\sin\frac{\pi k}{2N} + \frac{2}{N}\left[-x(1) + (-1)^k x(N+1)\right]\sin\frac{3\pi k}{2N}.$$

The lattice module for the block-size-two update is shown in Fig. 10. There are two more multipliers in the lattice, i.e., the total number of multipliers is eight. To obtain the transformed data in parallel, we need N such lattice modules.
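The block-size-two update generalizes naturally to a block of m samples. The sketch below is one way to write it, assuming the 2/N normalization and 1 <= k <= N - 1; the generalization to arbitrary m is written out here as an assumption derived from the definitions, and `block_update` and `dct_dst_at` are hypothetical names.

```python
import math

def dct_dst_at(x, t, k, N):
    """Direct X_c(k, t) and X_s(k, t) for 1 <= k <= N-1 (reference only)."""
    xc = sum(x[n] * math.cos(math.pi * (2 * (n - t) + 1) * k / (2 * N))
             for n in range(t, t + N))
    xs = sum(x[n] * math.sin(math.pi * (2 * (n - t) + 1) * k / (2 * N))
             for n in range(t, t + N))
    return 2.0 * xc / N, 2.0 * xs / N

def block_update(Xc, Xs, x, k, N, m):
    """Advance (X_c(k, 0), X_s(k, 0)) to (X_c(k, m), X_s(k, m)) in one step.

    x must hold at least N + m samples. Each of the m old/new sample pairs
    contributes one cosine and one sine input multiplier, so the module
    uses 2m input multipliers plus 4 for the rotation by m*pi*k/N.
    """
    bar_c, bar_s = Xc, Xs
    for i in range(m):
        e = (2.0 / N) * (-x[i] + (-1) ** k * x[N + i])
        a = math.pi * (2 * i + 1) * k / (2 * N)
        bar_c += e * math.cos(a)
        bar_s += e * math.sin(a)
    c = math.cos(m * math.pi * k / N)
    s = math.sin(m * math.pi * k / N)
    return bar_c * c + bar_s * s, bar_s * c - bar_c * s

# Usage: block size two, as in the text.
N, k, m = 6, 5, 2
x = [math.sin(0.9 * n) - 0.05 * n * n for n in range(N + 4)]
Xc0, Xs0 = dct_dst_at(x, 0, k, N)
Xc2, Xs2 = block_update(Xc0, Xs0, x, k, N, m)
```

With m = 2 the result `(Xc2, Xs2)` should agree with `dct_dst_at(x, 2, k, N)`; setting m = 1 recovers the block-size-one update.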
Combining those input terms with the same cosine and sine multiplier coefficients together, we can obtain the lattice module for block size m as shown in Fig. 11. To obtain the transform data X(k) in parallel, N lattice modules of Fig. 11 are required. The total number of multipliers of the parallel structure is (4 + 2m)N, the total number of adders is (3m + 2)N, and the throughput is 1. The areatime complexity due to multipliers and adders are 2N)N + 3N2 log ( 3 N 2 ) . In general, the area-time product gets smaller as block size m decreases. We found that when m = 1, the minimum AT complexity is achieved.
VI. MULTIPLIER REDUCTION OF THE LATTICE STRUCTURE
In this section, we develop two methods to reduce the number of multipliers in our parallel lattice structures. The first scheme makes use of a series-input series-output (SISO) approach, and 2N multipliers can be saved; the tradeoff is that the latency and throughput are increased. The second approach, which reconstructs the structure into a double-lattice realization, saves N multipliers while the latency remains intact.

A. SISO Approach
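Let us consider this problem through a general lattice structure as shown in Fig. 12. Denote the output and input data at time t as $(X_c(t), X_s(t))$ and $(x_{ct}, x_{st})$, respectively, where the input and output have the following relations:

$$X_c(t) = \left[X_c(t-1) + \Gamma_1 x_{ct}\right]\Gamma_2 + \left[X_s(t-1) + \Gamma_3 x_{st}\right]\Gamma_4$$

$$X_s(t) = \left[X_s(t-1) + \Gamma_3 x_{st}\right]\Gamma_2 - \left[X_c(t-1) + \Gamma_1 x_{ct}\right]\Gamma_4.$$

By dividing both equations by $\Gamma_4$, we have

$$X_c(t)/\Gamma_4 = \left[X_c(t-1) + \Gamma_1 x_{ct}\right]\Gamma_2/\Gamma_4 + \left[X_s(t-1) + \Gamma_3 x_{st}\right] \tag{65}$$

$$X_s(t)/\Gamma_4 = \left[X_s(t-1) + \Gamma_3 x_{st}\right]\Gamma_2/\Gamma_4 - \left[X_c(t-1) + \Gamma_1 x_{ct}\right]. \tag{66}$$

The lattice structure manifesting the above relations is shown in Fig. 13. It is noted that only four multipliers exist in this structure and the outputs obtained differ from the original ones by a factor $\Gamma_4$. To examine the effect of this multiplier reduction on the recursive operation from $X_c(1)$ to $X_c(N)$, we start with the derivation from t = 1, where the outputs are $X_c(1)/\Gamma_4$ and $X_s(1)/\Gamma_4$. Feeding these scaled outputs back at t = 2 and dividing by $\Gamma_4$ again gives

$$X_c(2)/\Gamma_4^2 = \left[X_c(1)/\Gamma_4 + (\Gamma_1/\Gamma_4)\,x_{c2}\right]\Gamma_2/\Gamma_4 + \left[X_s(1)/\Gamma_4 + (\Gamma_3/\Gamma_4)\,x_{s2}\right]$$

$$X_s(2)/\Gamma_4^2 = \left[X_s(1)/\Gamma_4 + (\Gamma_3/\Gamma_4)\,x_{s2}\right]\Gamma_2/\Gamma_4 - \left[X_c(1)/\Gamma_4 + (\Gamma_1/\Gamma_4)\,x_{c2}\right].$$

That is, the coefficients of the input multipliers are $\Gamma_1/\Gamma_4$ and $\Gamma_3/\Gamma_4$ instead of the $\Gamma_1$ and $\Gamma_3$ used at time t = 1, and the outputs are $X_c(2)/\Gamma_4^2$ and $X_s(2)/\Gamma_4^2$. For t = N, the recursive equations become

$$X_c(N)/\Gamma_4^N = \left[X_c(N-1)/\Gamma_4^{N-1} + (\Gamma_1/\Gamma_4^{N-1})\,x_{cN}\right]\Gamma_2/\Gamma_4 + \left[X_s(N-1)/\Gamma_4^{N-1} + (\Gamma_3/\Gamma_4^{N-1})\,x_{sN}\right]$$

$$X_s(N)/\Gamma_4^N = \left[X_s(N-1)/\Gamma_4^{N-1} + (\Gamma_3/\Gamma_4^{N-1})\,x_{sN}\right]\Gamma_2/\Gamma_4 - \left[X_c(N-1)/\Gamma_4^{N-1} + (\Gamma_1/\Gamma_4^{N-1})\,x_{cN}\right].$$

From the above derivations, we observe that two multipliers can be removed by using variable multipliers in the input stage, where the coefficients $(\Gamma_1, \Gamma_1/\Gamma_4, \ldots, \Gamma_1/\Gamma_4^{N-1})$ and $(\Gamma_3, \Gamma_3/\Gamma_4, \ldots, \Gamma_3/\Gamma_4^{N-1})$ are stored in shift registers. The structure is shown in Fig. 14. Additional registers are required and the latency becomes 2N instead of N. Also, this resulting structure is a SISO system, while the original parallel structure is a SIPO system. For example, the variable-multiplier method derived above can be applied to the lattice structure of the DCT and DST. There are no multipliers needed for k = 0; therefore the module remains the same. For k = 1, 2, ..., N - 1, the multiplier-reduced lattice structure is shown in Fig. 15, where the coefficients are $\Gamma_1 = \cos(k\pi/2N)$, $\Gamma_2 = \cos(k\pi/N)$, $\Gamma_3 = \sin(k\pi/2N)$, and $\Gamma_4 = \sin(k\pi/N)$. The total number of multipliers is 4N - 2 and the latency for this SISO structure is 2N.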
It is readily seen that the SISO approach for multiplier reduction is in fact a denormalization of the orthogonal rotation in the lattice. It is well known that the orthogonal rotation is numerically stable, so the roundoff errors will not be accumulated. However, the denormalized lattice does not have such a nice numerical property in finite-precision implementation, i.e., the roundoff errors may continue to accumulate and lower the signal-to-noise ratio. This effect can be minimized by providing enough register length, such as double precision, in the implementation. Also, we note that since $\Gamma_4 < 1$, $\Gamma_4^t$ could be very small. Insufficient precision may result in bad numerical accuracy when $\Gamma_4^t$ is multiplied at the output stage. Thus, the registers that store $\Gamma_4^t$ do need enough precision to avoid the accuracy problem. The problems addressed here are consequences of the tradeoff between complexity and performance.
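The denormalized recursion of the SISO approach can be compared against the normalized lattice in floating point. This is a minimal sketch under the general lattice relations with coefficients Γ1 through Γ4 (written `g1` to `g4`); the function names are hypothetical, and the final rescaling by Γ4^t models the multiplication at the output stage.

```python
import math

def normalized_run(inputs, g1, g2, g3, g4):
    """Normalized lattice: two input multipliers plus four rotation multipliers."""
    Xc = Xs = 0.0
    out = []
    for a, b in inputs:
        u = Xc + g1 * a
        v = Xs + g3 * b
        Xc, Xs = u * g2 + v * g4, v * g2 - u * g4
        out.append((Xc, Xs))
    return out

def denormalized_run(inputs, g1, g2, g3, g4):
    """Denormalized (SISO) lattice: the state holds X(t) / g4**t.

    Only four multipliers act per step: the two variable input
    coefficients g1/g4**(t-1), g3/g4**(t-1) and the ratio g2/g4 twice.
    """
    Yc = Ys = 0.0
    ratio = g2 / g4
    scale = 1.0                      # equals g4**(t-1) at the start of step t
    out = []
    for a, b in inputs:
        u = Yc + (g1 / scale) * a
        v = Ys + (g3 / scale) * b
        Yc, Ys = u * ratio + v, v * ratio - u
        scale *= g4                  # now g4**t
        out.append((Yc * scale, Ys * scale))   # output-stage rescaling
    return out

# Usage with the DCT/DST coefficients for k = 3, N = 8.
N, k = 8, 3
g1, g2 = math.cos(k * math.pi / (2 * N)), math.cos(k * math.pi / N)
g3, g4 = math.sin(k * math.pi / (2 * N)), math.sin(k * math.pi / N)
data = [(0.3 * t - 1.0, 0.3 * t - 1.0) for t in range(1, N + 1)]
ref = normalized_run(data, g1, g2, g3, g4)
siso = denormalized_run(data, g1, g2, g3, g4)
```

In double precision the two runs should agree closely for small N; for modules where Γ4 is small, the factor Γ4^t shrinks quickly, which is the precision concern raised above.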
B. Double-Lattice Approach
Generally, a postlattice structure has the following forms:

$$X_c(k) = \Gamma_2 X_c(k-1) + \Gamma_4 X_s(k-1) + \Gamma_1 x_{ck}$$
$$X_s(k) = \Gamma_2 X_s(k-1) - \Gamma_4 X_c(k-1) + \Gamma_3 x_{sk}. \tag{70}$$

These can be rewritten as

$$X_c(k) = \left[(t_1 + t_2) \gg 1\right] + \Gamma_1 x_{ck}$$
$$X_s(k) = \left[(t_1 - t_2) \gg 1\right] + \Gamma_3 x_{sk} - 2\Gamma_4 X_c(k-1) \tag{73}$$

where

$$t_1 = (\Gamma_2 + \Gamma_4)\left[X_c(k-1) + X_s(k-1)\right] \tag{74}$$
$$t_2 = (\Gamma_2 - \Gamma_4)\left[X_c(k-1) - X_s(k-1)\right] \tag{75}$$

and $\gg 1$ denotes right shift by one bit. The operational flow chart of (73) is illustrated in Fig. 16. Instead of calculating the outputs from (70) directly (which requires 6 multipliers and 4 adders), the first lattice adds and subtracts $X_c(k-1)$ and $X_s(k-1)$, then multiplies the results by $\Gamma_2 + \Gamma_4$ and $\Gamma_2 - \Gamma_4$, respectively. The results are called $t_1$ and $t_2$ as defined in (74) and (75). The second lattice adds and subtracts $t_1$ and $t_2$ again, then divides the results by 2, which can be achieved by right shifting. Finally, we complete the computations by adding the inputs $\Gamma_1 x_{ck}$ and $\Gamma_3 x_{sk} - 2\Gamma_4 X_c(k-1)$. This reconstruction can save one multiplier. A parallel postlattice structure with N lattice modules requires 6N multipliers and 4N adders. As for this reconstructed parallel structure, only 5N multipliers and 7N adders are needed. This approach can be applied to all the parallel postlattice structures of different orthogonal transforms. In general, this parallel double-lattice structure can save N multipliers, but requires 3N more adders. The latency is N clock cycles and the system remains SIPO.
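The double-lattice rearrangement can be checked against the direct postlattice form. The sketch below uses floating-point halving in place of the one-bit right shift that a fixed-point datapath would use; the function names and argument names are hypothetical.

```python
import math

def postlattice_direct(Xc, Xs, xc_in, xs_in, g1, g2, g3, g4):
    """Direct postlattice form: six multiplications per update."""
    new_c = g2 * Xc + g4 * Xs + g1 * xc_in
    new_s = g2 * Xs - g4 * Xc + g3 * xs_in
    return new_c, new_s

def double_lattice(Xc, Xs, xc_in, xs_in, g1, g2, g3, g4):
    """Double-lattice form: five multiplications per update.

    g2 + g4, g2 - g4 and 2*g4 are constants that would be precomputed
    and stored as multiplier coefficients.
    """
    t1 = (g2 + g4) * (Xc + Xs)       # first lattice, upper branch
    t2 = (g2 - g4) * (Xc - Xs)       # first lattice, lower branch
    new_c = 0.5 * (t1 + t2) + g1 * xc_in
    new_s = 0.5 * (t1 - t2) + (g3 * xs_in - (2.0 * g4) * Xc)
    return new_c, new_s

# Usage with DCT/DST-style coefficients for k = 2, N = 7.
N, k = 7, 2
g1, g2 = math.cos(k * math.pi / (2 * N)), math.cos(k * math.pi / N)
g3, g4 = math.sin(k * math.pi / (2 * N)), math.sin(k * math.pi / N)
direct = postlattice_direct(0.4, -1.1, 0.7, 0.7, g1, g2, g3, g4)
double = double_lattice(0.4, -1.1, 0.7, 0.7, g1, g2, g3, g4)
```

The two results should coincide up to roundoff, since (t1 + t2)/2 equals Γ2 X_c + Γ4 X_s and (t1 - t2)/2 equals Γ2 X_s + Γ4 X_c.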
VII. COMPARISONS OF ARCHITECTURES
From the previous discussions, we see that the proposed unified parallel lattice structures have many attractive features. There are no constraints on the transform size N. The structure dually generates the two discrete transforms, DCT and DST, simultaneously. Since it produces the transformed data of the subsequent input vector every clock cycle, it is especially efficient for systems with serial input data such as communication systems. Further, the structure is regular, modular, and without global communication. As a consequence, it is suitable for VLSI implementation.
Here, we would like to compare our lattice structures of the DCT and DST with those proposed in [14], [15], [7]. The architecture in [14] uses the matrix factorization method, which is representative of fast algorithms. In [15], an improved fast structure with fewer multipliers is proposed. Hou's architecture in [7] uses recursive computations to generate the higher order DCT from the lower order one. The characteristics of these structures are discussed in the introduction. A comparison regarding their inherent properties is listed in Table I. To be clear, the quantitative comparisons in terms of the parameters, which are the numbers of multipliers, adders, and the latency, are given in Tables II-IV. The lattice architecture with six multipliers in the module as shown in Fig. 2 is called the Liu-Chiu1 structure, the one in Fig. 15 is called Liu-Chiu2, and the parallel structure with the double-lattice modules as shown in Fig. 16 is called Liu-Chiu3. The Liu-Chiu1 structure has 6N - 4 multipliers, 5N - 1 adders, and the latency is N. There are 4N multipliers, 5N - 1 adders, and the latency is 2N in the structure of Liu-Chiu2. The number of multipliers is reduced by the order of 2N at the expense of doubling the latency and the data flow becoming SISO. The Liu-Chiu3 architecture has 5N multipliers and 7N adders, and the latency is N clock cycles. From these tables, it is noted that the number of multipliers in our architectures is higher than that of others when N is small. This is due to the dual generation of two transforms structure which is compatible with Lee's. Since the numbers of multipliers and adders of our structures are on the order of N, our algorithms have fewer multipliers and adders than those proposed in [14], [15]. Although Hou's algorithm has the fewest multipliers, his architecture needs global communications and the design complexity is much higher than ours.
In addition, the operations of other structures cannot start until all of the data in the block arrive.
A comparison of our DHT structure based on the lattice module in Fig. 8 with different DHT algorithms [23], [18] is listed in Table V. The architecture in [23], a representative fast algorithm, is developed based on the existing FFT method. Chaitali-JaJa's algorithm in [18] decomposes the transform size N into mutually prime numbers and implements them in a systolic manner. Their structure needs extra registers and the latency is higher than others. It is easy to see that our structure is better than the others in terms of hardware complexity and speed.
VIII. FILTER BANK INTERPRETATION OF THE TIME-RECURSIVE SINUSOIDAL TRANSFORMS
Multirate digital filters and filter banks find applications in communications, speech processing, and image compression. There are two basic types of filter banks. An analysis bank is a set of analysis filters $H_k(z)$ and N-fold decimators which split a signal into N subbands. A synthesis filter bank (the right part of Fig. 17) consists of N synthesis filters $F_k(z)$ and N-fold interpolators, which combine N signals into a reconstructed signal $\hat{x}(n)$. As described in Section II, the time-recursive approach decomposes the transformed domain data into N different components. If we are interested in the block-size-N transform and perform the N-fold decimation at the outputs of every lattice module, the analysis bank is simply the series-input parallel-output filter bank described in Fig. 17. Under this condition, the analysis bank is equivalent to performing a transformation, and the synthesis bank to performing an inverse transformation, on successive blocks of N data samples. In this section, we describe how to employ the time-recursive concept to generate the synthesis banks based on the DCT, DST, and DHT.
A. Synthesis Bank Structure Based on DCT
To perform the inverse transform in the synthesis bank, we feed the DCT transformed domain components $X_c(k)$ into the synthesis modules and combine the outputs of all the synthesis modules to produce the original input sequence $x_c(n)$. That is, the synthesis bank performs the following inverse DCT operations:

$$x_c(n) = \sum_{k=0}^{N-1} C(k)\,X_c(k)\cos\left[\frac{\pi(2n+1)k}{2N}\right], \quad n = 0, 1, \ldots, N-1.$$

Since in the synthesis bank different transform components are sent to independent synthesis modules, we therefore focus on a specific transform component. Denote $x_c(n,k)$ as the output signal generated by a specific synthesis module:

$$x_c(n,k) = C(k)\,X_c(k)\cos\left[\frac{\pi(2n+1)k}{2N}\right].$$

The time-recursive concept can be applied here to update $x_c(n,k)$ recursively. Use the result in Section III-A that the IDCT and AIDST can be generated from each other recursively, and denote $x_{as}(n,k)$ as the auxiliary inverse sine transform generated by a specific synthesis filter:

$$x_{as}(n,k) = C(k)\,X_c(k)\sin\left[\frac{\pi(2n+1)k}{2N}\right].$$

We can obtain the following recursively generated relations for $x_c(n,k)$ and $x_{as}(n,k)$:

$$x_c(n+1,k) = x_c(n,k)\cos\frac{\pi k}{N} - x_{as}(n,k)\sin\frac{\pi k}{N}$$

and

$$x_{as}(n+1,k) = x_{as}(n,k)\cos\frac{\pi k}{N} + x_c(n,k)\sin\frac{\pi k}{N}.$$

This means that $x_c(n+1,k)$ and $x_{as}(n+1,k)$ can be generated by sending a sequence with $X_c(k)$ as the first element followed by N - 1 zeros into the input of the synthesis module. This is exactly the upsampling procedure required in the synthesis bank structure. The $x_{as}(n,k)$ output is reset every N clock cycles. The synthesis module diagram for the DCT is plotted in Fig. 18. The inverse transform is obtained by summing all the outputs of the synthesis modules.
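The recursive generation of the synthesis outputs can be illustrated with a short software sketch. It assumes the forward DCT normalization 2C(k)/N, so that the inverse needs no extra scale factor; `forward_dct`, `synthesis_module`, and `synthesis_bank` are hypothetical names, and the initial values at n = 0 are computed directly rather than through a modeled input stage.

```python
import math

HALF = 1.0 / math.sqrt(2.0)

def forward_dct(x):
    """Block DCT with the assumed normalization 2*C(k)/N (reference only)."""
    N = len(x)
    return [(2.0 * (HALF if k == 0 else 1.0) / N)
            * sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N)) for n in range(N))
            for k in range(N)]

def synthesis_module(Xk, k, N):
    """Generate x_c(n, k) for n = 0..N-1 by repeated plane rotation."""
    Ck = HALF if k == 0 else 1.0
    xc = Ck * Xk * math.cos(math.pi * k / (2 * N))    # x_c(0, k)
    xas = Ck * Xk * math.sin(math.pi * k / (2 * N))   # x_as(0, k)
    c, s = math.cos(math.pi * k / N), math.sin(math.pi * k / N)
    out = []
    for _ in range(N):
        out.append(xc)
        # Rotate by pi*k/N to move from output index n to n + 1.
        xc, xas = xc * c - xas * s, xas * c + xc * s
    return out

def synthesis_bank(X):
    """Sum the outputs of the N independent synthesis modules."""
    N = len(X)
    modules = [synthesis_module(X[k], k, N) for k in range(N)]
    return [sum(mod[n] for mod in modules) for n in range(N)]

# Usage: transform a block and reconstruct it through the synthesis bank.
block = [1.0, -0.5, 2.25, 0.0, 3.0, -1.75, 0.5]
recon = synthesis_bank(forward_dct(block))
```

The list `recon` should reproduce `block` up to roundoff, since each module contributes one term of the inverse DCT sum.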
B. Synthesis Bank Structure of the DST and DHT
In this section, we apply the same approach mentioned in the previous section to the DST and DHT. The results are summarized below. By using the dual generation concept, the synthesis module for the DST is obtained in the same way as that for the DCT. To obtain the IDHT $x_h(n)$, we must sum up both of the outputs of the synthesis modules. It is noted that the multiplier coefficients in the synthesis module for the DHT are different from those of the DCT and DST.
IX. CONCLUSIONS
In this paper, unified time-recursive algorithms and lattice structures that can be applied to the DCT, DST, DHT, and their inverse transforms are considered. In fact, there are various forms of sine and cosine transform pairs (the DCT-I/DST-I, DCT-II/DST-II, DCT-III/DST-III, DCT-IV/DST-IV, and complex lapped transform (CLT)) as mentioned in [22], [33]. They also have their time-recursive lattice realizations. The procedures to attain the lattice structures of different transforms are similar, and the resulting SIPO lattice structures differ only in the multiplying coefficients and the input stage. All the transform pairs have their pre- and postlattice realizations, which differ in that the input signals are added at the front and at the end of the lattice, respectively. The hardware complexity of the prelattice realizations and their postlattice counterparts depends on the definitions of the transforms, and it cannot be readily determined which one is better. The number of multipliers in all the parallel lattice structures is a linear function of the transform size N, and the latency is N clock cycles. Two methods, the SISO and double-lattice approaches, are developed to reduce the number of multipliers for the parallel lattice structures. The SISO approach can save 2N multipliers, and the latency becomes 2N. The double-lattice approach can save N multipliers, and the latency remains intact. From the discussion of the block processing, it is noted that the area-time complexity is efficient when the block size m is small, especially when m = 1. All the resulting parallel structures are modular, regular, and only locally connected. Further, there is no constraint on the transform size N. It is obvious that the design complexity of these structures is relatively low compared with other algorithms. The characteristics of these algorithms are suitable for processing serial input data since the transformed data for sequential input can be obtained every clock cycle.
Therefore, it is very attractive to VLSI implementations and high speed applications such as HDTV signal coding and transmission.
Since the orthogonal rotation is the major operation in the lattice, it is noted that such rotation can be easily implemented using coordinate rotation digital computer (CORDIC) [29] , [30] which is known as an efficient method for the computation of orthogonal rotations and trigonometric functions.
