I. INTRODUCTION
The discrete Hartley transform (DHT) [l] , [2] is considered to be a good alternative to the discrete Fourier transform (DFT) since it requires only real number computations, which are much simpler than the complex number computations required in the DFT. Therefore, the DHT has been widely applied in the field of digital signal processing. Though computing the DHT involves only real number computations, the M-D DHT is still computation-intensive. Thus, a dedicated VLSI implementation for the M-D DHT is necessary to provide a high-speed, low-cost means of computing the M-D DHT. [8] are based on (2) and (3). However, these designs are used to realize only the T(k1, kz) in (2), and they do not consider the issue of how to efficiently realize (3) to obtain H ( k 1 , k~). Note that the complexity of From the perspective of VLSI implementation, the O ( N 2 ) additions and the poor data locality of T(k1, kz) in (3) impose much overhead in hardware cost, computation time, and control circuits. Moreover, as higher dimensional DHT problems are considered, this overhead will increase exponentially.
A. The DifJiculty in

B. The Design Concepts and Main Results of This Paper
Being different from the conventional approach which computes the M-D transforms using the same kind of 1-D transforms, our basic concept is that any kind of 1-D transforms can be used to compute a M-D transform if there are advantages in doing so. Following this concept, we intend to derive a kind of 1-D transform to compute the M-D DHT through considering both the separable computation and the efficient hardware implementation.
The implementation of triangular functions and multiplications is a key consideration in the design of VLSI architectures for the DHT. The coordinate rotation digital computer (CORDIC) structure is considered to be a good choice for these operations [SI, [lo] . Fig. 1 shows the basic CORDIC processor operating in the circular coordinate system, which realizes the following operations The terms (2, y) and (z', y') denote the coordinates before and after rotation, respectively. $ denotes the rotation angle. Note that in Fig. 1 2' and y' are expressed as combinations of the terms z, y, cos($), and sin($). The special property of CORDIC is that the same hardware cab support different output functions by simply changing the assignments of the input data. For example, it supports the functions o f f . [cos($) 
C. Related Research
In the literature, various DHT algorithms for VLSI implementation have been considered [5]- [8] .
In the following, we analyze these designs from four different points of view: algorithm, architecture, implementation, and application.
Encapsulated algorithm: The algorithms used in designs [SI- [8] can be categorized into two groups, the fast Hartley transform (FHT) algorithms [SI, [6] and the DHT algorithms Following these analyses, we adopt the DHT algorithms instead of the FHT algorithms to design the VLSI architectures in this paper. Architecture topology: Among various VLSI architectures, systolic arrays [I41 are a type of architecture that provides both high processing speeds and convenient VLSI implementation. The designs in [SI- [8] are based on systolic arrays. Their architecture topology consists of linear arrays and 2-D arrays. Basically, a 2-D systolic array has an I/O cost depending on the transform length N. For linear arrays, we can restrict the YO channels on the array boundaries so that the YO cost can be kept independent of the transform length. The arrays designed in this paper are linear arrays. We utilize the Tug control scheme [15] to arrange all the VO channels located at the two extreme ends of the arrays so as to minimize the YO cost.
Implementation: Basically, the operations required in computing the DHT are triangular functions, multiplications, and additions. The triangular functions can be precomputed in a host computer and sent to the arrays for computation. However, such an approach would add overhead to the YO cost [5] . Altematively, the Givens Rotor [7] and CORDIC processor [6], [8] are two functional elements that have been used to realize the operations of the DHT. They both employ the shift-and-add computation as the basic computing style. Since the property of the CORDIC is beneficial in deriving the new M-D DHT algorithm, we use the CORDIC processor as the main computing element in the systolic arrays designed in this paper.
Application: Basically, the designs In the above equation, it is assumed that N = 2p and the integer p 2 2. Note that (7) possesses properties similar to those of (6), but requires lighter computational load than (6) owing to two features.
First, the numbers of CORDIC rotations required to compute C,(k,) and S, (kl) are reduced to one half of those required in (6). That is, the upper limit of the index n, is reduced from N in (6) to : in (7). This phenomenon results in the reduction of the computation time by a factor of 2. Second, the computations of C, (k,) and C, (k, + :
) share the same intermediate results obtained from the CORDIC processors, except that the results may have different signs. That is, we can use a CORDIC processor to compute two output results at a time, which reduces the number of the CORDIC processors by a factor of 2. The same phenomenon is also found in computing S,(kt) and S,(k,+ : ) .
This saving in the number of the CORDIC processors will be even more apparent if pipelined CORDIC structures [ 161 are utilized.
HARDWARE IMPLEMENTATION
To illustrate the array architecture for the M-D DHT, in this section we consider an example of a 3-D 8 x 8 x 8 DHT.
Dependence graph: Fig. 2 shows the dependence graph (DG) to compute C, (k; ) and Si (k; ) in (7). The DG clearly illustrates the data 21 and CORDY:![x, y, z] denote the outputs of the CORDIC processor with inputs x, y, and 2. C,(k,) and Ss(ks), respectively, denote C,(kl,... , k t , n t + l , . . . , n~) and S1(ki,...,k,,n,+i,...,n~) . C,-i(n,) and S,-i(nz), respectively, denote C,-1 (kl,. . . , kt-l, n , , . . . , n~) and S,-1 (kl,. . . , k,-1, n , ,
S,(k,). (h)
The function of the nodes (an,k, =
'
' . ' , n M ) ) .
operations, the data dependency, and the control signals involved in this algorithm. In the DG, the nodes represent the operations to be
I I
Si-1(3) Si-10 Cil(3) Ci-1(7) Si1 (2) Si-1 (6) Ckl ( (7) Sip) Si(7)
2 a n k 8 ). Fig. 3. (a) The array architecture for computing c,(kl) and S,(kt). (b) The function of the PE's (\knzkz = executed, including the CORDIC rotations and additions as described in Fig. 2(b) . The directed arcs indicate a data dependency between two neighboring nodes; that is, the computed result from one node should be sent along an arc to be operated on the other node. Based on the DG-based array synthesis procedure [15] , [17] , we can obtain a linear array by suitably projecting the DG. If the DG shown in Fig. 2 is projected vertically, a PE should perform all the operations in the nodes along a certain column of the DG. Since there are addition paths dong the c o h n n , e.g., cl' = c l + CORDY2[x3, X I , &,,k,], the PE's would have accumulation loops. On the other hand, if the DG is projected horizontally, there will be data transmission loops in the PE's. Since the consumed time for the data transmission loops is shorter than that for the accumulation loops, using the two-level pipelining technique [18] , [19] can further reduce the cycle time of the array from the addition time to the data transmission time. This fact facilitates the speed improvement of the designed array.
Array architecture for the 1-D transform: Fig. 3 illustrates the linear array obtained from projecting the DG shown in Fig. 2 horizontally. In this array, the input data are first preprocessed and piped in from the topmost PE and then transmitted to the neighboring PE's rhythmically. Tug1 and Tug2 are 1-b control signals to decide which data operands the P E S select. Since the rotation angles used in the array are known in advance, the rotation directions {EcJ, i , j = 0,1,. . . , -l} instead of the rotation angles { y t 3 , i , j = 0,1,. . . , $ -1) are precalculated and stored in ROM's to reduce the hardware and U 0 cost. The output results are accumulated and drained out from the bottommost PE. Unlike the design in [8] , we use the Tug control scheme [15] to locate all the U 0 channels at the boundary PE's, which makes the U 0 cost of the array independent of the transform length N. Note that there is a scaling operator in the end of the linear array to perform the scaling operations required in the CORDIC processors. This operator can be easily realized by using memory look-up tables or a CORDIC processor operated in linear rotation mode [16] . Exploiting the symmetry of the triangular functions, the linear array requires only PE's, where each PE consists of a CORDIC processor, a ROM with N words, several data multiplexing circuits, and four adders, as illustrated in Fig. 4 . Additional two adders and substractors are required in the array for data preprocessing. The throughput of the array is two samples per cycle. The average computation time to compute a 1-D N-point transform is $ Tcycle.
The term Tcycle, which denotes the cycle time of the array, includes the consumed time for the data multiplexing, a CORDIC rotation, and one addition. The speed of the array can be further enhanced for high-speed DHT applications by utilizing the techniques presented in [20] - [22] , such as the pipelined CORDIC, the forward angle recording CORDIC algorithm, and the CORDIC algorithms with redundant arithmetic and reduced iterations. ( k l , . . . , k,,nt+l,...,nM) and   B(ki,....k,-i,n,,...,n~) . By splitting the arguments of the cosine and sine functions in C M ( k M ) and S M ( k M ) , we obtain APPENDLX B This appendix illustrates the derivation from (6) to (7). In the following, it is assumed that N = 2p and the integer p 2 2. First, considering the C, ( kt ) in (6) and using the symmetry of the cosine function, we obtain 
