Abstract: Recently, a novel systolic structure has been proposed for the computation of the DFT for transform length N = 4M, M being prime to 4. In this paper, a similar structure is proposed for the computation of the DHT by prime factor decomposition. The scheme of prime factor decomposition used in this paper is different from that used in Reference 4. A new recursive algorithm is also proposed for efficient computation of the DHT by a linear systolic array of cordic processing elements (PE). It is shown that the proposed structure has nearly the same hardware requirement as the corresponding DFT structure [5] for real-valued data, but it yields significantly higher throughput.
Introduction
The discrete Hartley transform (DHT) of a real-valued sequence {x(n)} is defined [1] as
$$X(k) = \frac{1}{N}\sum_{n=0}^{N-1} x(n)\,\mathrm{cas}\left(\frac{2\pi kn}{N}\right), \qquad k = 0, 1, \ldots, N-1$$
and
$$\mathrm{cas}\,\theta = \cos\theta + \sin\theta$$
where 1/N is the scale factor. Note: 1/N is ignored in the rest of the paper. This transform has been emerging as a useful alternative to the discrete Fourier transform (DFT) because it avoids complex arithmetic in various signal processing applications. Several attempts have therefore been made to develop efficient DHT algorithms to increase the computational speed, but only a few hardware implementations of the DHT have been reported so far. Boussakta and Holt [2] have proposed a method for computing the DHT by Fermat number transform (FNT) using a VLSI chip. Chakrabarti and Ja'Ja' [3] have presented a bit-serial solution for small DHT modules. They have also suggested a systolic architecture for the prime factor DHT [4], which is computed via four temporary outputs.
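As a minimal sketch of the definition, the following Python evaluates the DHT directly in O(N²) operations with the kernel cas θ = cos θ + sin θ, leaving out the 1/N scale factor as the paper does. The function names `cas`, `dht` and `dft` are illustrative choices, not taken from the paper. The sketch also checks the well-known relation between the unscaled DHT of a real sequence and its DFT: each DHT component equals the real part minus the imaginary part of the corresponding DFT component.

```python
import cmath
import math


def cas(theta):
    """Hartley kernel: cas(theta) = cos(theta) + sin(theta)."""
    return math.cos(theta) + math.sin(theta)


def dht(x):
    """Direct O(N^2) DHT of a real sequence, without the 1/N scale factor."""
    N = len(x)
    return [sum(x[n] * cas(2 * math.pi * k * n / N) for n in range(N))
            for k in range(N)]


def dft(x):
    """Direct O(N^2) DFT with kernel exp(-j*2*pi*k*n/N), used only for comparison."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]


if __name__ == "__main__":
    x = [1, 2, 3, 4]
    H = dht(x)
    F = dft(x)
    # Each DHT component should match Re(F) - Im(F) up to rounding.
    for hk, fk in zip(H, F):
        print(round(hk, 9), round(fk.real - fk.imag, 9))
```

Applying `dht` twice and dividing by N returns the input sequence, which is why the same structure can serve for the forward and the inverse transform.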
Recently, Jones [5] has proposed a novel systolic structure for the computation of the DFT for transform length N = 4M, M being prime to 4. In this paper, we propose a structure similar to that of Reference 5 for the computation of the prime factor DHT. The scheme of prime factor decomposition used in this paper is different from that used in Reference 4.
© IEE, 1993
where $z(n) = x(n) + jx(N-n)$, $a = e^{-j2\pi/N}$ and
and N is an odd number. Note: The upper limit of the summation index n of eqn. 2 would be N/2 when N is even. The DHT components given by eqn. 2 may be written as
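The folding of x(n) and x(N−n) into one complex sample can be checked numerically. The sketch below is an illustration under stated assumptions, not the paper's eqn. 2: it assumes a folded sum of the form Σ z(n)·a^{kn} over n = 1 … (N−1)/2 with a = e^{−j2π/N} and odd N, and the names `dht_direct` and `folded_pair` are hypothetical. Adding the real and imaginary parts of the folded sum for z(n) = x(n) + jx(N−n) gives the (N−k)th DHT component, and doing the same with the real and imaginary parts of z(n) swapped gives the kth component, once x(0) is added to each.

```python
import cmath
import math


def dht_direct(x):
    """Reference O(N^2) DHT without the 1/N scale factor."""
    N = len(x)
    return [sum(x[n] * (math.cos(2 * math.pi * k * n / N)
                        + math.sin(2 * math.pi * k * n / N))
                for n in range(N))
            for k in range(N)]


def folded_pair(x, k):
    """Return (X(k), X(N-k)) for odd N using only (N-1)/2 complex products each.

    Assumed folded form (not taken from the paper):
        sum over n = 1..(N-1)/2 of z(n) * a**(k*n), with a = exp(-j*2*pi/N).
    """
    N = len(x)
    if N % 2 == 0:
        raise ValueError("this sketch handles odd N only")
    acc_k = 0j    # uses x(N-n) + j*x(n): real + imag gives the X(k) terms
    acc_nk = 0j   # uses x(n) + j*x(N-n): real + imag gives the X(N-k) terms
    for n in range(1, (N - 1) // 2 + 1):
        w = cmath.exp(-2j * math.pi * k * n / N)
        acc_k += complex(x[N - n], x[n]) * w
        acc_nk += complex(x[n], x[N - n]) * w
    return (x[0] + acc_k.real + acc_k.imag,
            x[0] + acc_nk.real + acc_nk.imag)


if __name__ == "__main__":
    x = [3.0, -1.0, 4.0, 1.5, -5.0, 9.0, 2.0]
    ref = dht_direct(x)
    for k in range(len(x)):
        print(k, folded_pair(x, k), (ref[k], ref[(len(x) - k) % len(x)]))
```

Both accumulations multiply by the same phase factor, which is the property that lets a single rotation sequence serve the pair of outputs.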

Cordic circuits for processing elements
Eqns. 5 and 6 imply that the complex multiplications of the form $(x_0 + jy_0)e^{-j\alpha}$ as well as $(y_0 + jx_0)e^{-j\alpha}$ may be computed via phase rotations and scaling. By setting $\delta_i$ to be a power of 2, phase rotations can be achieved by repeated shift-add operations. It is shown [7] that scaling can also be performed by repeated shift-add operations. The pair of complex multiplications required in each recursion to compute Y(k) and Y(N−k) may therefore be performed in a cordic circuit by the same sequence of shift-add operations. Two different cordic circuits to be used for these multiplications are depicted in Fig. 1. The systolic array consisting of (N + 1)/2 cordic PEs to compute the N-point DHT is shown in Fig. 2a. The function of the (k + 1)th PE is described in Fig. 2b. The addition of the real parts with the corresponding imaginary parts of the outputs of the (k + 1)th PE yields the kth and (N−k)th DHT components.
The prime factor DHT algorithm and its implementation
Eqns. 10c and 10d define the even and the odd parts, respectively, of the DHT of the elements of the $n_2$th column of $x(n_1, n_2)$; the even part is denoted $u(k_1, n_2)$. $w(k_1, 0)$ in eqn. 10e denotes the DHT of the zeroth column of $x(n_1, n_2)$.
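The shift-add phase rotation used by the cordic PEs can be sketched as follows. This is a generic textbook cordic rotation in floating point, not the circuits of Fig. 1: multiplication by 2^(−i) stands in for a hardware shift, the name `cordic_rotate` and its parameters are hypothetical, and the final gain correction is done here with one division, whereas the paper performs the scaling by shift-add operations as well [7]. The sketch assumes a rotation angle within ±π/2.

```python
import math


def cordic_rotate(x, y, angle, iterations=16):
    """Rotate the vector (x, y) by `angle` radians using shift-add micro-rotations.

    Assumes |angle| <= pi/2. Each micro-rotation turns the vector by
    +/- atan(2**-i) and stretches it by sqrt(1 + 2**(-2*i)).
    """
    z = angle  # residual angle still to be rotated
    for i in range(iterations):
        sigma = 1.0 if z >= 0 else -1.0
        shift = 2.0 ** -i  # a right shift by i bits in fixed-point hardware
        x, y = x - sigma * y * shift, y + sigma * x * shift
        z -= sigma * math.atan(shift)
    # Accumulated stretch of all micro-rotations; removed in one step here for brevity.
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 4.0 ** -i)
    return x / gain, y / gain


if __name__ == "__main__":
    # (x0 + j*y0) * exp(-j*alpha) corresponds to rotating (x0, y0) by -alpha.
    x0, y0, alpha = 3.0, -2.0, 0.7
    exact = complex(x0, y0) * complex(math.cos(alpha), -math.sin(alpha))
    print(cordic_rotate(x0, y0, -alpha), (exact.real, exact.imag))
```

Because the sequence of rotation directions depends only on the angle, two vectors that must be rotated by the same angle can share one control sequence, which matches the remark that both multiplications use the same sequence of shift-add operations.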
The proposed structure for a two-factor DHT, where $N_1 = 4$ and $N_2$ is prime to 4, is shown in Fig. 3A. It consists of two sections. The first section of the structure consists of 12 complex adders and three delay cells (depicted in Fig. 3B). Four rows of $w(k_1, n_2)$ are pipelined out from it and are queued through a buffer and a channel selector, to be further pipelined in proper order
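One reason a factor of 4 is convenient is that a 4-point DHT needs no multiplications, since cas(2πkn/4) only takes the values +1 and −1. The sketch below illustrates this with real inputs and eight additions. It is not a model of the 12-complex-adder section of Fig. 3A, and the name `dht4` is hypothetical.

```python
def dht4(x0, x1, x2, x3):
    """Unscaled 4-point DHT using additions and subtractions only."""
    a, b = x0 + x2, x0 - x2   # even-indexed inputs
    c, d = x1 + x3, x1 - x3   # odd-indexed inputs
    # Outputs in the order X(0), X(1), X(2), X(3).
    return a + c, b + d, a - c, b - d


if __name__ == "__main__":
    print(dht4(1, 2, 3, 4))
```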
IEE PROCEEDINGS-G, Vol. 140, No. 2, APRIL 1993
Throughput and hardware considerations
As mentioned earlier, during each recursion, a PE performs two complex multiplications, each of which is implemented through L iterations. Each iteration comprises one phase rotation followed by a scaling. Furthermore, either a rotation or a scaling involves two additions and two shifts. All these additions and shifts are performed by two pairs of parallel adders and shifters if cordic circuit I (Fig. 1a) is used by the PEs. The total computational load per pair of adder and shifter then comes out to be 4L additions and 4L shifts.
To maintain a precision of L correct bits at the output, the input data needs an additional $\log_2 L$ bits as guard bits to account for the internal cordic arithmetic [9]. Assuming that addition of two single bits takes one microcycle, a total of $(L + \log_2 L)$ microcycles are necessary for the addition of two $(L + \log_2 L)$-bit words. It may be noted here that the additions corresponding to one multiplication and the shift operations corresponding to the other multiplication are performed simultaneously. Hence, no extra time is required for the shift operations. The total time required per recursion in cordic circuit I may then be calculated to be $4L(L + \log_2 L)$ microcycles.
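The operation count behind the per-recursion figure can be written out step by step. The sketch below is a bookkeeping aid under assumptions: the function name and the `adder_pairs` parameter are hypothetical, and the number of guard bits is rounded up to an integer when L is not a power of two, which the paper does not specify.

```python
import math


def microcycles_per_recursion(L, adder_pairs=2):
    """Microcycles per recursion for L-bit output precision.

    Counts: 2 complex multiplications x L iterations x 2 steps
    (rotation and scaling) x 2 additions per step = 8L additions,
    shared among `adder_pairs` adder/shifter pairs. Shifts overlap
    with additions and cost no extra time.
    """
    guard_bits = math.ceil(math.log2(L))   # assumption: rounded up
    word_length = L + guard_bits           # one microcycle per bit of an addition
    total_additions = 2 * L * 2 * 2
    additions_per_pair = total_additions // adder_pairs
    return additions_per_pair * word_length


if __name__ == "__main__":
    # Two adder/shifter pairs, as in cordic circuit I: 4L(L + log2 L).
    print(microcycles_per_recursion(16))
```

For L = 16 this gives 4·16·(16 + 4) = 1280 microcycles per recursion with two adder/shifter pairs.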
It is stated in Section 3 that the first N-point DHT requires [(5/8)N + 9/2] recursions (or time steps) and the successive N-point DHTs require (N/2 + 2) recursions.
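Thus the total time required to compute an N-point DHT is given by
$$4L(L + \log_2 L)\left(\frac{N}{2} + 2\right) = 2(N + 4)L(L + \log_2 L) \ \text{microcycles}$$
If the transform length $N \gg 4$, this may be approximated to
$$2NL(L + \log_2 L) \ \text{microcycles}$$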
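A short sketch can combine the recursion counts with the per-recursion time for cordic circuit I. It assumes that the total time refers to the steady-state case of (N/2 + 2) recursions per transform, which is the reading consistent with the N ≫ 4 approximation. All function names are hypothetical, and the guard-bit count is rounded up to an integer as an assumption.

```python
import math


def recursions_first(N):
    """Recursions for the first N-point DHT: (5/8)N + 9/2, an integer for N = 4*odd."""
    return (5 * N + 36) // 8


def recursions_successive(N):
    """Recursions for each successive N-point DHT: N/2 + 2."""
    return N // 2 + 2


def total_microcycles(N, L):
    """Steady-state time per transform with cordic circuit I (assumed reading)."""
    per_recursion = 4 * L * (L + math.ceil(math.log2(L)))
    return recursions_successive(N) * per_recursion


def approx_microcycles(N, L):
    """Large-N approximation 2*N*L*(L + log2 L)."""
    return 2 * N * L * (L + math.ceil(math.log2(L)))


if __name__ == "__main__":
    for N in (12, 1028):
        exact = total_microcycles(N, 16)
        print(N, recursions_first(N), exact, approx_microcycles(N, 16) / exact)
```

For N = 12 the approximation underestimates the exact count by a quarter, while for N = 1028 the two agree to within half a percent, which shows why the approximation is stated only for large N.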
In cordic circuit II (Fig. 1b), the additions and shifts are performed by four pairs of parallel adders and shifters.
Therefore, the duration of each time step in cordic circuit II is $2L(L + \log_2 L)$ microcycles.
Function of complex adders and delay cells
The DFT of real-valued data, and hence the DHT, may also be computed by the DFT structure [5] using the following two methods.
Method
The hardware requirement of the first section of the proposed structure is the same as that of the DFT structure [5] for Method 1. For Method 2, however, three complex adders and a delay cell may be avoided in the first section of the structure [5], because only three column DFTs are required to be computed in this case at the output stage. Moreover, the amount of hardware used by the second section of the proposed structure, using cordic circuit I and cordic circuit II, is nearly the same as that of the DFT structure for real-valued data by Method 1 and Method 2, respectively.
Results and discussion
An efficient systolic structure is proposed for computing the DHT by prime factor decomposition for transform length $N = N_1 \times N_2$, where $N_1 = 4$ and $N_2$ is prime to 4. A recursive algorithm is also proposed to compute the N-point DHT by a systolic array of (N + 1)/2 cordic processing elements. The computation times of the prime factor DHT by the proposed structure, and the times by the DFT structure [5] for real-valued data using two different methods, are plotted in Fig. 4. It is found that the proposed structure requires significantly less computation time than the DFT structure. It is also observed that the hardware requirement of the proposed structure, using cordic circuit I and cordic circuit II, is nearly the same as that of the DFT structure. The higher throughput obtained by the proposed DHT structure over the corresponding DFT structure [5] for real-valued data is mainly due to the scheme of prime factor decomposition, and to the efficient use of the cordic circuits by the recursive DHT algorithm.
