This paper describes a flexible architecture for implementing a new fast computation of the discrete Fourier and Hartley transforms, which is based on a matrix Laurent series. The device computes either transform, chosen by a single selection bit. The hardware structure and synthesis results are presented; the design computes a 16-point fast transform in 65 ns on a Xilinx Spartan 3E device.
INTRODUCTION
Fourier transforms play a major role in the fields related to Signal Processing [1, 2]. The successful application of transform techniques is mainly due to the existence of the so-called fast algorithms [3]. This paper proposes a new fast algorithm, and its hardware implementation, for computing the discrete Fourier (DFT) [4] and Hartley (DHT) [5] transforms. The discrete Fourier transform (DFT) is defined by Equation (1).
The Discrete Hartley Transform (DHT) is defined by Equation (2) [6] .
From equations (1) and (2), it is apparent that the DFT can be computed from the DHT by Equation (3).
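The relation between the two transforms can be checked numerically. The sketch below is illustrative only: it assumes the common unnormalized definitions (DFT kernel exp(-j2πkn/N), DHT kernel cas(θ) = cos θ + sin θ), which may differ from the paper's Equations (1) and (2) by a scale factor, and the function names are hypothetical.

```python
import cmath
import math

def dft(x):
    """Direct O(N^2) DFT, assumed kernel exp(-j*2*pi*k*n/N), no normalization."""
    n = len(x)
    return [sum(x[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))
            for k in range(n)]

def dht(x):
    """Direct O(N^2) DHT, assumed kernel cas(t) = cos(t) + sin(t)."""
    n = len(x)
    return [sum(x[i] * (math.cos(2 * math.pi * k * i / n)
                        + math.sin(2 * math.pi * k * i / n)) for i in range(n))
            for k in range(n)]

def dft_from_dht(h):
    """Recover the DFT of a real sequence from its DHT.

    Real part: even part of H, (H[k] + H[N-k]) / 2.
    Imaginary part: minus the odd part of H, -(H[k] - H[N-k]) / 2.
    """
    n = len(h)
    out = []
    for k in range(n):
        hk, hm = h[k], h[(n - k) % n]
        out.append(complex((hk + hm) / 2, -(hk - hm) / 2))
    return out
```

Under these assumptions, `dft_from_dht(dht(x))` agrees with `dft(x)` up to rounding error for any real sequence `x`.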
In 1965, J. W. Cooley and J. W. Tukey introduced a revolutionary idea which later became known as the Fast Fourier Transform (FFT) [4]. The FFT is a milestone in the theory of algorithms [7] [8] [9].
In 1984, R. N. Bracewell introduced an algorithm for performing the discrete Hartley transform (DHT) [4]. With the advent of VLSI and the development of digital signal processors (DSPs) to implement signal processing techniques, the DFT became the most attractive tool for spectrum evaluation [10] [11] [12]. The falling cost of DSPs and the astonishing capacity of up-to-date processors (e.g., dozens of GFLOPS, giga floating-point operations per second) [13] are making real-time applications feasible for several kinds of signals. Discrete transforms have therefore become a widespread tool in spectral analysis [14]. A lucid tutorial review of fast Fourier techniques is available in [15] [16]. In 2000, an algorithm based on multilayer decomposition to calculate the DFT via the DHT was introduced [17] [18]. This paper proposes a flexible implementation of the FFT and the Fast Hartley Transform (FHT), using a new approach derived from a matrix-based Laurent series expansion [19].
THE FFT/FHT ALGORITHM
The fast algorithm is written according to the following DFT matrix decomposition, where ℜe(M) and ℑm(M) denote the real and imaginary parts of the matrix M, Equations (3) and (4), respectively [20].
where the matrices (4) and (5), the DHT components can be computed by Equation (3).
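As a rough illustration of how the real and imaginary parts of the DFT matrix yield both transforms, the following sketch uses plain O(N²) matrix-vector products. It is not the Laurent-series fast algorithm of this paper; the kernel convention exp(-j2πkn/N) and the names `dft_matrix_parts`, `matvec`, `transform` and `select_dht` are assumptions made for the example.

```python
import math

def dft_matrix_parts(n):
    """Return (Re(F), Im(F)) for the assumed DFT matrix F[k][i] = exp(-j*2*pi*k*i/n)."""
    re = [[math.cos(2 * math.pi * k * i / n) for i in range(n)] for k in range(n)]
    im = [[-math.sin(2 * math.pi * k * i / n) for i in range(n)] for k in range(n)]
    return re, im

def matvec(m, x):
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]

def transform(x, select_dht):
    """Compute the DFT (select_dht=False) or the DHT (select_dht=True) of a real x.

    Both outputs come from the same two products Re(F)x and Im(F)x:
    DFT = Re(F)x + j*Im(F)x, and DHT = Re(F)x - Im(F)x.
    """
    re, im = dft_matrix_parts(len(x))
    a = matvec(re, x)
    b = matvec(im, x)
    if select_dht:
        return [p - q for p, q in zip(a, b)]
    return [complex(p, q) for p, q in zip(a, b)]
```

The point of the sketch is that a single selection flag changes only the final combination step, not the underlying products.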
DESIGN METHODOLOGY AND ARCHITECTURE
The design was carried out through the steps: specification, VHDL description, behavioral simulation and synthesis.
Specification
The project aims at the production of a fast DFT/DHT computer, for real sequences with blocklength 16, every component of which is represented by a 16-bit word. The computations are to be made using fixed-point arithmetic (7 bits). Every component of the output sequence is represented by a 32-bit word (16 bits for its real part and 16 bits for its imaginary part). With a single bit selection, the user can choose which transform is to be computed.
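The number format can be sketched in software. The example below assumes that "7 bits" means 7 fractional bits inside a signed 16-bit two's-complement word, and that out-of-range values saturate; neither detail is stated in the specification, and the helper names are hypothetical.

```python
FRAC_BITS = 7    # assumption: 7 fractional bits
WORD_BITS = 16   # word length given in the specification

def to_fixed(value, frac_bits=FRAC_BITS, word_bits=WORD_BITS):
    """Convert a float to a signed fixed-point integer, saturating at the word limits."""
    raw = round(value * (1 << frac_bits))
    lo = -(1 << (word_bits - 1))
    hi = (1 << (word_bits - 1)) - 1
    return max(lo, min(hi, raw))

def from_fixed(raw, frac_bits=FRAC_BITS):
    """Convert a fixed-point integer back to a float."""
    return raw / (1 << frac_bits)
```

With these assumptions the resolution is 1/128, so a value inside the representable range is reproduced to within half a step (1/256).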
VHDL Description
The VHDL description of the device was generated with the aid of MATLAB™ (Simulink), according to Equations (5) and (6).
Behavioral Simulation
Using the Xilinx ISE tool, the VHDL description was compiled and syntax errors were removed. Testbench files were then generated, and the behavioral simulation was run in the same tool to check the output data.
Synthesis
At this step, the VHDL code was analysed and optimized by the synthesis tool, in order to create an efficient implementation of the device. A Register Transfer Level (RTL) schematic was generated. The implementation tool then generates the file to be programmed into the chosen device, namely the Spartan 3E xc3s500e-5-fg320. In the post-synthesis simulation, the complete computation finishes in 65 ns, and all results agree with the previous behavioral simulation. The main characteristics of the hardware are: number of slices, 1611 out of 4656 (34%); number of slice flip-flops, 656 out of 9312 (7%); number of 4-input LUTs, 2894 out of 9312 (31%); number of IOs, 56; number of bonded IOBs, 56 out of 232 (24%); number of MULT18X18SIOs, 12 out of 20 (60%); and number of GCLKs, 2 out of 24 (8%).
Architecture
The device architecture is based on two main blocks, the memory management block and the core block, as shown in Fig. 1. The memory management block stores the components of the input signal and of the output transform. The core block is responsible for calculating the DFT coefficients, according to Fig. 2. From these, the memory management block selects and computes the desired transform. Once the input is stored in memory, the device calculates the transform chosen by the selection bit (DFT/DHT). The results are stored in memory as follows. For the DFT, the first two bytes of each 32-bit output word hold the real component and the last two bytes hold the imaginary component. For the DHT, only the last 16 bits are used, and they hold the (real) transform component.
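The output word layout described above can be mimicked in software. This sketch assumes that "first two bytes" means the most significant 16 bits of the 32-bit word, that components are signed two's-complement integers, and that the unused upper half is zero in DHT mode; the function names are hypothetical.

```python
def _to_u16(v):
    """Encode a signed 16-bit integer as an unsigned 16-bit field."""
    if not -32768 <= v <= 32767:
        raise ValueError("component does not fit in 16 bits")
    return v & 0xFFFF

def _to_s16(u):
    """Decode an unsigned 16-bit field as a signed integer."""
    return u - 0x10000 if u & 0x8000 else u

def pack_dft(re, im):
    """DFT mode: real part in the upper 16 bits, imaginary part in the lower 16 bits."""
    return (_to_u16(re) << 16) | _to_u16(im)

def pack_dht(h):
    """DHT mode: the real component occupies the lower 16 bits only (upper half assumed zero)."""
    return _to_u16(h)

def unpack(word):
    """Split a 32-bit word into its (upper, lower) signed 16-bit halves."""
    return _to_s16((word >> 16) & 0xFFFF), _to_s16(word & 0xFFFF)
```

For example, `pack_dft(1, -1)` yields `0x0001FFFF`, and `unpack` returns `(1, -1)` for that word.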
SIMULATION RESULTS
The implementation was simulated using the Xilinx ISE tool to verify the accuracy of the device, and the results were compared with those obtained from Simulink (Table 1). The output DFT sequences of the device simulation were exactly the same as those obtained by Simulink™ (Column 3, Table 1). Both differ slightly from the true DFT sequence computed by an internal MATLAB™ routine (Column 3, in brackets). The quantization error (0.3%) is due to the use of fixed-point instead of floating-point arithmetic. Nevertheless, this error magnitude is acceptable for many applications.
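The effect of fixed-point coefficients on accuracy can be explored with a small experiment. The sketch below quantizes only the cosine and sine coefficients to 7 fractional bits (an assumed reading of the specification) and measures the relative error against a floating-point DFT. It uses a direct O(N²) sum, not the device's algorithm, so it is not expected to reproduce the figure reported above; all names are hypothetical.

```python
import math

FRAC_BITS = 7  # assumption: coefficients keep 7 fractional bits

def quantize(v, frac_bits=FRAC_BITS):
    """Round v to the nearest multiple of 2**-frac_bits."""
    scale = 1 << frac_bits
    return round(v * scale) / scale

def dft(x, coeff=lambda v: v):
    """Direct DFT of a real sequence; coeff() is applied to every cos/sin coefficient."""
    n = len(x)
    out = []
    for k in range(n):
        acc = 0j
        for i, xi in enumerate(x):
            angle = 2 * math.pi * k * i / n
            acc += xi * complex(coeff(math.cos(angle)), coeff(-math.sin(angle)))
        out.append(acc)
    return out

def relative_error(x):
    """Euclidean-norm error of the quantized-coefficient DFT relative to the float DFT.

    x must not be all zeros, otherwise the reference norm is zero.
    """
    ref = dft(x)
    approx = dft(x, coeff=quantize)
    num = math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(approx, ref)))
    den = math.sqrt(sum(abs(b) ** 2 for b in ref))
    return num / den
```

Calling `relative_error` on a 16-point test sequence gives a quick estimate of how much of the total error is attributable to coefficient rounding alone.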
CONCLUDING REMARKS
This paper presented the design and implementation, on the xc3s500e-5-fg320 device, of an algorithm for the fast computation of the DFT/DHT coefficients of a real sequence with blocklength N=16. The device computes either transform, chosen by a single selection bit. The device computed the DFT/DHT in 65 ns, which is acceptable for applications such as audio, speech processing, biomedical signal processing and DSL.
Fig. 3 Results for behavioral simulation in operation DHT using Xilinx ISE tool.
Fig. 4 Results for behavioral simulation in operation DFT using Xilinx ISE tool.
