Following Moore's law, current submicron technologies have allowed extraordinary integration densities in digital circuits. However, as processes scale down, uncertainty increases (voltage, temperature, noise, coupling etc.) and the design process becomes more complicated, especially for ASICs (application specific integrated circuit), where margins are extremely tight. Moreover, mask costs have reached a critical point, dominating the manufacturing process and requiring high financial risk.
I. INTRODUCTION
Following Moore's law, current submicron technologies have allowed extraordinary integration densities in digital circuits. However, as processes scale down, uncertainty increases (voltage, temperature, noise, coupling etc.) and the design process becomes more complicated, especially for ASICs (application specific integrated circuit), where margins are extremely tight. Moreover, mask costs have reached a critical point, dominating the manufacturing process and requiring high financial risk.
In this context, field-programmable gate arrays (FPGAs) offer significant advantages at a suitable low cost [1] . First, the flexibility of FPGAs provides space to designers to modify implementations until the very last moment, or even correct mistakes once the product has been released. Second, the verification of a design mapped into an FPGA is very simple, contrasting with the huge verification effort required by ASICs. Finally, even though FPGAs are not as efficient as ASICs in terms of performance, area or power, it is true that now they can provide better performance than general purpose CPUs or digital Refereeing of this contribution was handled by J. Lee.
0018-9251/08/$25.00 c°2008 IEEE signal processor (DSP) based systems. This fact, in conjunction with the enormous logic capacity allowed by today's technologies, makes FPGAs an attractive choice for implementation of complex digital systems. Moreover, due to inclusion of digital signal processing capabilities [2] , FPGAs are now expanding their traditional prototyping roles to help offload computationally intensive digital signal processing functions from the processor.
In the signal processing field, electronic warfare receivers are a good example of complex systems with stringent requirements [3, 4] . The requirements for these receivers are wideband frequency coverage, high sensitivity and dynamic range, high probability of intercept, simultaneous signal detection, excellent frequency resolution, and real time operation. A classical receiver which accomplishes these requirements is a channelized receiver which separates signals according to their frequencies. Recent advancements in analog-to-digital converters (ADC) technology and in the speed of digital processors have made it possible to design relatively wideband digital channelized receivers [4, 5, 6, 7, 8, 9] . Fundamental problems in the analog receivers such as the imbalance between receiver channels can be surpassed by migrating to digital implementations. Digital implementations are inherently reliable and accurate. However, broadband digital channelized receivers, mainly based on fast Fourier transform (FFT) related processing, require intensive computation for real time applications. Typically, the system throughput of many signal processing algorithms can be improved by exploiting concurrency in the form of parallelism and pipelining [10] . FPGAs allow for true parallel processing by supporting multiple simultaneous threads of execution. Moreover, FPGAs have grown over the past decade to the point where broadband real time operation digital channelized receivers can be implemented on a single FPGA device.
Even though FPGAs offer new possibilities to the system designer, additional support is required to explore the design space because new design magnitudes must be considered. Typical hardware variables such as area, clock frequency, or power dissipation should be evaluated together with classical signal processing issues such as throughput, detection performance, and dynamic range. In this paper we present an in-depth study on the hardware implementation of FFT pipelined architectures suitable for broadband digital channelized receivers with continuous flow of input samples. Our purpose is to provide an analysis and comparison of existing FFT algorithms to help the signal processing engineer when selecting the best hardware architecture for a given application. Selection can be based on a broad range of parameters at two different levels: low level hardware implementation details (bitwidth, bit truncation, area, clock frequency, etc.) or high level application details (FFT radix and number of points, detection capabilities, etc.). Furthermore, the use of simplified FFT algorithms such as the monobit FFT [11, 12] is also considered. The exploration of all the implementation possibilities is supported by a high level design exploration tool especially developed for this purpose [13] , which alleviates the system engineer's work by hiding low level hardware details.
Important work has been carried out on hardware FFT architectures, in particular for FPGA implementation [14, 15, 16] , but all those previous works represent a partial approach to the problem. In this paper we provide a global view of the problem, paying attention not only to the design of the FFT architectures, but also analyzing all important parameters involved in their hardware implementation on FPGAs and studying the performance they provide from the detection point of view.
The structure of the paper is the following. A general description of the FFT algorithm and its basic implementation is described in Section II, while Section III describes the selected architectures in this work: feedback and feedforward. Experimental results are analyzed in Sections IV and V. Finally, the implementation of a channelized receiver based on the proposed architectures is described in Section VI and some conclusions are drawn.
II. THE FFT ALGORITHM
In this section we review the main equations of the discrete Fourier transform (DFT) to point out their mapping on hardware. The N-point DFT (DFT N ), X N (k) for a given sequence x N (n) is defined as
where
The FFT is based on the decomposition of a sequence of the DFT into lower order computations, which results in a reduction in the number of operations. In this way the complexity of the computations is reduced from O(N 2 ) to O(N log 2 N). There are many algorithms that compute the FFT. The one proposed by Cooley-Tukey [17] , which provides a very regular hardware implementation, is based on the successive decomposition of a DFT with length N into R DFTs with length N=R. R, known as radix, is a power of 2 and as a consequence the length of the transform will have a set of discrete values N = R S , where S corresponds to the set of successive decompositions required for the whole transform (S stages). The decomposition continues until the length of the sub-DFT matches the radix (the lowest order sub-DFT).
There are two basic approaches to implement the algorithm: decimation in time (DIT) or decimation in frequency (DIF). The difference between them is the way the algorithm performs the decomposition of the DFT into lower order DFTs, resulting in a different sequence of operations. For example, in the case of radix 2, the FFT of a sequence with length N (N = 2 S ) can use the decomposition of this sequence into two sets with odd and even samples (DIT implementation) or two sets with the first and the second half of the samples (DIF implementation). Even though this work focuses on DIF implementations, it could be easily extended to DIT implementations.
In the case of a DIF implementation with radix 2 the resulting decomposition is achieved 
Several issues can be outlined when analyzing the equations in (2) . First, the input samples can be complex numbers. Second, the most internal operands of the middle equations (between brackets) represent a mixture of samples known as butterfly, named from the shape of its flowgram. Third, we need to implement after that mixture the rotation of samples made by the corresponding twiddle terms. Once the samples are suitably mixed and rotated we obtain two half length sequences which can be manipulated separately. The previous decomposition procedure can be repeatedly applied to both sequences until the sub-DFT with the lowest order is reached (order 2 in this case) and the algorithm is completed.
To maximize the throughput of the implemented FFT, measured as processed Msamples/s, the simplification of the computational complexity of the FFT is required. This can be accomplished by avoiding complex multiplications: using a monobit kernel for the FFT [11, 12] . With this approximation the rotators always use the following angles: 0, ¼=2, ¼, or ¡¼=2. Hence, rotators are not necessary and this results in significant reduction in area and improvement in speed. However, this increase in the throughput is obtained at the expense of performance degradation. In the application described in this paper, there exists a degradation on the detection probability and dynamic range of the channelized receiver, as analyzed in Section V (see also [12] for more details).
III. HARDWARE ARCHITECTURES FOR THE FFT
There are many hardware implementations for both DIF and DIT algorithms. For instance, we can choose between digit-serial or bit-parallel arithmetic, or we can select between pipelined or iterative implementations. In the case of data-oriented applications presenting a continuous flow of samples, the best architectures are those that favor speed over area. The implementations that better fit these requirements are bit-parallel and pipelined architectures, where the processing is performed in several cascaded stages, as can be seen in Fig. 1 . We have chosen two main groups of FFT architectures, representing opposite points in the area-performance design space: feedback (FB) and feedforward (FF) architectures. Architectures with FB provide the output flow at the clock frequency (one sample per clock cycle), because the feedback structure allows the reuse of some elements present in every stage. On the other hand, FF structures provide a higher throughput (R samples per clock cycle, R being the radix) because reuse is not applied and higher concurrency can be obtained, paying the price of a significant area overhead.
The general architecture of a pipelined implementation is based on a set of stages (S in Section II) and each stage performs the decomposition of the input sequence into sub-DFTs, which are implemented in later stages. Every stage is characterized by the radix, which also sets the number of required stages to process an input sequence of length N.
There are three basic elements in each stage of both architectures, depicted in Fig. 2 : a memory where data between stages are stored, a butterfly where the mixture of samples is accomplished, and finally an element to multiply samples by the corresponding twiddles. The architectures differ in the way these elements are interconnected and how the sample flow is controlled. The dashed line that appears in Fig. 2 represents the FB loop of the FB architecture.
Once all stages have been accomplished (finishing the lowest order sub-DFT, which is of size radix), the processing can be considered finished, even though there is still a reordering phase left. This is an optional task, because output samples are not completely unordered, as they have a known sequence that can be considered as input for the next processing step.
In the following we describe in detail the basic elements used in both architectures: butterfly and rotators.
A. Butterfly
In pipelined architectures, butterflies can be implemented as a set of log 2 (R) stages of adders and constant multipliers. The simplest implementation of this element is the radix-2 butterfly, which only requires two components: an adder and a subtracter, both complex. In this case, the twiddle terms are trivial and, consequently, there is no need of extra components to perform the rotation after the mixture of samples.
For implementations with other radices, most samples do not need rotations since their corresponding twiddle is 0 ± , as is the case of radix-2 computation. Other rotations may be easy to compute, as is the case of radix 4 where the angle to rotate is ¡90 ± and can be implemented as a swap of the sample components with sign change. Other angles like ¡45 ± or ¡135 ± , which are present in radix-8 architectures, can be implemented by two multipliers by real constants. For radix R > 8 butterflies, nontrivial twiddles appear, with the number of these nontrivial twiddles increasing with the radix. In these cases complex multipliers must be used and the butterfly implementation requires bigger area. In fact, there is an exponential increase of area with the radix. In general, we can say that the area required by a radix-R butterfly is the area of the basic butterfly multiplied by R=2 ¢ log 2 (R). This value does not consider the area of the pipeline registers or the multipliers. This increase in area precludes the implementation of standard FFTs with radix larger than 8, even though higher radix values reduce the number of stages. This is not the case of the monobit FFT, which allows the use of higher radices because rotations become trivial.
Radix 4 presents the twofold advantage of including simple components and presenting a reduction of the number of stages when compared with radix-2 implementations. This combination of advantages makes this value of radix the optimum for most applications.
An additional important issue to consider in the butterfly structure is that for every new stage we should decide the bitwidth to manage the overflow. This is due to the adders included in the butterflies. If we keep the bitwidth fixed, the easiest approach is to truncate the output of the adders removing the less significant bit. This divides the output by 2. With this approach the final area is reduced, but the performance of the total FFT degradates. This fact is thoroughly analyzed in Sections IV and V.
B. Rotators
Rotators are critical components in the FFT architecture because they require a significant percentage of the total area. They are mainly composed of a first element that multiplies data by the twiddles and a memory that stores the twiddles. In our case we have implemented the rotator using the CORDIC algorithm [18] , which allows to perform the previously mentioned multiplication by the twiddles without multipliers. This algorithm performs the rotation of a complex vector by means of a series of shifts and additions. Every shift rotates the vector components a given angle from a set of elemental angles. This algorithm presents an intrinsic gain of approximately 1.647 [18] . Therefore, to keep the dynamic range of the input samples, this element would have to increase the data bitwidth by one bit. As was explained for the butterfly, this extra bit can be truncated after rotation takes place or it can be kept. It is important to remark that overflow is avoided in any case.
The CORDIC algorithm takes as input complex components, given by the butterfly, and performs the rotation of a given angle through several rotations using a set of well-defined angles (elemental angles). For the FFT, the angles (twiddle) are known in advance, so for our implementation instead of storing these angles, we compute the rotations of these elementary angles that the twiddle will produce (one bit per elemental angle), and store this information in memory. The amount of memory required is approximately the same as if we store the twiddle angles, but the implementation of the algorithm can be simplified, because now we do not need the hardware that controls the rotations. This implementation saves area and results in a higher clock frequency.
C. Feedback Architecture
The FB architecture, shown in Fig. 3 , is composed of a first memory that stores the input samples, followed by a butterfly whose output is connected to a single rotator that multiplies by the twiddles. In this implementation, given that the rotator is a shared component, part of the butterfly outputs will be fed back to the memories to allow the use of the rotator all the time. Therefore, there will be two working modes, depicted in Fig. 4 . A first mode is related to the arrival of samples from the previous stage while samples coming from previous processing are extracted from memory (mode 1). During mode 2 the samples are processed and simultaneously data coming from the butterfly are stored in the memory because the rotator is busy. This working procedure is illustrated in Fig. 4 , where we can see how two input sequences of 1024 samples (xy0 and xy1) enter the FB architecture and are sequentially processed by the memory, butterfly, and rotator of the first stage. A similar processing is accomplished for the lower order sub-DFTs generated through the different stages (xy0:1, xy0:2 :::). Output results are labelled as XY0.
It is important to remark that the 1024 samples need 1024 clock cycles to be processed.
An FB implementation of the FFT with radix R requires R ¡ 1 blocks of dual port memory to store samples both coming from the input or fed back from the butterfly. Following this structure, every memory is designed to store N=R samples with N being the samples coming from the previous processing phase. During the first processing mode (mode 1 in Fig. 4 ) data coming from the previous stage are loaded through one port while the other port is used to dump the currently processed data. When sample N(R ¡ 1)=R arrives, the second phase starts, dumping a piece of data from each block to be processed in the butterfly while FB data are loaded from the butterfly to be processed later in the rotator (mode 2). This particular memory management not only allows temporary storage of data, but also the reordering of samples to be processed in the current stage.
D. Feedforward Architecture
In the FF architecture the samples can go ahead once a stage is processed because there are several rotators (see Fig. 5 ). Following the memory a butterfly implements every low order DFT, and next an array of rotators (one per sub-DFT produced at the output) is required. Their outputs will feed the next stage in the chain. Actually, the performance provided by this architecture is R samples/clk, being R the radix. As can be seen in Fig. 6 , the concurrent execution performed by this architecture allows the processing of an N-point FFT in N=4 cycles because a radix 4 is used in this example (in the figure, a 1024-point FFT is processed in 256 cycles). In this case, since data coming from the different sub-DFTs are passed in parallel, a different memory scheme is necessary, because parallel storage is required together with a reordering of data. We have used a matrix memory of R £ R with ping-pong structure. As can be seen in Fig. 6 , a memory called A is used to store input samples while a second memory B extracts data to feed the butterfly. During the next time slot, memories A and B exchange their roles allowing continuous processing.
This architecture requires R ordered input samples. Moreover, the different sub-DFTs generated at the output are processed by the rotators in parallel, and sent as a block to the following stage.
Additionally, in this architecture every CORDIC element (R ¡ 1 in a radix R implementation) requires a memory with the sequence of rotations to perform per twiddle.
E. Overview of the Proposed Architectures
To summarize the description of the FFT architectures under study, Table I shows their requirements in terms of basic elements (memory, rotators, and butterflies expressed as adders) for different values of radix R. Memory size is measured by the number of samples that it holds. As can be seen, the resource requirements are higher for the FF architecture than for the FB one. This is due to 
Note: N is number of points and S = log R N is number of stages. the higher degree of parallelism presented by the FF architecture, which additionally provides better performance in terms of throughput. We should remember that the FB architecture is able to process a sample per clock cycle, while the FF architecture processes R samples per clock cycle, being R the radix.
IV. EXPERIMENTAL RESULTS
All experimental results have been obtained targeting Xilinx FPGAs, (in particular the VIRTEX-II xc2v4000-6), using as development environment Xilinx ISE 7.1.
The various proposed FFT architectures and the large parameter set that can be used for their configuration provide a very wide experimental outcome. To quickly explore all the possibilities offered by the different architectures we have developed a design exploration tool [13] , which provides support to select and configure from among the many parameters that characterize every single FFT architecture. Once a given architecture has been selected, the tool provides quick estimates on basic parameters and functions that help the user. The key parameters under analysis in all these architectures are:
1) bitwidth of the input samples, 2) number of points N, 3) number of stages S, 4) radix of the implementation R (power of 2), 5) stage scaling (truncation in the butterflies).
The performance of the different solutions is analyzed in terms of: 1) area (slices and memory blocks used 1 ), 2) latency, time to process an FFT (from start to end), 3) clock speed (MHz) and throughput (Msamples/s), 4) power.
In order to organize the analysis of the experimental results we will perform it in three different scenarios. The first and second scenarios are devoted to the analysis of FB and FF architectures for both conventional and monobit FFTs, respectively. The third scenario studies the power dissipation of all architectures.
Finally a last scenario is devoted to analyzing the results provided by the FPGA implementations of FFTs when used in a digital channelized receiver, and due to the importance of these results, Section V is devoted to this analysis. In the first scenario we present a comparative study of the implementation of the FB and FF architectures for the conventional FFT. Table II summarizes the synthesis results we have obtained for different implementations when exhaustively exploring the design space. In this way, we have generated 16, 64, 256, and 1024-point FFTs with both FB and FF architectures. The bitwidth was initially fixed to 8 and 16 bits and we considered both truncation and no truncation through the stages. Additionally, given that the FF architecture allows the parameterization of the radix, we have implemented radix 2, 4, and 8, while the FB architecture has been implemented for radix 4. Moreover, implementations with a large number of points have been generated with and without usage of BRAM.
In Figs. 7 and 8 we can see the area-performance trade-off that can be obtained for 8-bit designs in the FF and FB architectures with and without truncation. From a first analysis of Table II and these graphs we can draw the following conclusions.
1) It can be clearly seen that the FB architecture requires lower area than the FF implementation due to resource sharing, especially when computing the FFT with a high number of points. On the other hand, the FF architecture processes several samples in parallel, which leads to a higher throughput.
2) As expected, all implementations with no truncation of bits in intermediate stages always present bigger area and lower speed than the ones with truncation.
3) The influence of the radix can be analyzed studying the FF architecture. As can be seen in Table II , except for the 1024-point implementation, the area of architectures with radix 2 and 4 is more or less the same. However, performance results are completely different. Even though radix-2 implementations present a slightly higher working frequency, radix-4 implementations have a higher throughput (almost double), due to the higher parallelization of the operations. As is well known, the final speed obtained in an FPGA depends on the particular place and route that is carried out, which directly impacts on the length of the critical path. In this sense, it is interesting to remark that for larger transforms the resulting frequency is smaller, due to the higher complexity of interconnections.
4) The latency values of Table II refer to the total computation of an FFT. Furthermore, other computations can run simultaneously in the FPGA, which cannot be done in a DSP [19] .
5) The use of BRAM blocks has only been worthwhile in those examples that require large memory sizes: 512 and 1024-point FFTs. Actually, the FB architecture obtains clear benefits from the use of BRAMS, but the FF architecture can only take advantage of them for radix 2, when the required memory size is bigger than the threshold established to use BRAM instead of distributed memory. 2 
1) Comparative Study with Other Implementations:
There are only a few tools able to generate FFT cores. The Xilinx LogiCore is a well-known example [20] , but the degree of parameterization of this tool is significantly reduced when compared with our approach. Another interesting approach, SPIRAL [21] , is based on a different DFT architecture, the Pease FFT, which provides a parallelized implementation that can be considered in between the architectures presented here. We have generated with these tools different implementations of a 1024-point FFT with 16 input bits and no growth through the stages. In Table III , we can see the experimental results obtained from Xilinx Core Generator (Xilinx), our FB and FF architectures, and two experimental results from Spiral 2 We have considered in our results that BRAMs should be filled more than 50%, otherwise they could be used by other computations in the FPGAs. Nevertheless, the user of the FFT generation tool can modify this threshold as convenient. (Spiral P4 TH2 and P4 TH128, with different degree of paralellism).
As can be seen in Table III these architectures have very few common points and take advantage of the FPGA resources in different ways. All implementations make use of slices to implement logic functions, which could be a general measure of area. However, specific components can also be used, as is the case of BRAM memories or built-in multipliers, and this results in important slice savings. It is therefore necessary to have a uniform measure to compare the different area and performance results. We have decided to measure all area-related issues in slices, which is the only component present in all FPGA families. We should not forget that the idea is to implement the whole system on the FPGA, being therefore convenient to leave special purpose resources (RAM, multipliers:::) for other design modules. To carry out this measurement, we need to know the equivalence of built-in components into slices. In this sense, to establish the area required by a built-in multiplier, we have implemented with logic a 16 £ 16 pipelined multiplier, which occupies around 350 slices. We have not implemented the 18 £ 18 multiplier integrated in the Xilinx devices because it is not always fully exploited.
Regarding BRAMs, since they are large memory blocks that can be configured with different utilization, we have established a capacity value of 50% to find out the equivalent measure in slices. With this capacity value we have obtained an area use of 850 slices. Finally, the best way to qualify a given architecture is to consider not only area, but also performance. In this sense we have defined a new measure called KSsec/slice which provides a ratio between performance (Ksamples per second) and area (slices). 3 This new measure has been plotted in Fig. 9 to better compare all architectures. Xilinx implementations do an intensive use of BRAMs and multipliers, with the subsequent low count of slices. However, if we map these components into equivalent slices, the area measurement grows significantly (see column T SLICES in Table III) , but with a performance comparable to our FB implementation. Consequently, the metric shows that this implementation is characterized by a poor performance-area ratio. Regarding the Spiral architectures, SPIRAL P4 TH2 shows an intensive use of BRAM components with the corresponding reduced number of slices. The second architecture, SPIRAL P4 TH128, performs a more efficient BRAM 3 The clock frequency associated to a given implementation depends on the particular architecture; we have therefore considered that a measure based on the number of samples processed per second is more fair. Moreover, the cost associated to that performance is given by the logic required (number of slices in an FPGA). mapping. The performance-area ratio shows for the first architecture a similar behaviour to the Xilinx implementation, whereas the second implementation improves this ratio significantly. We can conclude that FF and FB architectures provide the best implementation option, together with one of the Spiral architectures. In this sense, the FB architecture presents the best results in terms of area and the FF architecture shows the best performance figures, while SPIRAL P4 TH128 provides a solution that can be placed in between.
B. Scenario 2. Analysis of the Monobit FFT Implementation
In this scenario we study the results obtained for the FB and FF architectures when implementing the monobit FFT. The monobit FFT [11, 12] simplifies the computation of the FFT kernel by approximating the rotations with simple angles, 0, ¼=2, ¼, or ¡¼=2, which makes unnecessary any multiplier or rotator. This results in a clear reduction of the final area and a significant improvement in speed. The monobit architectures have been implemented with different bitwidths of the input samples to observe the benefits in terms of area and performance obtained with the monobit simplification. Both truncation and no truncation options have been considered again. Table IV summarizes these results. Additionally, given that the FF architecture allows the parameterization of the radix, we have explored implementations with other radices, as can be seen in Table V.  If we compare Tables II and IV we can observe how the monobit implementations present a variable increase in speed and a clear reduction in area and latency. In particular, the area of FB architectures is reduced a 20% together with a 20% improvement in speed. Fig. 10 shows the results of the design exploration performed with our tool for conventional and monobit FFTs in different points of the design space.
All monobit FB architectures present a performance increase obtained by means of a higher clock frequency. This is due to the substitution of the complex rotator in the conventional FFT, which is in the critical path, by a monobit one, which is faster. On the other hand, the FF architecture hides these benefits because the limiting speed element is not the rotator but the memory. In terms of area, as expected, larger monobit FFTs save more area.
Regarding the radix variations, the benefits of the monobit implementation can be clearly seen in the case of 2 input bits (see Table V ). The FF monobit architecture allows implementations with higher radix (16, 32) , with the corresponding performance improvement. These results cannot be obtained by conventional FF FFTs. In this case the clock frequency is more or less the same, because the critical path is related to the complex memory structure, but the higher the radix, the higher the performance can be achieved (up to GS/s).
C. Scenario 3. Power Consumption
A key parameter in most data processing applications is the power consumption of the resulting implementation. It is due to two main reasons. First, the power density in current FPGAs may produce an uneven distribution of temperature on the surface of the device with the corresponding hot-spots. This may produce a malfunctioning of the particular device or even of the whole system. Second, many current systems may be battery powered, what makes the power dissipation a new design dimension to be considered during the design cycle. We have evaluated the power dissipation that our FFT architectures present with the XPower tool, from Xilinx, and represented it in Fig. 11 . As can be seen, the power dissipation of FB and FF architectures has been evaluated for different clock frequencies. It is well known that dynamic power is directly proportional to the frequency, as can be observed in that plot. Of higher interest is the comparison that can be carried out between FB and FF architectures and between conventional and monobit FFTs. As expected, the monobit implementation exhibits the lowest power dissipation, mainly due to its lower complexity. Moreover, when comparing FB and FF architectures, the FF architecture presents a higher power consumption due to its higher complexity.
As expected, the power consumption is directly related to the area of the implementation (including both logic and interconnection area). We should remark that most dynamic power in FPGAs is consumed in the interconnection resources (70%, as studied by [22] ). For FFTs of the same length, the FF architecture always presents greater power consumption than the FB one. Moreover, the monobit simplification obtained in terms of area can also be observed in terms of power. Even though the power results plotted in Fig. 11 are very high, recent advances in FPGA technology [23] include process and architecture innovations to reduce both static and dynamic power. For instance, the dynamic power consumption measured in the new Virtex-5 FPGAs presents a 55% reduction when compared with the previous implementation family (Virtex4). Therefore, we expect that the power consumption of the FB and FF architectures will be reduced orders of magnitude with respect to the values plotted in Fig. 11 , which correspond to VirtexII FPGAs, the family previous to Virtex4.
Regarding the influence of the input signals on the power consumption, we have evaluated our implementations with both Gaussian noise and sinusoids with different amplitudes, and we have observed that the power consumption is similar. The reason for this performance is that the activity rates of the input signals are in all cases very similar, due to the two's complement representation of the input signal.
V. ANALYSIS OF A CHANNELIZED RECEIVER PERFORMANCE BASED ON THE PROPOSED FFT ARCHITECTURES
The channelized receiver to be characterized is depicted in Fig. 12 . This receiver consists of an ADC followed by an FFT-based filter bank. Finally, a decision is taken applying a linear detector to the outputs of the filter bank. 4 We have studied this channelized receiver from the signal detection point of view, assuming that the input is a sinusoidal signal with its associated real, additive, white, Gaussian noise with standard deviation ¾ which is bandlimited to the Nyquist frequency. False alarm probability, detection probability, and dynamic range of several FFT implementations have been analyzed in order to determine their performance in terms of input data bitwidth, truncation along the FFT stages, radix, and number of points N. Throughout this section a rectangular window without normalization has been employed.
It might be important to note that the performance of the FB and FF implementations is identical because they are different hardware implementations of the same algorithm.
A. False Alarm Probability
The impact of some hardware design considerations (rotator error and butterfly truncation) on the false alarm probability is analyzed. All false alarm probability simulations have been obtained using the Monte Carlo method with 10 6 independent trials. The simulated false alarm probability P fa in these figures is obtained as an average of the probabilities through the channels of the FFT. All the channels in the FFT, except channels 0 and N=2, have the same theoretical statistics, and, therefore, the same theoretical P fa . However, the lowest channels in the implemented FFTs are more affected by the 4 From the detection point of view, both linear and quadratic detectors have the same sensitivity in this receiver [24] , although the threshold for the linear detector T L is the square root of the threshold for the quadratic detector T Q : T 2 L = T Q . From the implementation point of view, a linear detector implies a square root if we calculate the modulus following the straighforward definition. However, the Cordic algorithm can also be used to determine the modulus without multiplications and square roots. On the other hand, the number of bits to represent the modulus is half the number of bits to represent the output of a quadratic detector. rotator errors, which results in different statistics at their outputs (different means), and, therefore, higher thresholds must be used to fix P fa to the selected value in those channels.
The theoretical false alarm probability per channel for a linear detector is
where T is the threshold of the detector, N stands for the number of points of the FFT, and k represents the intrinsic global voltage gain due to the CORDIC rotators [18] , which depends on the number of stages.
1) Impact of Rotator Error:
The first design under study is a 1024-point FFT, radix 4, without truncation in the butterflies, and with an input bitwidth of 8 bits.
5 Fig. 13 shows the averaged false alarm probability per channel for different values of noise standard deviation ¾. When ¾¸5 the experimental and the theoretical results are quite similar. As ¾ decreases, the discrepancies between theory and implementations increase. An in-depth analysis of this problem has shown that the deviation of the theoretical and the simulated results is more significant for the lowest channels of the FFT. This is shown in Fig. 14, where we have reproduced the results for ¾ = 3 from Fig. 13 . We have also added the averaged false alarm probability per channel when channels 1 to 3 are not considered, which is very close to the theoretical one. The lowest channels have different statistics at their outputs, and, therefore, higher thresholds must be used to fix P fa to the selected value in these channels.
There are two reasons for this behavior: first, the modification of the statistics at the input of the FFT processor due to quantization noise of the ADC that is more significant for low input power noise (low ¾); second, it can be shown that the propagation of this quantization noise through an FFT architecture based on nonideal rotators has more influence on the lowest channels.
Radix. The rotator error effect is more pronounced for radix 2 than for radix 4. This is related to the fact that the number of rotators in a radix-2 architecture is almost double and consequently, there exist more sources of error.
Results for radix 2 and ¾ = 3 are depicted in Fig. 15 . It can be noted that the effects of the different statistics in the the first channels due to errors in the rotators are more pronounced: channels 1 to 15 must be eliminated in the calculations of the averaged P fa instead of channels 1 to 3 in Fig. 14 in order to follow an exponential-like false alarm probability.
Number of Points. When the number of points of the FFT increases, two circumstances must be considered. First, the FFT has more stages, so more rotators are used. Second, the difference between two rotation angles is smaller, so that more accurate rotations must be performed. Fig. 16 shows the P fa of a 4096-point FFT, without truncation, radix = 4, and input bitwidth of 8 bits. For ¾ = 10 the simulated P fa , not plotted in the figure for readability, is similar to the theoretical one, as it happens with the 1024-point FFT. However, for ¾ = 3 the rotator error affects the results of the FFT, and the effect in the lower channels is more deleterious than that in the 1024-point FFT (Fig. 14) .
2) Impact of the Butterfly Truncation: If an input bitwidth of 8 bits and a 1024-point FFT with butterfly truncation are considered, it must be realized that altogether 10 bits are removed through the FFT simulations (2 bits per stage for a radix-4 implementation), which leads to a lower performance compared with FFT implementations without truncation. For example, a P fa = 10
¡3 cannot be achieved for ¾ = 10. Thus, in order to study the influence of the butterfly truncations, a 1024-point FFT with 16 input bits has been chosen, which is a widely used architecture. Firstly, if ¾ is high, the P fa is similar to the theoretical one, so neither the butterfly nor the rotator truncations are significant. Fig. 17 shows the difference between the theoretical and the simulated P fa curves when ¾ = 256: 6 For lower values of ¾, the butterfly truncation modifies the probability density functions of the noise samples. As a consequence, big discrepancies between the simulated P fa and the theoretical one which was deduced for continuous Gaussian noise samples are observed.
B. Detection Probability and Dynamic Range
The impact of some hardware design considerations on detection probability and dynamic range are analyzed. All simulated detection probabilities have been obtained using Monte Carlo 6 The stepped shape of the simulated curve results from the quantization noise. simulations with 10 3 independent trials. Channels 0 and N=2 are not considered in the calculations because noise statistics are different in these channels.
The sensitivity S = A 2 min =2 is defined as the power of a sinusoid of amplitude A min at the input of the channelized digital receiver that assures certain detection probability P d for a fixed false alarm probability P fa . Thus, the minimum signal-to-noise ratio, SNR min , is the quotient between the sensitivity and the input noise power: SNR min = (A 2 min =2)=¾ 2 . The theoretical SNR min can be calculated from the required SNR at the input of the detector, SNR O , for given P d and P fa [24] , and the FFT processing gain. 7 Thus, the theoretical SNR min for a centered sinusoid and rectangular windowing can be obtained as
On the other hand, the maximum signal-to-noise ratio, SNR max , is the quotient between the most powerful input sinusoid without saturation at the output of an ADC with b bits and the input noise power:
Finally, the dynamic range, DR, is the quotient between the SNR max and the SNR min . It is a well-known result from radar detection theory [24] that an increase in the threshold will reduce the false alarm probability and will cause an increase in the required signal-to-noise ratio to get a fixed detection probability. This increase in the signal-to-noise ratio will also cause a reduction of the dynamic range. Fig. 18 shows a schematic diagram of the behaviour of our FFT implementations (SNR min and DR) versus the input noise standard deviation ¾. In this figure, three main regions can be distinguished. When a high value of ¾ is selected (region C), the SNR min remains constant (A min decreases when ¾ decreases) and is independent of ¾ as in the theoretical case. This happens because the quantization noise due to the ADC is not relevant for high values of ¾. Likewise, the DR follows the theoretical behavior: it increases 6 dB per octave. Therefore, the best performance is obtained for decreasing values of ¾.
On the contrary, in region A the quantization noise is appreciable because ¾ is low and both signal and noise may occupy a few quantization levels. Under these circumstances, the dynamic range is approximately constant because a significant reduction of ¾ hardly modifies A min , making the SNR min get worse. As a result, the higher ¾ is, the better the performance obtained.
The performance of some 1024-point FFT designs is depicted in Fig. 19 , which shows the SNR min and dynamic range in dB for a detection probability P d = 90% and P fa = 10 ¡3 depending on the input noise. 8 In order to compare architectures with input bitwidths of 8 and 16 bits, ¾ for the 16-bit architecture is normalized by 2 8 . Under these circumstances, SNR O min = 11 dB according to [24] , and, therefore, SNR min (dB) = ¡16:09 dB. Following 8 Channels with different noise statistics because the rotator error has not been included in the calculations. this, the theoretical DR is obtained as
From the analysis of this figure, important conclusions can be drawn. All FFT implementations follow the performance described in Fig. 18 , although there are some differences which are necessary to point out. All FFT implementations with the same number of points have the same performance for high values of ¾. Moreover, the minimum reachable sensitivity is equal to the theoretical one and only depends on the number of points of the FFT. Therefore, the SNR min decreases 3 dB and the dynamic increases 3 dB when the number of points is doubled. The range of ¾ where the analyzed system performs ideally is wider for the radix-4 architectures than for the radix-2 ones. The reason is the larger number of rotators used in the radix-2 designs, which results in an increase in the threshold to fix P fa , and, therefore, a decrease in the sensitivity. Additionally, radix-4 architectures almost always take up less area than the radix-2 ones, so radix 4 is usually the best choice.
The ratio in dB between the maximum amplitude at the input of an ADC without truncation and the quantization level for an ADC with b bits is 6 £ b dB. Thus, a maximum DR of 48 dB could be expected for an 8-bit FFT. However, as shown in Fig. 19 , a dynamic range of about 54 dB may be obtained with an 8-bit FFT without saturation. Consequently, signals with an amplitude lower than a quantization level can be detected. On the other hand, the results for a 16-bit FFT with truncation are better than the 8-bit implementations without truncation. However, it must be considered that a 16-bit FFT could achieve a DR higher than 6 £ b = 96 dB but, due to the butterfly truncation, it can only obtain a DR of 62 dB and the lowest detectable signal has an amplitude A min ¼ 25. Therefore, five of the less significant bits are wasted. As a design rule, it is more interesting to use an FFT with less bits and without truncation than an FFT with a large number of bits and truncation. 1) Monobit FFT: Rotators are the only difference between the implementations of the conventional and monobit FFTs. In the monobit FFT the rotations are accomplished by swapping the real and imaginary components of the signal and/or changing the sign of the components. Consequently, although the rotation angles are approximations to the ones in the FFT, there exists no error in the rotations, and the gain of the monobit rotator is always k = 1 in (4). However, the approximate rotations imply modifications in the coefficients of the filters obtained from the FFT algorithm, which results in high sidelobe levels [12] ; see Fig. 20 . The average of the highest sidelobe levels for the filters of the monobit FFT-based filter bank is 9-10 dB below the mainlobe, which cannot be improved by windowing [12] . This represents a serious limitation for the instantaneous dynamic range (the capability to detect simultaneous signals with different powers) of a channelized receiver [4, 12] .
On the other hand, the processing gain of the monobit FFT for centered sinusoids depends on the frequency bin and is always lower than that of the conventional FFT. Fig. 21 shows that there is a degradation of 1-3 dB in the processing gain for the case of 1024-point and radix 4 FFT. This degradation calls for an increase in the signal-to-noise ratio at the input of the channelized receiver to maintain the detection probability. This point was discussed in detail in [12] . Fig . 22 represents the sensitivity and the dynamic range for a 1024-point FFT and two frequency bins (bin 127 with processing gain G = 24 dB, and bin=128 with G = 26 dB). Two different implementations are analyzed: an FFT without truncation and 8-bit input bitwidth and an FFT with truncation and 16-bit input bitwidth. 9 The same comments for Fig. 19 apply here. However, additional features appear in the monobit implementations. On the one hand, the difference in the sensitivity for high ¾ is due to the nonconstant processing gains throughout the filter bank generated with the monobit FFT. On the other hand, the improvement in the dynamic range and the lower deterioration in the sensitivity compared with the conventional implementation, Fig. 19 , is related to the gain of the monobit rotator, k = 1, in (4) , and the elimination of the errors in these rotators.
VI. IMPLEMENTATION OF FFT-BASED DIGITAL CHANNELIZED RECEIVERS
Even though the basic implementation of digital channelized receivers is based on the FFT algorithm, additional elements are required. This section is devoted to the analysis and design of the whole system, because the configuration and implementation of all the elements involved can significantly influence the final performance of the receiver. In this sense, windowing or ADCs may play an important role in the system because some FPGA resources may be required for its implementation.
Other important issues in FFT-based digital channelized receivers such as windowing, data reuse in overlapped FFTs, coherent and noncoherent processing, detection, and automatic modulation classification of simultaneous broadband signals were deeply analyzed by the authors in [26] . Fig. 23 illustrates the way the FB architecture can be used in a digital channelized receiver. In this case, a 1024 radix-4 FB FFT without (with) truncation can process 251 (274) MS/s, which requires an input data rate limited to 250 MS/s. Given that the FB FFT can process a sample per clock cycle, the input flow could be digitized with a simple ADC (from x(t) to x(n)). However, two analog real signals x(t) and y(t) can be processed in parallel given that our implementations for the FFT are prepared for complex input data sequences. A deinterleaving stage is needed afterwards to split the FFT output and order the complex transforms (X(k), Y(k)).
10
Fig . 24 illustrates how the FF architecture can be used in a digital channelized receiver, in a similar way to the FB architecture but with a higher degree of parallelism. The FF FFT can process several samples per clock cycle, and the throughput depends basically on the radix. For a radix-4 implementation four samples must be input to the structure per clock cycle, as shown in Fig. 24 . The clock speed can be 222 MHz for both implementations with and without truncation, providing a processing speed of 888 MS/s. In this case, the input flow must be digitized with a fast ADC and a demultiplexor has to be used, in addition to a parallelization stage. With this, the input flow x(t) can provide the four digital samples required by the FF architecture x(n).
Finally, it is well known that the input/output data rate may be a serious bottleneck in current signal processing systems. In this sense, Xilinx FPGAs Virtex-II and Virtex-II Pro families can reach a maximum clock frecuency of 420 MHz. This clock frequency is also the limit for the frequency of the standard I/O on FGPAs (420 Mbit/s). However, it is possible to duplicate the I/O data rate to 820 Mbit/s using DDR signaling, which means that data can change on each edge of the clock, or differential signaling. The Virtex-4 family, with a maximum internal clock of 500 MHz, provides IO data rates that are even higher. This family provides up to 600 MHz for standard IO, and up to 1 Gbit/s with DDR and differential signaling.
With these I/O data rates it is not possible to interconnect directly an FPGA with a high performance ADC. For example, the ADC TS83102G0B [27] can produce samples of 10 bits up to 2 GS/s. In this case, we can solve this problem by using a demultiplexor (DMUX) between the ADC and the FPGA, for instance the AT84CS001, as was depicted in Fig. 24 . This component allows configurations of 1 : 4 (which results in 500 Mbit/s in each pin, being therefore necessary to use DDR and LVDS in the FPGA) or 1 : 8 (250 Mbit/s in each pin, data rate in the limit of the FPGA standard I/O). Actually, the last commercial version of this ADC integrates the DMUX to simplify the interface with FPGAs. For example the part AT84AS004 [27] is an ADC of 10 bits and 2 GS/s with and integrated DMUX of 1 : 4.
VII. CONCLUSIONS
We have presented a comparative study of parallel pipelined architectures of the FFT algorithm targeting FPGA devices for the implementation of digital channelized receivers. The in-depth exploration of the FFT design space has been carried out with the help of a developed automatic tool. Both the digital circuit designer point of view (where area, throughput, and latency are the main targets) and the system designer perpective (where signal processing capabilities and power consumption are the main concerns) are taken into account for a joint assessment. From our analysis we can conclude that the FF architecture offers the optimum throughput at the expense of a higher power consumption, which will be reduced in the new generation of FPGAs. On the other hand, the FB architecture is the optimum solution if the area and power requirements are critical. A monobit version of both architectures can improve area, throughput, and consumption with a degradation of the detection capabilities and instantaneous dynamic range.
The experimental results have confirmed that the FB architecture requires lower area than the FF architecture but the latter allows to parallelize samples, which increases the throughput. Consequently, FB structures can be used for FFTs with large number of points, while FF architectures are better suited for applications with hard real-time constraints. 
I. INTRODUCTION
The problem of detection, and localization of multiple ground targets using unattended acoustic sensors is complicated due to various factors. These include: variability and nonstationarity of source acoustic signatures, signal attenuation and fading effects as a function of range and Doppler, coherence loss due to environmental conditions and wind effects, near field and nonplane wave effects, and high level of acoustic clutter and interference. In addition, presence of multiple closely spaced targets that move in tight formations, e.g. staggered, abreast or single-file, further complicates the direction of arrival (DOA) estimation, data association, and localization processes. Clearly, optimum performance for detection and localization of multiple acoustic sources is 0018-9251/08/$25.00 c°2008 IEEE
