Abstract-This paper presents an efficient implementation of digit-serial/parallel multipliers on 4-input look-up table (LUT)-based field programmable gate arrays (FPGAs). This class of FPGA devices hides individual gate delays and adds significant wiring delay. These two facts substantially alter the theoretical advantages of each topology. Architectural transformations are applied to obtain topologies with minimum logic depth, in which the maximum clock speed is limited only by the FPGA technology. The main results of applying those transformations to the different multipliers have been quantified for the Altera FLEX10K family, and the conclusions have been extrapolated to other FPGA families.
I. INTRODUCTION
Real-time signal processing hardware requires efficient multiplier units. However, each application demands a different sample rate: from speech to image or radar, a wide frequency range must be covered. In most technologies, a bit-parallel circuit is expensive: its cost in area is critical, and it runs faster than the throughput required by the application. At this point, the bit-serial [1], [2], [3] or digit-serial [4], [5], [6], [7], [8] approaches become an important alternative. Furthermore, in FPGA-targeted applications, the serial stream of data matches the structure of such devices better [9]. This paper presents a systematic study of the FPGA implementation of digit-serial/parallel multipliers. In Section II, different multiplier topologies are presented. In Section III, actual results are summarized, and different techniques to enhance the performance are evaluated. All the analyzed circuits have been implemented on an EPF10K50GC403-3 FPGA [10]. These prototypes operate on 8-bit two's complement words (data and coefficients), with digit sizes of N = 1, 2, 4, and 8 bits. For clarity, the experimental part is divided into the following subsections. Section III-A reviews the conventional methods to pipeline serial/parallel multipliers (SPMs), and demonstrates that this technique is inefficient in some of the circuits. For these cases, a new alternative to pipelining is also proposed. In Section III-B, a mechanism based on the asynchronous clear or set of the registers is presented. It reduces the area of some bit-serial multiplier versions. Finally, a modified structure that reduces the logic depth is explained in Section III-C.
II. SERIAL/PARALLEL MULTIPLIERS (SPMS)
SPMs are embedded in digital signal processing (DSP) blocks to compute the multiplication of a coefficient by data. The coefficient is expressed in parallel form, while the data enters the multiplier as a serial stream. From the FPGA implementation point of view, where logic is mapped into look-up tables (LUTs), there are two alternatives to compute the single precision serial/parallel multiplication of two's complement numbers. The main difference between them is the way the sign bit of the input data is processed. In this paper, these two alternatives are named SPM-I and SPM-II. The first does not extend the sign bit of the input data [4], [11], [12], [13], while the second does [4], [5], [14]-[18]. Their bit-serial circuits are shown in Fig. 1.
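The arithmetic difference between the two sign-handling strategies can be illustrated with a small behavioural model. The following Python sketch is not a description of the circuits in Fig. 1; it only assumes that the data arrives least significant bit first, and the function names are hypothetical. The first routine gives the sign bit a negative weight instead of extending it, while the second sign-extends the data to 2W bits and wraps the result modulo 2^(2W).

```python
def to_bits_lsb_first(value, width):
    """Two's complement bits of value, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def multiply_without_sign_extension(coeff, data, width):
    """Add one partial product per data bit; the sign bit weighs -2^(width-1)."""
    acc = 0
    for i, bit in enumerate(to_bits_lsb_first(data, width)):
        partial = (coeff * bit) << i
        if i == width - 1:
            acc -= partial  # the sign bit of the data subtracts its partial product
        else:
            acc += partial
    return acc


def multiply_with_sign_extension(coeff, data, width):
    """Sign-extend the data to 2*width bits and wrap the sum modulo 2^(2*width)."""
    total_bits = 2 * width
    acc = 0
    # Python's >> is arithmetic, so the upper bits repeat the sign bit.
    for i, bit in enumerate(to_bits_lsb_first(data, total_bits)):
        acc += (coeff * bit) << i
    acc &= (1 << total_bits) - 1
    # Reinterpret the wrapped value as a signed 2*width-bit number.
    if acc >= 1 << (total_bits - 1):
        acc -= 1 << total_bits
    return acc
```

Both routines return the same double-length product; the second consumes twice as many data bits, which is the cost of extending the sign bit in a serial data stream.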
The computational scheme of SPM-I can be used to design double precision SPMs (DPSPMs) [6], [7], [19]. A bit-serial DPSPM is depicted in Fig. 2. The goal of this circuit is to maintain the throughput without either adding extra clock cycles to insert zeros or extending the sign bit.

The digit-serial/parallel multipliers presented in the previous section were implemented using an EPF10K50-3 FPGA [10]. Each multiplier family uses 8-bit data and coefficients. Their versions include digit sizes of N = 1, 2, 4, and 8 bits. The place and route of the circuits was performed using the default options of the tools, except for the indication of the wires that require fast carry lines [10]. Every circuit version was evaluated according to the following parameters: maximum clock frequency (fc); maximum propagation delay (Tpro); maximum sample frequency (fs); area (A); logic depth (LD); and finally, the area-time product (A 2 T ). Table I presents the implementation results of the SPM-I, SPM-II, and DPSPM, from which a first area-time figure can be obtained.
It is important to remark that some of the theoretical advantages of DPSPM circuits are hidden by an FPGA implementation, the technological framework selected in this work. For instance, the clock frequency of an SPM and a DPSPM with the same digit size should theoretically be identical. Thus DPSPMs should double the sample rate with respect to SPMs (because the former do not require the insertion of extra zeros). However, in the implemented versions, SPMs run faster than DPSPMs, so the resulting sample rates of the DPSPMs are only slightly higher (see Table I). This effect is a consequence of the fixed structure of the selected FPGA: a matrix organization in which each element is a 4-input LUT. The circuits therefore have to be divided into 4-input functions in order to be implemented. The logic depth of the SPM and the DPSPM is 2 LUTs and 3 LUTs, respectively, as can be seen in Fig. 3. The logic depth increment in the DPSPM is caused by the extra PSCs. As a result, the throughput is degraded with respect to the ideal case.
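The relation between clock frequency and sample rate discussed above can be sketched as follows. The sketch assumes that a single precision SPM spends 2W/N clock cycles per W-bit word (data digits plus the inserted zeros or sign extension) and that a DPSPM spends W/N cycles; the clock values in the usage note are hypothetical and are not taken from Table I.

```python
def cycles_per_word(word_bits, digit_bits, double_precision):
    """Clock cycles needed per word (assumption: 2W/N for SPM, W/N for DPSPM)."""
    digits = word_bits // digit_bits
    return digits if double_precision else 2 * digits


def sample_rate_mhz(clock_mhz, word_bits, digit_bits, double_precision):
    """Sample frequency fs obtained from the clock frequency fc."""
    return clock_mhz / cycles_per_word(word_bits, digit_bits, double_precision)


def dpspm_gain(spm_clock_mhz, dpspm_clock_mhz, word_bits, digit_bits):
    """Ratio of the DPSPM sample rate to the SPM sample rate."""
    spm_rate = sample_rate_mhz(spm_clock_mhz, word_bits, digit_bits, False)
    dpspm_rate = sample_rate_mhz(dpspm_clock_mhz, word_bits, digit_bits, True)
    return dpspm_rate / spm_rate
```

With equal clocks, `dpspm_gain(100, 100, 8, 1)` gives the ideal factor of 2. With a hypothetical 100 MHz SPM and a 60 MHz DPSPM, the gain drops to 1.2, which illustrates how a deeper logic path erodes the theoretical advantage.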
Partitioning the logic into LUTs also gives the bit-serial circuits and the N = 2 bit digit-size circuits the same logic depth. As a consequence, both versions could ideally achieve the same clock frequency. Table III gives an example of this effect: the N = 2 DPSPM achieves a higher clock rate than the bit-serial one.
Considering that FPGA vendors continually market chips with different LUT sizes (denoted k), the optimal value of k that allows this kind of topology to achieve a logic depth reduction is summarized in Table II for different versions of the circuit. The k value in the k-LUT column can be reduced by one for the SPM-I and SPM-II circuits if the device also incorporates dedicated logic to implement the synchronous reset of the flip-flops. The case of Xilinx devices, whose configurable logic blocks (CLBs) consist of 4-input LUTs, is slightly different. On the one hand, FPGA families such as Spartan II, Virtex, and Virtex II contain dedicated logic to perform the synchronous reset of the flip-flop. On the other hand, these devices include dedicated multiplexers (called MUXFx) that allow several 4-input LUTs to be combined to implement functions with a higher number of inputs inside a CLB. The XC4000 and Spartan devices can implement functions of up to five inputs in one CLB, Spartan II and Virtex up to six inputs in one CLB, and Virtex II up to seven inputs in one CLB and eight inputs using two CLBs. Hence, bit-serial SPM-I and SPM-II multipliers will achieve minimum logic depth in such devices. Finally, note that although newer Altera families (Apex and Mercury) also include the logic resources to perform the synchronous reset, they cannot be used to reduce the logic depth of the target circuits: these hardware resources are not available when the LEs are configured in normal mode (one 4-input LUT per LE), but only in counter mode (two 3-input functions within an LE).
A. Pipelining
The feedback loops present in the serial/parallel multipliers limit the application of pipelining: it can only be performed by registering the outputs of the partial-product generator block [20]. The implementation results of pipelining the previous circuits are shown in Table III. In most cases, the throughput is not improved with respect to the original versions. Bit-serial circuits (N = 1 bit) are the exception: their logic depth is reduced by one LUT, with no area penalty. In contrast, pipelined DSMs (N > 1 bit) are larger and do not exhibit a logic depth reduction. This situation is repeated in every double precision multiplier version. For example, for N = 2 bits, an FPGA with 9-input LUTs would be necessary to get a speedup. When pipelining is applied in both the A and B cutsets, bit-serial DPSPM circuits reach the maximum clock frequency of the chip, and the area increases by only 5 LEs. As a final remark, the technique is only suitable for N = 1, where an effective logic depth reduction is achieved.
If pipelining were applied to implement these circuits on Xilinx devices, the A cutset would not be needed in the bit-serial SPM-I and SPM-II topologies, because they already exhibit minimum logic depth (1 CLB). In contrast, the A cutset would be required to achieve the minimum logic depth in the bit-serial DPSPM. As in the case of Altera devices, pipelining the digit-serial versions of the multipliers brings no advantage, since the logic depth remains constant.
B. Asynchronous CLEAR of FFs
The set and clear of the flip-flops (FFs) in serial/parallel multiplier implementations is conventionally performed synchronously [14], [19], [16]. Considering that most commercial FPGAs include an asynchronous clear and set, this feature can be used to eliminate one input signal of each LUT (the RESET signal) in those devices that do not incorporate dedicated resources to perform the synchronous reset (XC4000 and Spartan from Xilinx, and FLEX8K and FLEX10K from Altera). In this way, a logic depth reduction can be obtained, at the cost of one extra clock cycle to compute each word. The final balance between the potential speedup caused by a lower logic depth (and its corresponding wiring reduction) and the extra delay introduced by these additional clock cycles will depend on the chip model used to build the circuit.
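The balance described above can be quantified with a simple sketch. It assumes that the asynchronous clear adds exactly one clock cycle per word, as stated, and that the baseline number of cycles per word is known (16 is used in the usage note, which is an assumption corresponding to 2W cycles for a bit-serial 8-bit SPM). The function names are hypothetical.

```python
def break_even_clock_ratio(cycles_per_word):
    """Clock speedup needed so that one extra cycle per word costs no throughput."""
    return (cycles_per_word + 1) / cycles_per_word


def async_clear_pays_off(clock_sync_mhz, clock_async_mhz, cycles_per_word):
    """True if the asynchronous-clear version yields a higher sample rate."""
    rate_sync = clock_sync_mhz / cycles_per_word
    rate_async = clock_async_mhz / (cycles_per_word + 1)
    return rate_async > rate_sync
```

For 16 cycles per word, `break_even_clock_ratio(16)` is 1.0625, so the clock frequency must rise by more than 6.25% for the shallower logic to translate into a higher sample rate.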
Of the previous topologies, only the bit-serial SPM-II can take full advantage of this idea. In this circuit, each slice consists of two 5-input functions and 2 FFs. It can be mapped using 3 LEs, with a logic depth of 2 LUTs. By using an asynchronous clear, each cell requires two 4-input functions (2 LEs), reducing the logic depth to just one LUT. The experimental results indicate that both the logic depth and the area requirements decrease (see Table IV). Nevertheless, there is no effective speed increase for the selected chip: the parameter Tpro has been reduced, but the saturation frequency (125 MHz) has already been reached.
The previous optimization cannot be obtained in the other circuits: its applicability depends on the FPGA architecture. For example, the bit-serial SPM-I could be optimized if 5-input LUT FPGAs were available (the case of Xilinx FPGAs), whereas the 2-bit digit-size SPM-I and SPM-II would require 10-input and 9-input LUTs, respectively, to take advantage of this method.
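The dependence of the logic depth on the LUT size can be illustrated with a lower bound: a tree of k-input LUTs of depth d can depend on at most k^d inputs. The sketch below computes that bound; it is an assumption-based estimate, not a technology mapper, and an actual mapping may need more levels.

```python
def min_lut_depth(num_inputs, k):
    """Lower bound on the LUT levels needed for a function of num_inputs inputs."""
    if num_inputs < 1 or k < 2:
        raise ValueError("num_inputs must be >= 1 and k must be >= 2")
    depth, reachable_inputs = 1, k
    # Each extra level multiplies the number of reachable inputs by k.
    while reachable_inputs < num_inputs:
        reachable_inputs *= k
        depth += 1
    return depth
```

For instance, `min_lut_depth(5, 4)` is 2 while `min_lut_depth(4, 4)` is 1, which matches the SPM-II slice going from two LUT levels to one when the RESET input is removed. Likewise, a 9-input or 10-input function only reaches depth 1 when k is at least 9 or 10.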
C. Modification of the DPSPM (MDPSPM)
The modified multiplier structure presented in [21] is based on the fast SPM proposed by R. Gnanasekaran [22]. The main idea is to avoid the W extra clock cycles required to complete the computation by including a bit-parallel adder. Thus, the sum and carry vectors are added in parallel after the first W cycles. The adder block replaces the W clock cycles needed to perform the same operation serially. The circuit proposed in [22] can be transformed into a double precision serial/parallel multiplier simply by adding a PSC to the bit-parallel outputs of the ripple-carry adder (RCA). Commercial FPGAs usually allow the designer to build fast and small ripple-carry adders by using special carry-chain lines [10]. By modifying the circuit in this way, a logic depth reduction can be achieved. The original DPSPMs can be directly replaced by the MDPSPMs without any other circuit modification. The bit-serial version of this multiplier is shown in Fig. 4.
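The role of the bit-parallel adder can be illustrated with a behavioural sketch. It is restricted to unsigned operands for simplicity and does not reproduce the circuit of Fig. 4: partial products are accumulated in carry-save form (a sum vector and a carry vector), and a ripple-carry adder merges both vectors in a single step once the last data bit has been consumed. All function names are hypothetical.

```python
def carry_save_add(a, b, c):
    """3:2 compression: returns (sum_vec, carry_vec) with sum_vec + carry_vec == a + b + c."""
    sum_vec = a ^ b ^ c
    carry_vec = ((a & b) | (a & c) | (b & c)) << 1
    return sum_vec, carry_vec


def ripple_carry_add(x, y, width):
    """Bit-by-bit addition of two unsigned vectors, result modulo 2^width."""
    result, carry = 0, 0
    for i in range(width):
        xi = (x >> i) & 1
        yi = (y >> i) & 1
        result |= (xi ^ yi ^ carry) << i
        carry = (xi & yi) | (carry & (xi ^ yi))
    return result


def multiply_unsigned(coeff, data, width):
    """Accumulate in carry-save form for `width` cycles, then merge with the RCA."""
    sum_vec, carry_vec = 0, 0
    for i in range(width):
        partial = (coeff << i) if (data >> i) & 1 else 0
        sum_vec, carry_vec = carry_save_add(sum_vec, carry_vec, partial)
    # Without the adder, extra serial cycles would be needed to flush the carries.
    return ripple_carry_add(sum_vec, carry_vec, 2 * width)
```

In this model the final call to `ripple_carry_add` stands in for the W additional cycles that a purely serial scheme would spend resolving the remaining carries.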
The results of the FPGA implementation are reported in Table V. The only difference with respect to the previous circuits is that the FAST logic synthesis option (the assignment of carry-chain lines) has been used to map the bit-parallel adder. The MDPSPM topology achieves the best performance (Table V): the speedup ranges from 1.56 (for the bit-serial version) to 1.15 (for the bit-parallel version). The modified circuits have the same logic depth as the SPM ones. This is true for every digit-serial version, so they are one LUT shallower than the corresponding DPSPMs. As a result, they achieve a higher throughput than the conventional DPSPMs with nearly the same cost in area. If the MDPSPMs are compared with the single precision circuits, the throughput improvement varies from 1.35 to 1.69 (for the bit-serial version), up to 2 (for the bit-parallel version). Once again, the enhancement is obtained without increasing the latency. The MDPSPM circuits lead to several changes in the optimized area-time figure of the multipliers.
In Table VI, the results of pipelining the MDPSPM are presented. As remarked in Section III-A, pipelining the circuit after the partial product generation (cutset A in Fig. 4) increases the area but does not reduce the logic depth. As a consequence, only the results for N = 1 and 2 bits are useful for custom DSP designers.
In the MDPSPM, pipelining can be extended to the RCA outputs (cutset B in Fig. 4) . Results for the three pipeline alternatives (A, B, and both A and B cutsets) are reported in Table VI .
The main result can be summarized as follows: by pipelining at points A and B, the bit-serial MDPSPM reaches the maximum frequency imposed by the process technology (125 MHz). The cost in area is minimal (just 3 extra LEs), but the penalty to be paid is an increase in latency of two more cycles. This structure modifies the area-time figure in the range 12 MHz to 15.5 MHz, saving 14 LEs (17%) with respect to the previous alternative.
Finally, it is important to note that this modification should not be applied to Xilinx FPGAs: it only brings an improvement in bit-serial and 2-bit digit-size multipliers, and these circuits either achieve the minimum logic depth (maximum speed) directly or can easily achieve it by pipelining, as shown in the previous section.
IV. CONCLUSIONS
This work presented a systematic study of the FPGA implementation of digit-serial/parallel multipliers. The target technology has been a k = 4 LUT-based FPGA, but optimal results have been extended to other LUT sizes. Three types of serial/parallel multipliers (two of single precision, and one of double precision) have been evaluated. Pipelining has been applied to increase the speed of each class of multiplier. Several methods have been proposed to obtain a logic depth reduction, leading to the following conclusions:
Conventional pipelining only leads to a logic depth reduction in bit-serial circuits. It is not a suitable technique in digit-serial SPM and DPSPM circuits. Minimum logic depth is achieved in the bit-serial DPSPM if an extra pipelining stage (cutset B) is applied.
In the bit-serial SPM-II, the asynchronous clear of the FFs reduces the logic depth by one LUT and the area by W LEs.
The proposed modification of the DPSPM improves the performance for moderate word-lengths.
I. INTRODUCTION
The fast Fourier transform (FFT) plays an important role in the design and implementation of discrete-time signal processing algorithms and systems. In recent years, motivated by emerging applications in modern digital communication systems and terrestrial television broadcasting systems, there has been tremendous growth in the design of high-performance dedicated FFT processors [1], [2]. The pipelined FFT processor is a class of real-time FFT architectures characterized by continuous processing of the input data which, for reasons of transmission economy, usually arrives in word-sequential format. However, the FFT operation is very communication intensive, which calls for spatially global interconnection. Therefore, much of the effort in the design of FFT processors focuses on how to map the FFT algorithm efficiently to the hardware so that the serial input can be accommodated for computation. This paper presents a novel FFT implementation based on the use of digit-serial arithmetic, which can lead to very efficient architectures.
Since the data sequence x(n) arrives sequentially, the parallel data flow graph has to be projected along the order of the input sequence in order to obtain efficient pipeline architectures. As (2) shows, each stage of the FFT computation consists of retrieving the data F0(k), F1(k), F2(k), F3(k) for a specific k and performing the corresponding twiddle factor multiplication, followed by the multiplication by the radix-4 butterfly matrix. Direct implementation of (2) requires three multipliers to perform the twiddle factor multiplication, as shown in Fig. 1(a) [1], [3]. Here, the commutator is used to generate the proper data sequence for the following twiddle factor multiplication by swapping/exchanging the output data coming from the previous stage. The salient feature of this feedforward approach is that the trivial factor W_N^0 (= 1) in the twiddle matrix can be reflected in the hardware. However, unless four input data are sampled in parallel, this architecture cannot achieve full efficiency. For most applications, where the FFT processor must be interfaced to a continuous word-serial stream, it is only possible to achieve 25% hardware utilization, as there is a 4:1 mismatch between the bandwidth of the input data rate and that of the processor. (In general, the utilization for a radix-r butterfly unit is 1/r.) In order to compensate for this mismatch, a fully utilized architecture based on the use of digit-serial arithmetic units has been proposed in [4].
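The stage structure described above can be sketched in Python. Equation (2) is not reproduced in this text, so the sketch assumes it is the standard radix-4 decimation-in-time decomposition, in which F0(k) to F3(k) are the N/4-point DFTs of the subsequences x(4n + l). The function names are hypothetical, and the inner DFTs are computed directly rather than recursively to keep the example short.

```python
import cmath


def dft(x):
    """Direct O(N^2) discrete Fourier transform, used as a reference."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
            for k in range(n)]


def radix4_stage(x):
    """One radix-4 decimation-in-time stage built on N/4-point DFTs."""
    n = len(x)
    if n % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    quarter = n // 4
    # F_l(k): DFTs of the four decimated subsequences x(4n + l).
    f = [dft(x[l::4]) for l in range(4)]
    out = [0j] * n
    for k in range(quarter):
        # Twiddle factors W_N^(l*k); the l = 0 factor is always 1,
        # which is why three multipliers suffice.
        t = [f[l][k] * cmath.exp(-2j * cmath.pi * l * k / n) for l in range(4)]
        # Radix-4 butterfly matrix with entries (-j)^(l*m).
        out[k] = t[0] + t[1] + t[2] + t[3]
        out[k + quarter] = t[0] - 1j * t[1] - t[2] + 1j * t[3]
        out[k + 2 * quarter] = t[0] - t[1] + t[2] - t[3]
        out[k + 3 * quarter] = t[0] + 1j * t[1] - t[2] - 1j * t[3]
    return out
```

Each pass through the loop produces four outputs from four inputs, so a butterfly unit fed by one word per cycle is busy one cycle in four, which is the 25% utilization mentioned above.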
The other way of implementing (2) is to use a single multiplier for the twiddle factor multiplication as shown in Fig. 1(b) Fig. 1(a) , this scheme generates each element
