The effectiveness of implementing bit-stream signal processing (BSSP) multiplier circuits in FPGAs is evaluated in terms of hardware resources and clock frequency. In particular, BSSP multipliers realized on FPGA architectures with 6-input lookup tables (LUTs) are compared against those realized on architectures with 4-input LUTs. It is found that architectures featuring 6-input LUTs are well suited to BSSP applications, where wide combinatorial paths are common. Furthermore, the BSSP multiplier is compared against conventional parallel multipliers in terms of LUT resource requirements. For a given resource requirement, it is found that an over-sampling ratio of less than 32 is sufficient for a BSSP multiplier to outperform its parallel counterpart.
INTRODUCTION
Sigma-delta modulators (SDMs) are widely used to build analog-to-digital (A/D) and digital-to-analog (D/A) converters due to their simple architectures and good tolerance to analog component inaccuracy [1]. Conventional digital signal processors (DSPs) operate at the Nyquist rate, whereas SDMs always generate over-sampled data. Decimators and interpolators must therefore be inserted before and after the DSPs for sampling rate conversion in order to interface with these over-sampled SDMs. The inclusion of decimators and interpolators inevitably consumes extra logic and routing resources. To allow for more resource-efficient signal processing, digital circuits that directly process the over-sampled bit-stream signal from the SDM output have been developed [2, 3, 4, 5, 6, 7, 8]. This technique is referred to as bit-stream signal processing (BSSP) [2]. In this paper, we evaluate the performance gain in implementing bit-stream multipliers using FPGAs with a 6-input LUT architecture over those with a 4-input LUT architecture, in terms of resource utilization and speed. The contributions of this work are:
1. We present the detailed architectural design of the efficient bi-level bit-stream multiplier in [6], showing how the new 6-input LUT FPGA architecture can be utilized to achieve a compact and high-speed implementation.
2. We compare the results of implementing BSSP multipliers on a Xilinx Virtex-4 and a Xilinx Virtex-5 device to study the advantages due to the change in FPGA architecture.
3. We compare BSSP multipliers with traditional multi-bit multipliers to obtain the conditions under which the BSSP technique becomes more efficient than the traditional Nyquist-rate approach in terms of hardware resource consumption.
The rest of the paper is organized as follows: In Section 2, we describe our FPGA implementations of the bi-level and tri-level bit-stream multipliers. In Section 3, we present our implementation results and performance analysis. Finally, we conclude the paper in Section 4.
BIT-STREAM MULTIPLIERS
Figure 1 shows the conventional bi-level bit-stream multiplier [2]. An efficient FPGA implementation of the bit-stream multiplier utilizing a 4-input adder structure was recently proposed in [6]. Here, we provide the detailed architectural design of the bit-stream multiplier. A 4-input adder structure is shown in Figure 2. Using error feedback, it computes the average of four input bit-stream signals through the following equation:
where bit-stream adders. An example is indicated by the dashed box in Figure 1 . Referring to the same figure, the use of 4-input adders results in only two layers of adders (instead of four) with the top level being fed by the sub-products
When used to implement a bit-stream multiplier, the special nature of the inputs (the sub-products x[i]y[j]) to the top layer of the 4-input bit-stream adders actually gives rise to an efficient implementation of these adders, which is described below.
Denote, for simplicity's sake, the four inputs to the bit-stream adder in the top layer as [1] , where ⊕ denotes the XOR logic operation and the number in the bracket of each term denotes the number of unit delays. That is, instead of writing
Now we prove that the least significant bit (LSB) of the 2-bit truncation error s [n] [1] ), the XOR operation among them can thus be merged into the LUTs of the 4-input bit-stream adder. As a result, the overall design of the bit-stream multiplier now consists of two types of 4-input bit-stream adders, denoted as Types I and II, as shown in Figure 3. The Type-I adder is the merged-XOR structure in the top layer and the Type-II adder is the original structure. There are four Type-I blocks and one Type-II block. Including the six shift registers at the input, this structure requires a total of eleven LUTs and twelve FFs, as is verified in Section 3.
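The error-feedback principle of the 4-input adder in Figure 2 can be illustrated with a short behavioural model. This is a minimal sketch under stated assumptions, not the circuit of [6]: bits are taken as 0/1, the output is assumed to be the most significant bit of the 3-bit sum, and the two least significant bits are assumed to form the 2-bit truncation error that is fed back with one unit delay. The name `four_input_adder` is hypothetical.

```python
from typing import Iterable, List, Tuple


def four_input_adder(samples: Iterable[Tuple[int, int, int, int]],
                     err: int = 0) -> Tuple[List[int], int]:
    """Behavioural model of a 4-input bit-stream adder with error feedback.

    Each element of `samples` holds four input bits (0 or 1) for one clock
    cycle. `err` is the assumed 2-bit truncation error (0..3) carried over
    from the previous cycle. Returns the output bit-stream and final error.
    """
    out = []
    for bits in samples:
        total = sum(bits) + err   # 0..7, i.e. a 3-bit value
        out.append(total >> 2)    # assumed output: MSB of the 3-bit sum
        err = total & 0b11        # assumed feedback: the two LSBs
    return out, err
```

In this model each output bit and each error bit is a function of six signals (four inputs plus the 2-bit error), which is why such an adder is a natural fit for a single level of 6-input LUTs. Because the truncation error is never discarded, four times the number of ones at the output plus the residual error always equals the number of ones at the inputs, so the output density tracks the average of the four inputs.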
Tri-level Bit-stream Multiplier
The structure of a tri-level bit-stream multiplier, which was detailed in [7], is replicated in Figure 4 for easy comparison. Comparing Figure 4 with Figure 1, it can easily be seen that the structures of the bi-level and tri-level bit-stream multipliers are very similar. Like its bi-level counterpart, a tri-level bit-stream multiplier is constructed using two types of major components: (a) tri-level bit-stream adders; and (b) tri-level digit multipliers, which are briefly described below.
A tri-level adder has the structure shown in Figure 5. Unlike the case for the bi-level design, a tri-level 4-input adder cannot be efficiently implemented. Since each signal consists of two bits, a tri-level 4-input adder adds five 2-bit signals (four 2-bit inputs plus a 2-bit feedback). Implementing these logic functions requires multiple levels of 6-input LUTs. The adder tree of the tri-level bit-stream multiplier therefore follows the 2-input bit-stream adder tree structure shown in Figure 1. As shown in Figure 5, the outputs of a tri-level bit-stream adder are functions of up to five inputs. As a result, the tri-level bit-stream multiplier can benefit from the Virtex-5 6-input LUT architecture. We simply let the synthesizer optimize the wide-input combinatorial logic consisting of the tri-level bit-stream adder tree and the tri-level digit multipliers.
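The behaviour of a 2-input tri-level adder can be sketched in the same error-feedback style. The logic of Figure 5 is not reproduced in the text, so this is an assumption-laden illustration only: signals are modelled as integers in {-1, 0, +1}, the output is assumed to be the floor of half the sum, and the fed-back error is assumed to be a single bit (0 or 1). That choice gives four input bits plus one feedback bit, consistent with the "up to five inputs" mentioned above. The name `tri_level_adder` is hypothetical.

```python
from typing import Iterable, List, Tuple


def tri_level_adder(pairs: Iterable[Tuple[int, int]],
                    carry: int = 0) -> Tuple[List[int], int]:
    """Behavioural model of a 2-input tri-level bit-stream adder.

    Each pair holds two tri-level samples in {-1, 0, +1}. `carry` is the
    assumed 1-bit error (0 or 1) fed back from the previous cycle.
    Returns the tri-level output stream and the final carry.
    """
    out = []
    for a, b in pairs:
        total = a + b + carry    # ranges over -2..3
        q = total // 2           # floor division keeps q in {-1, 0, +1}
        carry = total - 2 * q    # remainder is always 0 or 1
        out.append(q)
    return out, carry
```

As with the bi-level adder, the sum of the inputs always equals twice the sum of the outputs plus the residual carry, so the output stream averages the two inputs.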
A tri-level digit multiplier implements the following logic 
where $z_1 z_0$ denotes the 2-bit output while $x_1 x_0$ and $y_1 y_0$ denote the 2-bit inputs.
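One possible gate-level form of such a digit multiplier is sketched below. The encoding is an assumption, since it is not stated here: each tri-level digit is taken to be a 2-bit two's-complement number, so +1 is 01, 0 is 00, and -1 is 11. Under a different encoding the logic would differ. The names `ENCODE`, `DECODE`, and `digit_multiply` are hypothetical.

```python
from typing import Tuple

# Assumed 2-bit two's-complement encoding (msb, lsb) of a tri-level digit.
ENCODE = {-1: (1, 1), 0: (0, 0), 1: (0, 1)}
DECODE = {bits: value for value, bits in ENCODE.items()}


def digit_multiply(x: Tuple[int, int], y: Tuple[int, int]) -> Tuple[int, int]:
    """Multiply two encoded tri-level digits, returning (z1, z0)."""
    x1, x0 = x
    y1, y0 = y
    z0 = x0 & y0           # product is non-zero only if both digits are
    z1 = z0 & (x1 ^ y1)    # product is negative when the signs differ
    return z1, z0
```

Under this assumed encoding, both output bits are functions of at most four input bits, so each fits in a single LUT on either device family.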
IMPLEMENTATION RESULTS AND DISCUSSION

Virtex-4 vs Virtex-5
The bit-stream multipliers are implemented on Xilinx Virtex-5 XC5VLX30 and Virtex-4 XC4VLX25 devices using the design tool ISE 9.1i. Table 1 presents the implementation results for the bi-level and tri-level bit-stream multipliers. Moving from the 4-input LUT architecture (Virtex-4) to the 6-input LUT architecture (Virtex-5), both bit-stream multipliers use fewer LUTs and achieve a higher clock speed. This means that our multiplier designs can take advantage of the new 6-input LUT feature. The LUT reduction and speed-up are greater for the bi-level bit-stream multiplier than for the tri-level one. This is due to the use of the 4-input bit-stream adder shown in Figure 2. As explained in Section 2.1, the 4-input bit-stream adder (Type-II adder) and its merged-XOR variant, the Type-I adder, can each be efficiently mapped onto one level of 6-input LUTs. This allows the bi-level bit-stream multiplier to be implemented using only two levels of LUTs, thus achieving a very high speed. In contrast, on the Virtex-4 platform, the Type-I and Type-II adders must be split among several 4-input LUTs; hence more LUTs are required and poorer speed performance is observed.
For the tri-level bit-stream multiplier, only the 2-input bit-stream adder tree structure is implemented. This results in a higher logic complexity and slower speed than the bi-level design. As the multiplier consists of combinatorial logic functions of more than four inputs (the tri-level bit-stream adders), the logic maps more efficiently onto 6-input LUTs than onto 4-input LUTs. This is confirmed in Table 1 when the implementation results for Virtex-4 and Virtex-5 are contrasted.
Bit-stream vs Multi-bit Multiplication
We compare the FPGA resource requirements for bit-stream and multi-bit multipliers to investigate the condition on the over-sampling ratio (OSR) under which the bit-stream approach is the more resource-efficient one. Figure 6 shows the noise power versus OSR plot for the bi-level and tri-level bit-stream multipliers. The graph is obtained using MATLAB simulation. The quantization noise of an n-bit multiplier is approximated in [2] as:
We use the Xilinx CORE Generator to implement multipliers with various bit-lengths on Virtex-4 and Virtex-5 devices. The results are tabulated in Table 2.
Comparing the LUT results in Table 1 with Table 2, for the case of the bi-level bit-stream multiplier implemented on a Virtex-4 device, we see that its 26 LUTs sit between the LUT resources of a 4-bit and a 5-bit multiplier. According to Equation 3, the quantization noise power of a 5-bit multiplier is about -35 dB. From Figure 6, this can be achieved when the OSR is about 32. The "break-even" OSRs, i.e., the OSRs beyond which a bit-stream multiplier becomes more efficient than a multi-bit multiplier, of the other three cases in Table 1 can be obtained similarly. The results are summarized in Table 3. From Table 2, it can be seen that the LUT results for the multi-bit multiplier implementations do not vary significantly when Virtex-5 is used instead of Virtex-4. In contrast, for bit-stream multipliers, the Virtex-5 implementation is more efficient than the Virtex-4 implementation. Therefore, the break-even OSR for Virtex-5 occurs earlier than that for Virtex-4, and we can conclude that BSSP performs better in Virtex-5 than in Virtex-4. Note that for bi-level bit-stream multiplier implementations, the break-even OSRs are below 32. In practical applications, the OSR is almost always higher than 32.
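The -35 dB figure quoted for a 5-bit multiplier can be checked with a short calculation. The expression from [2] is not reproduced here, so the sketch below assumes the textbook uniform-quantization model: a signal range of [-1, 1), a step size of 2/2^n, and a noise power of step^2/12. Under that assumption the 5-bit case comes out at about -34.9 dB, in line with the value used above. The function name `quantization_noise_db` is hypothetical.

```python
import math


def quantization_noise_db(n_bits: int) -> float:
    """Noise power in dB of an n-bit uniform quantizer over [-1, 1).

    Assumed model (not taken from the source): step = 2 / 2**n_bits and
    noise power = step**2 / 12.
    """
    step = 2.0 / (2 ** n_bits)
    return 10.0 * math.log10(step ** 2 / 12.0)


# Tabulate the assumed noise floor for the word lengths around break-even.
for n in range(4, 7):
    print(n, round(quantization_noise_db(n), 1))
```

Each additional bit lowers the assumed noise floor by about 6 dB, so a break-even OSR for another multiplier width can be read from Figure 6 in the same way once the noise target is known.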
CONCLUSION
We have implemented bi-level and tri-level bit-stream multipliers in Virtex-4 and Virtex-5 devices to study the performance gain of BSSP arithmetic circuits in the new 6-input LUT architecture over the 4-input LUT architecture. It has been shown that BSSP implementation is more effective on FPGAs featuring 6-input LUTs in terms of resource utilization and clock speed. We have also found the OSRs for the bi-level and tri-level bit-stream multipliers above which BSSP becomes more resource-efficient than traditional Nyquist-rate multi-bit operations.
