Quad-level bit-stream signal processing (BSSP) 
Introduction
In a system that utilizes the bit-stream signal processing (BSSP) technique, digital signal processing (DSP) is performed directly on the oversampled bit-stream output of the sigma-delta modulator (SDM), without the use of decimators and interpolators. The hardware complexity of BSSP systems is potentially much lower than that of conventional multi-bit processing systems, giving rise to a low-cost and resource-efficient way of signal processing [1] [2] [3] [4] [5] [6].
One way to increase the signal-to-noise-and-distortion ratio (SNDR) of the bit-stream is to utilize a multi-level quantizer. Quantizing the analog input into two, three, or four levels results in bi-, tri-, and quad-level BSSP processing elements with increasing hardware complexity. Such an increase in resource utilization counteracts the benefit that BSSP gains from eliminating decimators and interpolators.
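As an illustration of the multi-level quantization described above, the following is a minimal behavioral sketch of a first-order sigma-delta loop with a two-, three-, or four-level quantizer. It is not the modulator used in this work: the loop order, the uniform level spacing over [-1, 1], and the names `quantize` and `sigma_delta` are assumptions made only for this example.

```python
def quantize(v, levels):
    """Map v to the nearest of `levels` uniformly spaced values in [-1, 1].

    levels=2 gives {-1, +1}, levels=3 gives {-1, 0, +1},
    levels=4 gives {-1, -1/3, +1/3, +1} (assumed spacing).
    """
    step = 2.0 / (levels - 1)
    index = round((v + 1.0) / step)
    index = min(max(index, 0), levels - 1)  # clip to the outermost levels
    return -1.0 + index * step


def sigma_delta(samples, levels):
    """First-order loop: integrate the input minus the previous output."""
    integrator = 0.0
    previous = 0.0
    out = []
    for x in samples:
        integrator += x - previous
        previous = quantize(integrator, levels)
        out.append(previous)
    return out


if __name__ == "__main__":
    for levels in (2, 3, 4):
        stream = sigma_delta([0.25] * 4000, levels)
        print(levels, sum(stream) / len(stream))
```

For a constant input, the running average of each stream approaches the input value; with more levels the individual output samples sit closer to the input, which is the intuition behind the lower quantization noise of the tri- and quad-level streams.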
At the same time, modern FPGAs, including many low-cost variants, contain highly efficient multi-bit multipliers or DSP blocks as hard macros. Unlike full-custom application-specific integrated circuit (ASIC) designs, in which every transistor used contributes to the overall silicon resource consumption, the use of such embedded DSP blocks in FPGAs does not consume any general logic resources, giving rise to a new dimension in system resource consumption trade-offs.
The goal of this paper is therefore to study the overall system tradeoffs when such quad-level BSSP blocks are utilized on FPGAs, taking into account the presence of embedded multi-bit DSP blocks.
In particular, two moderately sized applications, a digital phase lock loop (DPLL) and a quadrature phase-shift keying (QPSK) demodulator introduced in [1], are implemented using efficient quad-level BSSP blocks on an FPGA. The architecture of each sub-module is presented. The resource utilization of the FPGA implementations and their signal-to-noise performance are contrasted against conventional binary and tri-level realizations. Furthermore, the resource utilization of conventional multi-bit implementations of the two applications is estimated on an FPGA with built-in high-performance DSP blocks, which gives system designers insight into the design tradeoffs of employing quad-level BSSP on modern FPGAs.
Bit-Stream Lowpass Filter (LPF)
In the first-order bit-stream LPF [7], both the input x[n] and the output y[n] are 2-bit bit-streams, so the two gain blocks in Fig. 2 can be implemented by multiplexers. To demonstrate the performance gain of the quad-level bit-stream LPF over its bi- and tri-level counterparts, the bi-, tri- and quad-level filters are simulated. These filters have a normalized cut-off frequency of about 0.00186. The SNDR of the quad-level design is 65.2 dB, while those of the tri-level and bi-level LPFs are 62.5 dB and 53.6 dB, respectively. The SNDR is determined as the ratio of the output power of a sinusoid, at a normalized frequency of 0.00189 and with unity amplitude, to the total noise power in the frequency band of interest. The over-sampling ratio (OSR) is 128.
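The SNDR definition above can be sketched as a direct DFT over the in-band bins only. This is an illustrative measurement routine, not the simulation setup used for the reported figures. It assumes coherent sampling (the test sinusoid falls exactly on one DFT bin), frequencies normalized to the sampling rate, a band of interest running from bin 1 to bin N/(2·OSR), and exclusion of the DC bin. The function name `inband_sndr_db` is hypothetical.

```python
import cmath
import math


def inband_sndr_db(samples, signal_bin, osr):
    """Estimate the in-band SNDR (dB) of a record with coherent sampling.

    samples:    output sequence (bit-stream levels or multi-bit values)
    signal_bin: DFT bin that holds the test sinusoid
    osr:        over-sampling ratio; in-band bins are 1 .. N // (2 * osr)
    """
    n = len(samples)
    last_bin = n // (2 * osr)
    if not 1 <= signal_bin <= last_bin:
        raise ValueError("signal bin must lie inside the band of interest")
    # Precompute the N complex roots of unity used by every bin.
    twiddle = [cmath.exp(-2j * math.pi * m / n) for m in range(n)]
    signal_power = 0.0
    noise_power = 0.0
    for k in range(1, last_bin + 1):
        acc = sum(s * twiddle[(k * i) % n] for i, s in enumerate(samples))
        power = 2.0 * abs(acc / n) ** 2  # one-sided power of bin k
        if k == signal_bin:
            signal_power += power
        else:
            noise_power += power  # noise and distortion inside the band
    return 10.0 * math.log10(signal_power / noise_power)
```

Under these assumptions, a record of N = 16384 samples with the sinusoid in bin 31 corresponds to a normalized frequency of about 0.00189, and an OSR of 128 leaves 64 in-band bins.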
Bit-Stream Numerically Controlled Oscillator (NCO)
To construct a bit-stream numerically controlled oscillator (NCO), a fixed-frequency sigma-delta based oscillator is first presented, as shown in Fig. 3. To turn it into an NCO, the feedback gain K of the two DSDMs is changed by ΔK from its center value according to [1]. Simulation shows that the average SNDR of the quad-level NCO is 49.4 dB, while those of the tri-level and bi-level designs are 45.3 dB and 41.6 dB, respectively. The parameters used in the simulation are as follows:
Bit-Stream Divider
The block diagram of a bit-stream divider is shown in Fig. 4. Let x denote the average value of x[n], and likewise y and z for y[n] and z[n]. The average output z of the quad-level bit-stream divider converges to x/y.
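The convergence of the average output to x/y can be illustrated with a generic feedback loop: an integrator accumulates x minus y times the fed-back output, and a four-level quantizer produces the output. This is a behavioral sketch only and is not claimed to match the architecture of Fig. 4. It assumes constant (DC) inputs instead of bit-streams, y > 0, and |x/y| < 1. The names `quad_quantize` and `divide` are hypothetical.

```python
QUAD_LEVELS = (-1.0, -1.0 / 3.0, 1.0 / 3.0, 1.0)  # assumed quad-level values


def quad_quantize(v):
    """Return the quad level nearest to v."""
    return min(QUAD_LEVELS, key=lambda q: abs(v - q))


def divide(x_mean, y_mean, n_samples):
    """Average output of a feedback loop that forces x - y*z to zero on average."""
    integrator = 0.0
    z = 0.0
    total = 0.0
    for _ in range(n_samples):
        # Accumulate the error between the dividend and divisor * quotient.
        integrator += x_mean - y_mean * z
        z = quad_quantize(integrator)
        total += z
    return total / n_samples


if __name__ == "__main__":
    print(divide(0.3, 0.6, 20000))  # approaches 0.3 / 0.6
```

Because the integrator stays bounded, the long-run average of x - y·z must vanish, so the average of z approaches x/y. With bit-stream inputs, the product y·z would be formed by a bit-stream multiplier instead of the plain multiplication used here.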
Bit-Stream Square Root Circuit (SQRT)
The architecture of the quad-level SQRT is shown in Fig. 5. The average output z of the quad-level bit-stream SQRT converges to the square root of x.
Application Examples
In this section, two application examples, namely a DPLL and a QPSK demodulator, are described, and the FPGA implementation results of the bi-, tri- and quad-level designs are presented for comparison. The circuits are implemented on a Xilinx Virtex-5 XC5VLX30 FPGA using the design tool ISE 9.1i.
DPLL
A Type-1 DPLL [8] is shown in Fig. 6.
QPSK Demodulator
The QPSK demodulator in [1] has been implemented using the proposed quad-level signal processing building blocks. The QPSK demodulator consists of a synchronization part and a phase detection part, as depicted in [1]. The specifications of this particular implementation for the bi-, tri- and quad-level designs are shown in Tables 2 and 3. In Fig. 7, the quad-level design achieves a more well-defined constellation, which leads to better performance. The FPGA implementation results of the three bit-stream QPSK demodulators are shown in Table 4. To compare the hardware complexity of the bi-, tri- and quad-level BSSP circuit modules, Table 5 shows the FPGA resource utilization of the individual components in this particular QPSK demodulator realization.
Discussion
In this paper, we have presented various quad-level BSSP circuit modules that are extended from the conventional 1-bit designs. We have also compared the signal-to-noise performance and resource requirements of these components with those of existing bi-level and tri-level designs. In general, the quad-level implementations achieve better signal-to-noise performance than their bi-level and tri-level counterparts at the expense of higher circuit complexity. Due to the higher complexity of the quad-level bit-stream multiplier, applications that require multipliers need many more FPGA resources (LUTs and FFs) in the quad-level case, as shown in Tables 1, 4 and 5. Thus, in terms of the tradeoff between performance and complexity, tri-level BSSP appears to be the best of the three: it achieves significantly better signal-to-noise performance than bi-level BSSP, with a more moderate increase in circuit complexity than quad-level BSSP incurs.
When the BSSP approach is compared with the conventional Nyquist-rate approach for FPGA implementation, the DSP48E slices incorporated in the Virtex-5 series allow very efficient multi-bit implementation of signal processing circuits. For example, when all the multipliers and accumulators are fitted into DSP48E elements, an equivalent eight-bit implementation of the Type-1 DPLL described in Section 3.1 consumes only 64 LUTs plus 8 DSP48E slices. The BSSP approach therefore appears less resource-efficient than the conventional multi-bit approach when the latter is implemented on an FPGA using the "free" DSP resources.
On the other hand, one of the advantages of BSSP is that the decimator and interpolator needed by the conventional Nyquist-rate approach are not required. Depending on the application, the hardware resources for a decimator can be as low as 556 LUTs as in [9] or 2116 Virtex-4 slices as in [10]. As one decimator is required for each analog input and one interpolator is required for each analog output, the total amount of FPGA resources for decimators and interpolators can be large. Thus, depending on the complexity of the final system, the BSSP approach can still be more resource-efficient than the conventional multi-bit approach when the number of LUTs needed to implement the BSSP circuits is smaller than or comparable to that of the decimator and interpolator implementation for the Nyquist-rate multi-bit approach.
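The comparison above reduces to simple LUT bookkeeping, sketched below. The 64-LUT core and the 556-LUT decimator are the figures quoted in the text. The interpolator cost and the BSSP LUT count are not given here, so they appear only as parameters with hypothetical values. The function names are illustrative, and DSP48E slices are deliberately left out of the LUT total.

```python
def conventional_lut_total(core_luts, n_inputs, n_outputs,
                           decimator_luts, interpolator_luts):
    """LUTs of a multi-bit design plus one decimator per analog input
    and one interpolator per analog output."""
    return (core_luts
            + n_inputs * decimator_luts
            + n_outputs * interpolator_luts)


def bssp_saves_luts(bssp_luts, conventional_luts):
    """True when the BSSP design needs no more LUTs than the conventional one."""
    return bssp_luts <= conventional_luts


if __name__ == "__main__":
    # Eight-bit Type-1 DPLL core (64 LUTs) with one decimator as in [9];
    # no interpolator is counted in this example.
    total = conventional_lut_total(64, 1, 0, 556, 0)
    print(total)
    # 500 is a placeholder BSSP LUT count, not a measured result.
    print(bssp_saves_luts(500, total))
```

As the number of analog inputs and outputs grows, the conversion overhead scales with it while the BSSP design avoids that overhead entirely, which is the situation in which BSSP can come out ahead.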
