Abstract-Reconfigurable Multiplier Blocks (ReMB) offer significant complexity reductions in multiple constant multiplications in time-multiplexed digital filters. In this paper, the ReMB technique is employed in the implementation of a half-band 32-tap FIR filter on both Xilinx Virtex FPGA and UMC 0.18 µm CMOS technologies. Reference designs have also been built using standard time-multiplexed architectures and, for the FPGA design, the off-the-shelf Xilinx Core Generator system. All designs are then compared in terms of their area and delay figures. It is shown that the ReMB technique can significantly reduce the area of the multiplier circuitry and the coefficient store, as well as reducing the delay.
I. INTRODUCTION
Digital filters are by nature multiply-add intensive. The methodologies, techniques and procedures deployed in realizing the hardware that undertakes these tasks have matured over the years and predominantly rely on standard multiplier and adder structures in either fully parallel or time-multiplexed resource-sharing architectures.
However, a lot of redundancy still remains in the arithmetic circuits and their associated computations, as there is little sharing of the low-level intermediate calculations.
The multiplier block approach has addressed this gap and has resulted in significant reductions in the power, area and delay of the multiple constant multiplications in fully parallel structures [2]-[4].
Time-multiplexed designs are more efficient in terms of the resources needed.
They time-share the available hardware, usually a multiply-accumulate block and a memory. FIR filters, IIR filters, filter banks, poly-phase filters and adaptive filters can all be implemented as time-multiplexed structures.
[Figure 1]
In recent years, the application of multiplier blocks to time-multiplexed digital filter designs has also been studied [1]. It was shown that the redundancy can be reduced and that the resulting specialized multiplier design can be much more efficient in terms of area and computational complexity than the general-purpose multiplier with its associated coefficient store. This novel methodology was named Reconfigurable Multiplier Blocks (ReMB) [1].
To apply the ReMB method to time-multiplexed systems, the coefficient store and the general-purpose multiplier in Fig. 1(a) and (b) are replaced by a multiplier block, which generates all the coefficient products, and a multiplexer that selects the required one, as depicted in Fig 2(a). At first sight, this method seems to incur redundancy, because all the generated products except the selected one are wasted. However, it has been shown that, by pushing the multiplexer deep into the multiplier block design, the redundancy can be reduced and the resulting specialized multiplier design can be more efficient in terms of area and computational complexity than the general-purpose multiplier plus the coefficient store [1]. Fig 2(b) shows a multiplier block that generates the products by 784, 156 and 600, and a multiplexer to select the desired coefficient product. It uses five adders and a 3-to-1 multiplexer. By pushing the multiplexing operation into the multiplier block, the same functionality can be implemented using three adders and three 2-to-1 multiplexers, as given in Fig 2(c).
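The arrangement of a multiplier block followed by an output multiplexer can be sketched in Python as below. This is only a minimal illustration: the shift-and-add decomposition is an assumption chosen for this example and is not claimed to be the graph drawn in Fig 2(b), and the function names `multiplier_block` and `time_multiplexed_product` are hypothetical.

```python
def multiplier_block(x):
    """Return 784*x, 156*x and 600*x using only shifts, adds and subtracts.

    Illustrative decomposition only; not necessarily the graph of Fig 2(b).
    """
    x5 = (x << 2) + x             # adder 1: 5x   = 4x + x
    x3 = (x << 2) - x             # adder 2: 3x   = 4x - x
    x600 = (x5 << 7) - (x5 << 3)  # adder 3: 600x = 640x - 40x
    x156 = (x5 << 5) - (x << 2)   # adder 4: 156x = 160x - 4x
    x784 = (x3 << 8) + (x << 4)   # adder 5: 784x = 768x + 16x
    return {784: x784, 156: x156, 600: x600}


def time_multiplexed_product(x, coefficient):
    """Model the output multiplexer: every product is computed, one is kept."""
    return multiplier_block(x)[coefficient]
```

Every call evaluates all three products even though only one is used, which is the redundancy that moving the multiplexers into the block is meant to remove.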
A multiplexer connected to one input of an adder (in this context, adder refers to an adder, a subtractor or an adder/subtractor) forms the basic structure of the reconfigurable multiplier blocks. The size of the multiplexer and the functionality of the adder depend on the platform and the design criteria.
In Fig 2(b) and (c), each node (•) corresponds to an adder. The edges represent the inputs to the adder. The number given at each edge shows the multiple of the signal achieved by a left shift. If it is negative, then that particular signal is subtracted. The italic numbers next to the nodes are the product(s) generated by those nodes.
The area of a 2-to-1 multiplexer is considerably smaller than that of a full adder in a CMOS VLSI implementation. Moreover, the unnecessary evaluations of the products are avoided.
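The graph notation and the basic structure described above can be modelled with a short Python sketch. The edge weights used in `example_node` (1, 4 and 8) are hypothetical values picked for illustration and do not come from Fig 2; `basic_structure` is likewise an assumed name.

```python
def basic_structure(fixed_in, mux_in0, mux_in1, select, subtract=False):
    """One basic structure: a 2-to-1 multiplexer feeding one adder input.

    select = 0 picks mux_in0, select = 1 picks mux_in1.
    subtract = True models a negative edge weight on the multiplexed input.
    """
    chosen = mux_in1 if select else mux_in0
    return fixed_in - chosen if subtract else fixed_in + chosen


def example_node(x, select):
    """A node whose fixed edge has weight 1 and whose multiplexed edge
    has weight 4 or 8 (left shifts by 2 or 3); hypothetical values."""
    return basic_structure(x, x << 2, x << 3, select)
```

With these weights a single adder produces either 5x or 9x depending on the select bit, so one node can serve two coefficients at the cost of one 2-to-1 multiplexer.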
Implementation of these circuits on Field Programmable Gate Arrays (FPGA) benefits from the fixed-resource FPGA environment. As an example, one of the most common FPGA platforms, the Xilinx Virtex device family, contains Configurable Logic Blocks (CLB) with Look-Up Tables (LUT) that implement combinational logic with up to four inputs. It also has dedicated circuitry around the LUT for fast addition and multiplication, as shown in Fig 3(a). By utilizing the dedicated circuitry, a full adder can be implemented using one LUT as an XOR gate. However, it is also possible to fit the 2-to-1 multiplexer into the same LUT, as in Fig 3(b), reducing the area requirement by more than 50% for the design given in Fig 2(c).
In this paper, we apply the ReMB technique to a 32-tap half-band FIR filter to demonstrate its benefits on both FPGA and ASIC implementations. Furthermore, we implement two reference designs of the same filter using standard time-multiplexed filter architectures.
For the FPGA implementation, we also compare our design with the readily available, off-the-shelf implementation produced by the Xilinx Core Generator system. For the ASIC implementation, the proposed and the reference designs are implemented in UMC 0.18 µm CMOS technology. Section 2 of the paper gives the design details of the reference filters and the ReMB filters for fixed-point implementation. Section 3 discusses the implementation issues specific to FPGA and ASIC, reports the area and delay figures for all designs, and comments on the savings achieved by the ReMB technique. Section 4 concludes the paper.
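The claim that a 2-to-1 multiplexer fits into the same four-input LUT as the adder's XOR can be checked with a simplified bit-level model in Python. The sketch assumes the usual carry-chain arrangement (the LUT produces a propagate signal, a dedicated multiplexer forwards the carry, a dedicated XOR forms the sum bit); the names `lut4_init`, `PROPAGATE_LUT` and `mux_add` are hypothetical, and the model ignores timing and slice packing.

```python
def lut4_init(func):
    """Build the 16-entry truth table of a 4-input LUT for boolean func."""
    return [func((i >> 3) & 1, (i >> 2) & 1, (i >> 1) & 1, i & 1)
            for i in range(16)]


# LUT contents: propagate = a XOR (b1 if sel else b0).
# The function uses exactly four inputs, so it fits a single 4-input LUT.
PROPAGATE_LUT = lut4_init(lambda a, b0, b1, sel: a ^ (b1 if sel else b0))


def lut4(table, a, b0, b1, sel):
    """Look up one output bit of the LUT."""
    return table[(a << 3) | (b0 << 2) | (b1 << 1) | sel]


def mux_add(a, b0, b1, sel, width=16):
    """Compute a + (b1 if sel else b0) modulo 2**width, one LUT per bit."""
    carry, result = 0, 0
    for i in range(width):
        ai, b0i, b1i = (a >> i) & 1, (b0 >> i) & 1, (b1 >> i) & 1
        p = lut4(PROPAGATE_LUT, ai, b0i, b1i, sel)  # LUT output (propagate)
        result |= (p ^ carry) << i                  # dedicated XOR: sum bit
        carry = carry if p else ai                  # dedicated carry multiplexer
    return result
```

In this model the multiplexed adder uses `width` LUTs, the same number a plain adder would need, which is the sense in which the multiplexer comes at no extra LUT cost.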
II. DESIGN DETAILS
We designed the 32-tap half-band FIR filter in Matlab. Due to the nature of the half-band filter, the coefficients are symmetric and every other coefficient is zero except the middle coefficient.
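The half-band properties mentioned above (symmetry, and zeros at every other coefficient except the middle one) can be observed on a small prototype. The sketch below is an assumption made for illustration: it uses a Hamming-windowed ideal half-band response with 31 coefficients, which is not the Matlab design used in the paper, and `halfband_prototype` is a hypothetical name.

```python
import math


def halfband_prototype(num_taps=31):
    """Hamming-windowed ideal half-band low-pass response (odd length).

    Illustrative only; the paper's actual coefficients are not reproduced.
    """
    mid = (num_taps - 1) // 2
    taps = []
    for n in range(num_taps):
        k = n - mid
        # Ideal response with cutoff pi/2: h[k] = sin(pi*k/2) / (pi*k), h[0] = 1/2
        ideal = 0.5 if k == 0 else math.sin(math.pi * k / 2) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    return taps
```

In such a response only the middle coefficient and the coefficients at odd offsets from it are non-zero, and each non-zero value away from the middle appears twice, which is what lets a time-multiplexed design store only the distinct non-zero coefficients.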
For the fixed-point implementation of this filter, we quantized the coefficients to 10 bits by rounding the exact coefficients to the nearest integer. The assumed data word length is 16 bits. The main reason for the 10-bit coefficient word length was the algorithm that generated the ReMB structure.
A typical time-multiplexed TDL filter architecture, which is used as a reference design, is shown in Fig 5(a). All the coefficients are stored in a coefficient memory and the incoming input samples are stored in an input memory. A simple controller operates to process one filter tap per cycle. In Fig 5(b), only the distinct non-zero coefficients (their [...]
Fig 5(c) shows the proposed implementation of the filter using ReMB. The coefficient store and the general-purpose multiplier in Fig 5(b) are replaced with a ReMB structure that performs the multiplication for the distinct coefficients stored in the coefficient memory. The complexity of the controller is kept the same, since it generates the same control signals as in Fig 5(b).
The ReMB block used in the filter is shown in Fig 6. It is generated by an algorithm described in [1]. It comprises seven basic structures of the smallest size (a 2-to-1 multiplexer connected to one input of an adder). The coefficients of the filter are generated at the output of the basic structure at layer 3 by selecting particular inputs of the multiplexers. Select signals are not shown on the diagram for simplicity. Din is the input signal to the block. Table 1 shows the set of select values required to produce each coefficient of the filter. In the table, basic structures are indexed from 0 to 6, starting from the top of layer 1 downward and then continuing through layer 2 and layer 3. A select value of '0' means that the top branch of the multiplexer is selected. An 'X' value means that the particular basic structure is not involved in generating the coefficient.
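The coefficient quantization and the one-tap-per-cycle operation of the reference architecture can be sketched in Python as follows. The scale factor of 2^(bits-1) and the seven example coefficients used in the usage note are assumptions for illustration, not values from the paper; `quantize` and `time_multiplexed_fir` are hypothetical names.

```python
def quantize(coefficients, bits=10):
    """Scale to signed integers of the given width and round to nearest.

    Assumes every |coefficient| is below 1 so that the result fits in `bits` bits.
    Python's round() sends exact ties to the nearest even integer.
    """
    scale = 1 << (bits - 1)
    return [round(c * scale) for c in coefficients]


def time_multiplexed_fir(samples, coefficients):
    """Filter with a single multiply-accumulate unit, one tap per cycle."""
    input_memory = [0] * len(coefficients)      # delay line, newest sample first
    outputs = []
    for sample in samples:
        input_memory = [sample] + input_memory[:-1]
        accumulator = 0
        for tap in range(len(coefficients)):    # each iteration models one cycle
            accumulator += coefficients[tap] * input_memory[tap]
        outputs.append(accumulator)
    return outputs
```

For example, `quantize([-0.05, 0.0, 0.3, 0.5, 0.3, 0.0, -0.05])` gives `[-26, 0, 154, 256, 154, 0, -26]`, and feeding a unit impulse through `time_multiplexed_fir` with those integers returns the same list.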
A separate decoder is designed to produce these select signals by using the output of the main controller that was used to address the coefficient memory. 
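How a select table and its decoder relate to a layered block can be illustrated with a toy example in Python. The two-layer block below, its shift amounts and the coefficient ordering are all hypothetical; they are not the seven-structure block of Fig 6 or the contents of Table 1.

```python
from itertools import product


def toy_remb(x, selects):
    """Two-layer toy block; each layer is a 2-to-1 multiplexer feeding an adder."""
    s0, s1 = selects
    layer1 = x + ((x << 3) if s0 else (x << 1))             # 3x (s0=0) or 9x (s0=1)
    layer2 = layer1 + ((layer1 << 2) if s1 else (x << 4))   # +16x (s1=0) or +4*layer1 (s1=1)
    return layer2


def build_select_table():
    """Enumerate all select words and record the constant each one produces."""
    table = {}
    for selects in product((0, 1), repeat=2):
        constant = toy_remb(1, selects)    # x = 1 exposes the multiplier value
        table.setdefault(constant, selects)
    return table


SELECT_TABLE = build_select_table()


def decoder(coefficient_address, coefficient_order):
    """Map a controller address to the select signals of the block."""
    return SELECT_TABLE[coefficient_order[coefficient_address]]
```

Here the four select words yield the constants 19, 15, 25 and 45, so `toy_remb(x, decoder(address, [19, 25, 45, 15]))` returns the product of `x` with the coefficient stored at that address in a conventional design.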
The filters are implemented in VHDL using HDL Designer™ and synthesized using Leonardo Spectrum™. The FPGA implementations are realized on a Virtex FPGA with model number XCV300BG432-4. They are placed and routed (PAR) using Xilinx ISE 5.2 software. The area and delay figures reported in this section for the FPGA designs are obtained after PAR.
Furthermore, another reference design is implemented with the off-the-shelf Xilinx Coregen™ software, using the parametrizable MAC FIR core (version 3.0) [5].
The ASIC implementations are targeted at the UMC 0.18 µm CMOS technology. They are not placed and routed, and all the results reported here are obtained after synthesis. There is no quantization in the data-path of any of the designs. The input data are 16 bits wide and the output data are 30 bits wide, with full precision. No pipelining is applied to the filters given in Fig 5. Critical path delays are reported for the full combinational logic in the multiply-and-accumulate circuits. However, the filter generated by Coregen™ is pipelined, as the latency of the filter is more than the number of filter taps.
Table II also shows that the coefficient memory and the input memory are not included in the area figures for the filters given in Fig 5. The area figure for the Coregen filter, on the other hand, includes the memory for the coefficients but not for the input data.
The effect of the increased controller complexity in Fig 5(b) can be observed in the area figures for both FPGA and ASIC implementations. However, the multiplexer in the multiply-and-accumulate path in Fig 5(b) only contributes to the area for the ASIC implementations. For the Virtex implementation, the components inside the dashed-line in Fig 5(b) and (c) can be fitted into one LUT, which in turn means the multiplexer comes free.
The area savings achieved by the ReMB technique for the FPGA and the ASIC implementations of this particular example are around 20%. The ReMB block given in Fig 6 is not optimal in terms of the number of basic structures [1]. A better ReMB design, which would share more intermediate partial products, would increase the area savings.
The decrease in the critical path delay for the FPGA implementations is due to the reduced logic depth of the multiplier, since the extra multiplexer stages between adders do not contribute to the delay. For the ASIC implementations, however, the critical path delay increased a little because of the multiplexers. The reduced logic depth of the adder network in the multiplier avoided a large increase in the delay. Reduced logic depth also contributes to lower power, since fewer glitches are produced.
If pipelining were considered, the delays associated with all implementations would be similar. Even then, the ReMB filter would have the smallest area, because the same number of latches or flip-flops would be added to all of the filters.
IV. CONCLUSIONS
We have implemented a half-band 32-tap FIR filter using the ReMB technique and compared it with the reference designs for FPGA and ASIC implementations.
The ReMB technique reduced the area of both the FPGA and the ASIC implementations by around 20%.
The critical-path delay in the FPGA implementations is reduced owing to the efficient mapping of the basic structure and the smaller logic depth in the multiplier. However, the multiplexers resulted in a slight increase in the delay of the ASIC implementation.
Pipelining the filter structure would reduce the delay of all circuits to a level comparable with the Coregen design. The ReMB filter would still have the smallest area if pipelining were considered.
