Abstract-Due to an increasing demand for on-sensor biosignal processing in wireless ambulatory applications, it is crucial to reduce the power consumption and hardware cost of the signal processing units. The Discrete Wavelet Transform (DWT) is a very popular tool for time-frequency analysis of biosignals, used in artifact removal, detection and compression, and can be implemented as a two-branch filter bank. This work proposes a new, completely multiplier-free filter architecture for implementing Daubechies wavelets, which targets Field-Programmable Gate Array (FPGA) technologies by replacing multipliers with Reconfigurable Multiplier Blocks (ReMBs). The results show that the proposed technique reduces the hardware complexity by 25% in terms of Look-Up Table (LUT) count and can be used in low-cost embedded platforms for ambulatory physiological signal monitoring and analysis.
I. INTRODUCTION
Biomedical signals are known to be non-stationary; thus their analysis requires both local and spatial information in order to maintain the fidelity of the signal during processing steps such as artefact removal and detection. The Discrete Wavelet Transform (DWT) performs signal decomposition via translated and dilated versions of a basis function, which effectively localizes the signal in both the time and frequency domains. The DWT can be realized by two-channel quadrature mirror filter banks with a lowpass filter (h_0) and a highpass filter (h_1), as shown in Fig. 1 [1]. Multiresolution analysis is achieved by recursive application of the filter bank to the lowpass filtered output. The output of each filter is downsampled by 2; the outputs of the lowpass and highpass branches are known as the approximation coefficients (cA_m[n]) and detail coefficients (cD_m[n]), covering the spectrum below and above half of the sampling frequency, respectively. Fig. 1 demonstrates the implementation of a three-level analysis filter bank. Among the wide range of wavelet families, Daubechies wavelets are popular choices, and the Daubechies-4 (db4) wavelet, with four vanishing moments and 8 taps, has been used in many different biomedical signal processing applications [2]-[4]. Daubechies wavelets lead to orthogonal filter banks in which the lowpass and highpass filters are non-symmetric, with equal filter lengths and real, fixed coefficients. They are known to be maximally flat filters with dyadic coefficients that can be represented in fixed-point arithmetic without significant loss of accuracy. Thus, they can be implemented as signed binary numbers, which can reduce the hardware complexity and power consumption.
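The recursive structure of the analysis filter bank described above can be sketched in a few lines of Python. This is only an illustrative software model, not the hardware design of this work: the db4 coefficients are the commonly tabulated values rounded to 12 decimals, their ordering convention and the circular (periodic) boundary handling are assumptions, and the function names `analyze_once` and `dwt` are hypothetical.

```python
import math

# db4 scaling (lowpass) coefficients, rounded to 12 decimals as commonly
# tabulated; the ordering convention is an assumption of this sketch.
H0 = [0.230377813309, 0.714846570553, 0.630880767930, -0.027983769417,
      -0.187034811719, 0.030841381836, 0.032883011667, -0.010597401785]
# Highpass filter from the alternating-flip relation of orthogonal filter banks.
H1 = [(-1) ** n * H0[len(H0) - 1 - n] for n in range(len(H0))]


def analyze_once(x, h):
    """Circularly filter x with h and keep every second output sample."""
    n = len(x)
    return [sum(h[k] * x[(2 * i - k) % n] for k in range(len(h)))
            for i in range(n // 2)]


def dwt(x, levels):
    """Return ([cD_1, ..., cD_levels], cA_levels) for an even-length input."""
    details, approx = [], list(x)
    for _ in range(levels):
        details.append(analyze_once(approx, H1))  # highpass branch: detail
        approx = analyze_once(approx, H0)         # lowpass branch: recursed on
    return details, approx


# Arbitrary test signal of length 64 (not a biosignal recording).
signal = [math.sin(2 * math.pi * 3 * t / 64) + 0.1 * (t % 5) for t in range(64)]
details, approx = dwt(signal, levels=3)

# For an orthogonal filter bank with periodic extension, the signal energy
# should be preserved across the subbands up to rounding of the coefficients.
energy_in = sum(v * v for v in signal)
energy_out = sum(v * v for band in details + [approx] for v in band)
```

With three levels and a 64-sample input, the sketch yields detail bands of 32, 16 and 8 samples plus an 8-sample approximation band, mirroring the three-level structure of Fig. 1.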
Biosignal processing tools such as filters and discrete transforms employ constant multiplications that can be implemented as shift-add operations to reduce the hardware complexity and cost of medical systems. For example, in [5], [6], the authors presented a decimation filter chain for ElectroCardioGram (ECG) acquisition systems which replaces constant multiplications with shift-add operations. The DWT also employs fixed-coefficient filters associated with a selected mother wavelet; hence it can benefit from shift-add network topologies. Several studies, mainly in the area of image processing, used these networks to implement wavelet filters, including 5/3 and 9/7 lifting-based wavelets [7]-[9], and 4- and 6-tap Daubechies filters [10]. However, to the best of the authors' knowledge, the use of the Reconfigurable Multiplier Block (ReMB) [11] for implementing wavelet filters has not been investigated in the biomedical signal processing literature. This paper presents a hardware-efficient ReMB structure and its Field-Programmable Gate Array (FPGA) implementation that can be employed in time-multiplexed filter structures for wavelet analysis of biomedical signals, suitable for low-cost portable medical devices. The design is based on efficiently employing the dedicated resources of the FPGA as presented in [11], [12], with an extension that takes advantage of newer FPGA technology. Section II introduces the ReMB method, followed by the fixed-point quantization considerations for the db4 coefficients, the design details and the implemented structure of the proposed ReMB. Section III compares the resource utilization figures of the proposed design and a general purpose multiplier. Finally, Section IV presents the conclusions.
II. METHOD
For this study, the ReMB design methodology introduced in [11] is extended for recent FPGAs, which replace 4-input Look-Up Tables (LUTs) with 6-input ones. A 1-bit full adder/subtracter can be implemented using the dedicated carry-chain logic and an LUT for the remaining XOR gate. The concept of reconfigurability in multiplier blocks employs a multiplexer whose output is connected to at least one input of the adder. The multiplexer can be implemented using the unused pins of the LUT that is used for the XOR gate. Demirsoy et al. [11] introduced an ReMB algorithm which maximizes the use of FPGA logic elements by adopting the "basic graph structure". A basic structure, as shown in Fig. 2 (a), is simply a two-input adder with at least one of its inputs connected to a 2:1 multiplexer, which can be implemented with a 4-input LUT. Due to the dedicated resources of the FPGA, adders in basic structures are implemented as ripple-carry adders. In this work, the basic structures are modified using the new 6-input LUTs, which enable the replacement of 2:1 multiplexers with 3:1 ones at no additional cost, as shown in Fig. 2 (b). The inputs of these muxes can be connected to the input of the ReMB, to the output of another basic structure, or to ground. In order to implement a set of coefficients, a number of these basic structures can be interconnected in chain form (i.e. horizontally cascaded) and in tree form (i.e. inputs of a mux connected to the output of another basic structure). The number of coefficients generated at the output depends on the basic structure topology, the number of basic structures and how they are interconnected. For example, if two basic structures as given in Fig. 2 (b) are interconnected (each with 3 different outputs), then the output set size is 9 (3 × 3). To find a valid ReMB design for a target coefficient set, it is critical to determine the required depth of the design and the adder depth of each coefficient.
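The 3 × 3 output set of two chained basic structures can be illustrated with a small behavioural model. This is a sketch under stated assumptions: the shift amounts and the mux wiring below are hypothetical and chosen only so that the nine constants are distinct; they are not the wiring of Fig. 2 or Fig. 4, and the names `basic_structure` and `chain_multiplier` are illustrative.

```python
from itertools import product


def basic_structure(fixed, mux_inputs, select, subtract=False):
    """Two-input adder/subtracter whose second operand comes from a 3:1 mux."""
    operand = mux_inputs[select]
    return fixed - operand if subtract else fixed + operand


def chain_multiplier(x, s0, s1):
    """Two chained basic structures; shifts and wiring are hypothetical."""
    # Stage 0 computes 8x + {x, 2x, 0}, where 0 models the grounded mux input.
    y0 = basic_structure(x << 3, (x, x << 1, 0), s0)
    # Stage 1 computes 16*y0 + {x, 2x, 0}.
    return basic_structure(y0 << 4, (x, x << 1, 0), s1)


# With x = 1 each output equals the constant the block multiplies by.
constants = sorted({chain_multiplier(1, s0, s1)
                    for s0, s1 in product(range(3), repeat=2)})
```

In this hypothetical wiring every select combination maps to a different constant, so the set reaches the upper bound of nine; a less careful choice of shifts would produce collisions and a smaller set.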
The depth of a design represents the number of cascaded stages required to obtain the required number of coefficients, and the adder depth represents the number of cascaded adders required on each path between the input and the output nodes to generate each coefficient. Following these requirements, the ReMB depth can be generalized using Eqn. 1.
where the two quantities entering Eqn. 1 are the maximum adder depth and the number of cascaded basic structures (i.e. layers).
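As a worked illustration of the adder depth of a single coefficient, the sketch below uses a commonly used estimate: the minimum adder depth of a constant is the base-2 logarithm, rounded up, of the number of nonzero digits in its canonical signed digit (CSD) representation. This estimate is brought in for illustration and is not taken from [11]; the function names are hypothetical, and the example integers are obtained by rounding commonly tabulated db4 coefficients scaled by 2^10, not copied from Table I.

```python
def csd_digits(n):
    """Canonical signed digit form of a positive integer, LSB first."""
    digits = []
    while n:
        if n & 1:
            d = 2 - (n & 3)  # +1 when n % 4 == 1, -1 when n % 4 == 3
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits


def adder_depth(n):
    """Estimated minimum adder depth: ceil(log2(number of nonzero digits))."""
    nonzero = sum(1 for d in csd_digits(n) if d)
    return (nonzero - 1).bit_length()


# Illustrative coefficient magnitudes (an assumption, see the text above).
example_magnitudes = [236, 732, 646, 29, 192, 32, 34, 11]
depths = {c: adder_depth(c) for c in example_magnitudes}
max_depth = max(depths.values())
```

For instance, 59 = 64 - 4 - 1 has three nonzero signed digits, so it needs two cascaded adders, whereas a pure power of two such as 32 needs none.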
A. Coefficient Quantization
The accuracy of the DWT depends on the precision of the decomposition and reconstruction filter coefficients. Quantization of floating-point coefficients results in a quantization error which accumulates as it propagates through the filter bank. In order to evaluate finite word-length effects on input data, the filter coefficients associated with the db4 mother wavelet are quantized with various precisions and employed in the DWT. ECG and ElectroEncephaloGram (EEG) signals obtained from Physionet [13] are selected as reference signals for the evaluation. The error variance between the input and the reconstructed data is measured with both floating-point and fixed-point filter coefficients. Fig. 3 demonstrates the measured error values using different coefficient word-lengths for both ECG (blue) and EEG (red). An error variance of approximately -70 dB can be observed with a coefficient word-length of 11 bits (10 fractional bits), which is considered negligible for this study, as such an error is not observable during time domain analysis by clinicians. The aforementioned coefficients are also scaled by 2^10 in order to obtain integer values, and their absolute values are given in Table I in the column denoted as ℤ.
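The scaling step can be sketched as follows. The floating-point db4 values are the commonly tabulated ones rounded to 12 decimals, and round-to-nearest is an assumption, since the rounding mode is not stated here; the resulting integers are therefore illustrative and may differ from the entries of Table I.

```python
FRACTIONAL_BITS = 10
SCALE = 1 << FRACTIONAL_BITS  # 2^10

# Commonly tabulated db4 scaling coefficients (12 decimals); an assumption.
DB4 = [0.230377813309, 0.714846570553, 0.630880767930, -0.027983769417,
       -0.187034811719, 0.030841381836, 0.032883011667, -0.010597401785]

# Signed integers after scaling by 2^10 and rounding to nearest.
quantized = [round(c * SCALE) for c in DB4]
# Absolute values, i.e. the kind of quantity listed in the ℤ column.
magnitudes = [abs(q) for q in quantized]
# Per-coefficient quantization error after converting back to a fraction.
errors = [abs(c - q / SCALE) for c, q in zip(DB4, quantized)]
max_error = max(errors)

for c, q in zip(DB4, quantized):
    print(f"{c:+.12f} -> {q:+5d} (as fraction {q / SCALE:+.10f})")
```

With round-to-nearest, the per-coefficient error is bounded by half of the least significant bit, 2^-11, and all integers fit into an 11-bit signed word.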
B. ReMB Design Details
The db4 filters are implemented using an ReMB built from the basic structures given in Fig. 2 (b). The highpass and lowpass filters share the same eight distinct coefficient magnitudes, which simplifies the ReMB design procedure. Before starting the design, it is critical to evaluate the adder depth of the individual coefficients and the number of layers needed to generate eight coefficients. One basic structure as in Fig. 2 (b) can generate three distinct values (i.e. its depth is three). Interconnecting two of these structures as a chain can generate a maximum of nine distinct integers, whereas three basic structures interconnected in a tree form with two layers can generate 27 integers at most. Therefore, the number of layers required for eight coefficients with the proposed basic structures is two. In addition, the maximum adder depth of the coefficients is two; however, three basic structures are required to realize all coefficients. Therefore, according to Eqn. 1, the ReMB depth is calculated as two, with three basic structures interconnected in a tree form. Fig. 4 demonstrates the ReMB designed accordingly, which employs three adders, one 2:1 multiplexer and three 3:1 multiplexers. The main controller addresses the select values required to generate the coefficients in the correct order. The select values for each mux (M0:M3) in Fig. 4 are denoted as S0:S3, respectively, and are presented in Table I, where '0' selects the top input, '1' selects the bottom input (for the 2:1 mux) or the middle input (for a 3:1 mux), and '2' selects the bottom input of a 3:1 mux. In addition, the control lines for each adder (A0:A2) are similarly denoted as Sa0:Sa2, and the values '0' and '1' implement an addition and a subtraction, respectively. The proposed ReMB is targeted at FPGA platforms, where it takes advantage of the dedicated fast carry logic and implements the multiplexers at no additional cost, as described before.
However, when non-FPGA technologies are targeted, the ReMB can be redesigned with the increased flexibility of using larger muxes. On FPGA platforms, the resource cost of an individual mux with more than three inputs is comparable to that of an adder. For instance, a 4-bit 4:1 mux and a 4-bit full adder each utilize four LUTs, whereas for non-FPGA technologies multiplexers are relatively cheaper [14].
III. EXPERIMENTAL RESULTS AND COMPARISONS
For this study, two time-multiplexed Time-Delay Line (TDL) implementations are realized in order to compare the resource utilization of a conventional reference design with a general purpose multiplier and a design with the ReMB, which are demonstrated in Fig. 5 (a) and (b), respectively. The reference time-multiplexed FIR filter structure comprises an input memory, a coefficient memory, and a single Multiply-ACcumulate (MAC) unit with a general purpose multiplier. Such a filter structure operates sequentially. At every cycle, the incoming data is multiplied by one coefficient stored in memory, and this process is controlled by a simple control unit. Each generated product is accumulated with the previous one by using an accumulator and a register.
On the other hand, the proposed multiplier block generates the absolute values of the required filter coefficients and thus replaces the coefficient memory and the general purpose multiplier of the reference design. As can be seen from Fig. 5 (b), a multiplexer is placed after the ReMB, which is responsible for selecting between the generated coefficient and its complement. Here, the controller is responsible for addressing the correct coefficient for each tap by generating the correct control lines for the multiplexers and adders/subtracters employed in the ReMB, as well as for the multiplexer after it (given in Section II).
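The difference between the two time-multiplexed structures can be mimicked in software. The sketch below is a behavioural model under stated assumptions, not the implemented hardware: the generic bit-serial `shift_add_multiply` merely stands in for the ReMB (whose actual adder/mux network is given in Fig. 4), the negation models the multiplexer that selects the complement, and the integer coefficients and their signs are illustrative values rather than the entries of Table I.

```python
def shift_add_multiply(x, magnitude):
    """Multiply x by a non-negative integer using only shifts and additions."""
    acc, shift = 0, 0
    while magnitude:
        if magnitude & 1:
            acc += x << shift
        magnitude >>= 1
        shift += 1
    return acc


def fir_mac(samples, coeffs):
    """Reference structure: one general purpose multiplication per tap."""
    out = []
    for n in range(len(samples)):
        acc = 0
        for k, c in enumerate(coeffs):      # one tap per cycle
            if n - k >= 0:
                acc += c * samples[n - k]   # multiply-accumulate
        out.append(acc)
    return out


def fir_multiplierless(samples, coeffs):
    """Stand-in for the proposed structure: magnitude product, then sign mux."""
    out = []
    for n in range(len(samples)):
        acc = 0
        for k, c in enumerate(coeffs):
            if n - k >= 0:
                product = shift_add_multiply(samples[n - k], abs(c))
                acc += -product if c < 0 else product  # select complement
        out.append(acc)
    return out


# Illustrative integer coefficients and input samples (assumptions).
COEFFS = [236, 732, 646, -29, -192, 32, 34, -11]
samples = [3, -7, 12, 0, 5, -1, 8, 8, -4, 2]
reference = fir_mac(samples, COEFFS)
proposed = fir_multiplierless(samples, COEFFS)
```

Both functions produce identical outputs, which reflects the point that only coefficient magnitudes need to be generated by the multiplier block while the sign is handled by a single selection step.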
For all experiments, the filter architectures are designed using System Generator for DSP in the MATLAB Simulink environment and are implemented on a Kintex-7 FPGA with Vivado v16.2. The resource utilization of each aforementioned architecture after implementation is given in Table II in terms of LUTs and Flip-Flops (FFs). In addition, the critical path delays of the multiplier and the ReMB are expressed in terms of adder and multiplier operation times. Looking at Table II, it can be observed that the resource utilization of the proposed design is lower than that of the reference design in Fig. 5 (a), with a 25% reduction in LUT count. In FPGA implementations, multiplexer delays are not included in the path delay since the multiplexers are embedded into the LUTs; therefore it is only critical to consider the logic depth of the adders. The proposed design has a low logic depth since its adder depth is two, which reduces the critical path delay of the multiplication operation compared to the general purpose multiplier.
TABLE II. Filter resource utilization after implementation

IV. CONCLUSION
In this paper, a hardware-efficient implementation of the db4 wavelet and scaling filters is presented that employs a specifically designed ReMB. It is shown that the addition of multiplexers to shift-add networks provides reconfigurability to well-known constant multiplication blocks. By taking advantage of recent FPGA technologies with 6-input LUTs, 3:1 muxes are employed in the design of ReMBs at no additional hardware cost, which updates the techniques proposed in the state of the art. In order to evaluate the resource efficiency of the proposed structure, it is implemented on a Kintex-7 FPGA and compared to a reference design. As the results reported in this paper demonstrate, the proposed ReMB can decrease the overall hardware cost of a time-multiplexed filter by 25% compared to a general purpose multiplier. The low-cost and hardware-efficient structure of the proposed multiplier is suitable for DWT filter banks and can be used in low-cost embedded platforms for ambulatory physiological signal monitoring and analysis.
