



Area and Power Efficient Implementation of db4 Wavelet Filter 
Banks for ECG Applications Using Reconfigurable Multiplier 
Blocks
Eminaga, Y., Coskun, A. and Kale, I.
 
This is a copy of the author’s accepted version of a paper subsequently published in the 
proceedings of the 4th International Conference on Frontiers of Signal Processing 
(ICFSP 2018), Poitiers, France, 24 - 27 Sep 2018.
The final published version is available online at:
https://doi.org/10.1109/ICFSP.2018.8552046
© 2018 IEEE . Personal use of this material is permitted. Permission from IEEE must be 
obtained for all other uses, in any current or future media, including 
reprinting/republishing this material for advertising or promotional purposes, creating 
new collective works, for resale or redistribution to servers or lists, or reuse of any 
copyrighted component of this work in other works.
The WestminsterResearch online digital archive at the University of Westminster aims to make the 
research output of the University available to a wider audience. Copyright and Moral Rights remain 
with the authors and/or copyright owners.
Whilst further distribution of specific materials from within this archive is forbidden, you may freely 
distribute the URL of WestminsterResearch: (http://westminsterresearch.wmin.ac.uk/).
In case of abuse or copyright appearing without permission e-mail repository@westminster.ac.uk
Area and Power Efficient Implementation of db4
Wavelet Filter Banks for ECG Applications using
Reconfigurable Multiplier Blocks
Yaprak Eminaga, Adem Coskun, and Izzet Kale
Applied DSP and VLSI Research Group, University of Westminster, London, W1W 6UW, United Kingdom
Email: y.eminaga@my.westminster.ac.uk, a.coskun@westminster.ac.uk, kalei@westminster.ac.uk
Abstract—There is an increasing demand for wavelet-based
real-time on-node signal processing in portable medical devices,
which raises the need for reduced hardware size, cost and
power consumption. This paper presents an improved Reconfigurable
Multiplier Block (ReMB) architecture for an 8-tap Daubechies
wavelet filter employed in a tree-structured filter bank, targeting
recent Field-Programmable Gate Array (FPGA) technologies. The ReMB
replaces the expensive and power-hungry multiplier blocks as well
as the coefficient memories required in time-multiplexed finite
impulse response filter architectures. The proposed architecture
is implemented on a Kintex-7 FPGA, and the resource utilization,
maximum operating frequency and estimated dynamic power
consumption figures are reported and compared with the literature.
The results demonstrate that the proposed architecture reduces the
hardware utilization by 30% and improves the power consumption
by 44% in comparison to architectures with general purpose
multipliers. Thus, the proposed implementation can be deployed
in low-cost low-power embedded platforms for portable medical
devices.
Index Terms—biosignal processing, Discrete Wavelet Trans-
form, multiplierless, reconfigurable multiplier blocks, power
analysis.
I. INTRODUCTION
Wearable biosignal acquisition systems with embedded digital
signal processing capabilities have been a major research
focus in the wireless long-term health monitoring field.
On-sensor signal processing is crucial because of its potential
to reduce the system power consumption, produce robust and
autonomous results, and decrease the amount of data to be
transmitted [1]. The one-dimensional Discrete Wavelet Transform
(DWT) is used extensively in biomedical signal processing
applications, including denoising and feature extraction, and a
significant number of studies have aimed to minimize the
implementation cost and power dissipation of DWT filter banks
[2]–[4]. Most of this research concentrates on the multiplier-free
design of the wavelet filters, where either shift-add operations
[4] or distributed arithmetic [5] replace the constant
multiplications, i.e. the multipliers. In the authors’ previous
work [2], an efficient multiplier-free implementation of an 8-tap
Daubechies (db4) wavelet filter using a ReMB was presented,
and it was shown that the design reduced the hardware
complexity by 25% in comparison to a conventional fixed-point
implementation using a General Purpose (GP) multiplier. In
this paper, a new ReMB design that utilizes modified and
improved shift-add networks is proposed and implemented on an
FPGA, and its resource utilization and power dissipation figures
are presented. Section II introduces the preliminaries of the
method used to design and implement the ReMB specific to the
8-tap Daubechies filter coefficients, together with further
details of the ReMB structure. In Section III, resource
utilization and estimated power dissipation figures are reported
and compared with other db4 implementations from the open
literature. Finally, Section IV presents the conclusions.
II. METHOD
The concept of the ReMB relies on efficient use of the
Configurable Logic Blocks (CLBs) of Xilinx FPGAs, which are
composed of Look-Up Tables (LUTs), registers, fast carry-chain
logic and wide multiplexers [6]. These resources are used to
replace multiple constant multiplications with more hardware-
and power-efficient shift-add networks.
It is well known that a 1-bit full adder/subtracter (referred
to as an adder in the rest of this document) can be implemented
using the dedicated carry-chain logic and an LUT that implements
the remaining XOR gate. In [7], the authors proposed placing a
2-to-1 multiplexer (2:1 mux) before a 1-bit adder, with the
output of the mux connected to one of the adder inputs. This
multiplexer can be implemented inside the LUT already used for
the aforementioned XOR gate, so no additional hardware is
required. This configuration adds reconfigurability to the
design, since two different results can be obtained with the
same hardware resources. Each mux-adder combination is referred
to as a ‘basic structure’ and is presented in Fig. 1(a).
Examples of possible outcomes are also presented in Fig. 1,
where S is the operation output, S0 is the mux select line that
can be either 0 or 1, A, B0 and B1 are the inputs, and ‘<< a’
represents a hard-wired left shift by a bits that scales the
corresponding input. In this particular configuration S0 is also
used as the carry-in input, so that S0 = 0 and S0 = 1 lead to an
addition and a subtraction, respectively. The output and the
behaviour of the adder can be altered by making minor
modifications; further details regarding this method can be
found in [7]. Although this is a very efficient concept that can
still be used for replacing constant multiplications with
shift-add networks, it is out of date, since it was introduced
for the Xilinx 4-series FPGAs, which employed 4-input LUTs. In
[2], the authors extended this method to recent FPGAs, which
replace the 4-input LUTs with 6-input ones, allowing the 2:1
muxes to be replaced with 3:1 muxes at no additional cost. This
modification is presented in Fig. 1(b), where one of the mux
select lines, S0 or S1, is used for controlling the operation of
the adder and three different outcomes are obtained. This simple
modification adds more reconfigurability to the basic structure
and reduces the hardware complexity further, especially for high
filter orders. However, the method can be improved further by
using 4:1 muxes, which have the same hardware requirements as
the 3:1 muxes but offer one additional input port. A 4:1 mux
requires six input ports: two for the select lines and four for
the data inputs. Since the 7-series FPGAs provide only 6-input
LUTs, one extra input would require an additional LUT, which is
not desired. Therefore, one of the mux inputs must be shared
between two ports. Although input sharing might seem redundant,
it actually provides more reconfigurability, because the select
lines of the muxes are also used for switching the adder between
addition and subtraction, so a shared input can still yield two
different results. This enables full utilization of the FPGA
blocks, which in turn further reduces the hardware usage.
Fig. 1(c) demonstrates the proposed basic structure, which can
achieve four different outputs with three distinct inputs, as
opposed to the three outputs of the structure presented in [2].
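As a rough behavioural illustration of the 4:1 basic structure, the sketch below models the mux and the adder/subtracter at word level in Python. The function names, the port wiring and the shift amounts are hypothetical choices made for this example and are not the wiring of Fig. 1(c). The only properties taken from the text are that one data input is shared between two mux ports and that the least significant select bit also chooses between addition and subtraction.

```python
def basic_structure(a, mux_inputs, sel):
    """Word-level model of a 4:1 mux feeding an adder/subtracter.

    a          : value on the direct adder input
    mux_inputs : four values wired to mux ports 0..3
    sel        : 2-bit select value; its LSB doubles as the carry-in,
                 so an even sel adds and an odd sel subtracts
    """
    if sel not in (0, 1, 2, 3):
        raise ValueError("sel must be a 2-bit value")
    b = mux_inputs[sel]
    return a - b if sel & 1 else a + b


def example_multipliers(x):
    """Four outputs from three distinct shifted copies of x (assumed wiring)."""
    a = x << 2                       # hard-wired shift on the direct input: 4x
    b0, b1, b2 = x, x << 1, x << 3   # three distinct mux inputs: x, 2x, 8x
    ports = (b0, b0, b1, b2)         # b0 is shared between ports 0 and 1
    return [basic_structure(a, ports, sel) for sel in range(4)]


print(example_multipliers(1))  # [5, 3, 6, -4]
```

With x = 1 the four select values give the constant multipliers 5, 3, 6 and -4, so in this assumed wiring the shared input contributes two different results.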
























































Fig. 1. Basic structures with (a) 2:1 mux [7] (b) 3:1 mux [2] and (c) 4:1
mux.
TABLE I
FIXED-POINT (11-BIT) db4 WAVELET FILTER COEFFICIENTS, THEIR ADDER





Coefficient     Z (x 2^10)   Odd part of Z   Adder depth   Shift-add form
0.0107421875    11           11              2             2^2(2 + 1) - 1
0.033203125     34           17              1             2(2^4 + 1)
0.03125         32           1               0             2^5
0.1875          192          3               1             2^6(2 + 1)
0.0283203125    29           29              2             2^2(4 + 1) + (2^3 + 1)
0.630859375     646          323             2             2(2^6(4 + 1) + (2 + 1))
0.71484375      732          183             2             2^2(2^6(2 + 1) - (2^3 + 1))
0.23046875      236          59              2             2^2(2^2(16 - 1) - 1)
the basic structures, given in Fig. 1(c), in chain (i.e.
horizontally cascaded) and tree forms (i.e. inputs of a mux
connected to the output of another basic structure). For example,
a cascade of two basic structures provides an output set of 16
(4 × 4) coefficients. Thus, for a set of 8 coefficients, two basic
structures are sufficient in terms of count. However, the number
of adders required by each coefficient is also a vital parameter
when designing a ReMB: it determines the minimum number of adders
in each path from the input to the output of the ReMB, as well as
the quantity of basic structures to be used. Thus, both the number
of coefficients and the minimum number of adders for each
coefficient must be determined.
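The 4 × 4 count for a chain of two basic structures can be illustrated with a small enumeration. This is only a sketch: the helper names, the shift amounts and the port wiring are hypothetical and were chosen so that all select combinations give different constants, which a real design does not guarantee.

```python
from itertools import product


def stage(a, ports, sel):
    """4:1 mux plus adder/subtracter; odd select values subtract."""
    b = ports[sel]
    return a - b if sel & 1 else a + b


def cascade_multipliers():
    """Enumerate the constants produced by two chained structures for x = 1."""
    x = 1
    results = {}
    for s0, s1 in product(range(4), repeat=2):
        y = stage(x << 2, (x, x, x << 1, x << 3), s0)   # first structure
        z = stage(y << 4, (x, x, x << 1, x << 3), s1)   # second structure
        results[(s0, s1)] = z
    return results


table = cascade_multipliers()
print(len(table), len(set(table.values())))  # 16 16
```

Two 2-bit select values give 16 configurations, which is more than the eight coefficients needed here; the adder requirement of each coefficient is what pushes the design beyond two structures.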
A. Design of Reconfigurable Multiplier Block for db4 filter
coefficients
The coefficient word-length plays a significant role in the
design of ReMBs, since the structure of the multiplier block
depends on the desired coefficient precision. Longer word-lengths
result in an increased number of adders and thus higher
resource utilization. On the other hand, an insufficient number
of bits deteriorates the filter characteristics and operation.
To find the minimum coefficient word-length that retains the
desired wavelet filter characteristics, the impact of the
precision loss is measured by comparing the floating-point and
fixed-point filter responses in terms of quantization noise
power. Based on this comparison, the filter coefficients in this
study are quantized to 11 bits (10 fractional bits and one sign
bit), which results in a filter response mismatch of -68 dB and
retains the filter operation as well as the scaling and wavelet
functions. Prior to designing the ReMB, the quantized
coefficients are scaled by 2^10 in order to obtain integer (Z)
values, which are listed in Table I.
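The quantization and scaling step can be reproduced with a few lines of Python. The sketch below starts from the standard db4 lowpass decomposition coefficients (written to 12 decimals), scales them by 2^10 and checks the integers, their odd parts and one shift-add decomposition per magnitude. Round-to-nearest is an assumption, since the text does not state the rounding mode; it does reproduce the listed integers. All names are illustrative.

```python
FRAC_BITS = 10

# Standard db4 lowpass decomposition coefficients, given to 12 decimals.
DB4_LOWPASS = [
    -0.010597401785, 0.032883011667, 0.030841381836, -0.187034811719,
    -0.027983769417, 0.630880767930, 0.714846570553, 0.230377813309,
]


def quantize(coeffs, frac_bits=FRAC_BITS):
    """Scale by 2**frac_bits and round to the nearest integer (assumed rounding)."""
    return [round(c * (1 << frac_bits)) for c in coeffs]


def odd_part(z):
    """Strip factors of two, which cost only hard-wired shifts in hardware."""
    z = abs(z)
    while z and z % 2 == 0:
        z //= 2
    return z


# One shift-add decomposition per coefficient magnitude, written with shifts.
SHIFT_ADD = {
    11: ((2 + 1) << 2) - 1,
    34: ((1 << 4) + 1) << 1,
    32: 1 << 5,
    192: (2 + 1) << 6,
    29: ((4 + 1) << 2) + ((1 << 3) + 1),
    646: (((4 + 1) << 6) + (2 + 1)) << 1,
    732: (((2 + 1) << 6) - ((1 << 3) + 1)) << 2,
    236: (((16 - 1) << 2) - 1) << 2,
}

z_values = quantize(DB4_LOWPASS)
print(z_values)                         # [-11, 34, 32, -192, -29, 646, 732, 236]
print([odd_part(z) for z in z_values])  # [11, 17, 1, 3, 29, 323, 183, 59]
print(all(k == v for k, v in SHIFT_ADD.items()))  # True
```

Dividing each integer by 2^10 gives back the 11-bit fixed-point values, for example 646 / 1024 = 0.630859375.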
The lowpass and highpass db4 filters employed in both the
decomposition and reconstruction Filter Banks (FBs) are power
complementary, which means that both filters have the same
coefficients but with alternating signs. Thus, there are only
eight distinct coefficients and the same ReMB structure can
be used for both filters, with an additional 2:1 mux at the ReMB
output to select between positive and negative coefficients.
First of all, the adder depth, which is the minimum number of
adder steps required to realize a coefficient [7], is calculated
for all eight coefficients; the values are listed in Table I.
Although the maximum adder depth is two, a minimum of three
adders is required to obtain each coefficient, thus three basic
structures are interconnected and the least significant bits of
the mux select lines are used as carry-ins to control the adder
operation. The final structure of the ReMB is presented in Fig. 2.
Fig. 2. The ReMB designed for db4 wavelet filters.
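The claim that one set of eight magnitudes serves both filters can be checked numerically. The sketch below derives a highpass filter from the integer lowpass coefficients of Table II using the common alternating-flip relation. The exact index mapping and sign convention are assumptions, as the text only states that the signs alternate.

```python
LOWPASS_Z = [-11, 34, 32, -192, -29, 646, 732, 236]  # integer values from Table II


def highpass_from_lowpass(lo):
    """Alternating flip: hi[n] = (-1)**(n + 1) * lo[N - 1 - n] (assumed convention)."""
    n_taps = len(lo)
    return [(-1) ** (n + 1) * lo[n_taps - 1 - n] for n in range(n_taps)]


hi = highpass_from_lowpass(LOWPASS_Z)
print(hi)  # [-236, 732, -646, -29, 192, 32, -34, -11]
print(sum(l * h for l, h in zip(LOWPASS_Z, hi)))  # 0
```

Both filters use the same eight magnitudes, so only a sign selection is needed on top of the shared multiplier block, and the zero inner product confirms that the two filters are orthogonal.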
B. DWT Filter Bank Architecture
Biomedical signals have frequency bands of up to a few kHz,
hence they require comparatively low operating frequencies.
Recent FPGAs can be clocked orders of magnitude faster, so
time-multiplexed architectures can easily be used, and in this
way the hardware utilization of the DWT FB can be reduced
substantially. A conventional time-multiplexed Tap-Delay Line
(TDL) filter is composed of an input memory, a coefficient memory
and a single Multiply-Accumulate unit with a GP multiplier [2],
whereas in this study the GP multiplier and the coefficient
memory are replaced by the proposed ReMB given in Fig. 2. Here,
two tree-structured 1-level analysis FBs are implemented, in
which time-multiplexed TDL Finite Impulse Response (FIR) filter
structures employing either a GP multiplier or the proposed ReMB
are used for the lowpass (g(k)) and the highpass (h(k)) db4
filters. The architecture of the FIR filter with a ReMB and
the implemented FB are shown in Fig. 3. In the case of db4
filters, the time-multiplexed TDL architecture operates
sequentially and the filter provides an output once every eight
clock cycles. The proposed ReMB produces the intermediate
product of the input and one coefficient at each clock cycle,
which eliminates the need for a coefficient memory. For the
implementation, quantization is not applied to the internal
arithmetic of the ReMB in order to retain the highest precision.
The filter output is then truncated to discard the fractional
part and scaled down by 2^-10. For validating the proposed
structure, 8-bit ElectroCardioGram (ECG) data obtained from the
MIT-BIH Arrhythmia database are fed to the FB, where the ReMB,
accumulator and filter outputs are 19, 20 and 10 bits wide,
respectively.
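A behavioural sketch of the sequential multiply-accumulate operation is given below. It is a simplified model under stated assumptions: the product `c * taps[k]` stands in for the ReMB output, word lengths and downsampling are not modelled, and the function and variable names are hypothetical. Python's right shift on negative integers floors the result, which corresponds to two's complement truncation.

```python
COEFFS_Z = [-11, 34, 32, -192, -29, 646, 732, 236]  # lowpass integers from Table II
FRAC_BITS = 10


def tdl_fir(samples, coeffs=COEFFS_Z, frac_bits=FRAC_BITS):
    """Sequential MAC model: len(coeffs) clock cycles per output sample."""
    taps = [0] * len(coeffs)          # input memory (tap-delay line)
    outputs = []
    cycles = 0
    for x in samples:
        taps = [x] + taps[:-1]        # shift the new sample in
        acc = 0
        for k, c in enumerate(coeffs):
            acc += c * taps[k]        # one intermediate product per clock cycle
            cycles += 1
        outputs.append(acc >> frac_bits)   # scale by 2**-10 and drop the fraction
    return outputs, cycles


out, cycles = tdl_fir([64, 0, 0, 0, 0, 0, 0, 0])
print(out)     # [-1, 2, 2, -12, -2, 40, 45, 14]
print(cycles)  # 64
```

Eight input samples take 64 cycles in this model, i.e. one output every eight cycles, and the impulse response reproduces the integer coefficients divided by 16 and truncated.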
TABLE II
SELECT LINE (S0:S4) VALUES FOR MULTIPLEXERS GIVEN IN FIGS. 2 AND
3 TO GENERATE THE LOWPASS ANALYSIS FILTER COEFFICIENTS.
Coeff.   Z      S0   S1   S2   S3   S4-g
h0       -11    1    1    1    0    1
h1       34     0    0    1    1    1
h2       32     3    X    3    0    1
h3       -192   3    1    2    0    0
h4       -29    2    0    0    0    1
h5       646    0    0    2    1    0
h6       732    2    1    2    2    1
h7       236    1    3    1    2    0
C. Controller
The controller is simply an up-counter followed by a decoder:
the counter generates the addresses that control the input
memory, and the decoder decodes these addresses to generate the
mux select lines. Table II presents the select line values
required to generate the lowpass filter (g(k)) coefficients.
S0:S3 correspond to the select lines of the muxes given in
Fig. 2 and S4-g is the select line of the mux at the output of
the ReMB given in Fig. 3. Select line values 0, 1, 2 and 3
choose the mux inputs from top to bottom, with 0 and 3 selecting
the top and bottom input, respectively. X denotes a Don’t Care,
indicating that the corresponding mux and adder are not employed
in the generation of that coefficient.
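The counter-plus-decoder arrangement can be sketched as a lookup table addressed by a modulo-8 counter. The select values below are copied from Table II, while the names `SELECT_ROM` and `controller` and the use of `None` for the Don't Care entry are choices made only for this illustration.

```python
# Select-line values (S0, S1, S2, S3, S4-g) copied from Table II;
# None stands for the Don't Care entry.
SELECT_ROM = [
    (1, 1, 1, 0, 1),     # h0, Z = -11
    (0, 0, 1, 1, 1),     # h1, Z = 34
    (3, None, 3, 0, 1),  # h2, Z = 32
    (3, 1, 2, 0, 0),     # h3, Z = -192
    (2, 0, 0, 0, 1),     # h4, Z = -29
    (0, 0, 2, 1, 0),     # h5, Z = 646
    (2, 1, 2, 2, 1),     # h6, Z = 732
    (1, 3, 1, 2, 0),     # h7, Z = 236
]


def controller(num_cycles):
    """Yield (address, select lines) pairs from a modulo-8 up-counter."""
    count = 0
    for _ in range(num_cycles):
        yield count, SELECT_ROM[count]
        count = (count + 1) % len(SELECT_ROM)


steps = list(controller(10))
print(steps[0])  # (0, (1, 1, 1, 0, 1))
print(steps[9])  # (1, (0, 0, 1, 1, 1))
```

The counter wraps after eight addresses, so the same address can index the input memory and select the coefficient generated in that clock cycle.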
III. HARDWARE VALIDATION AND COST ASSESSMENT
For hardware validation, cost assessment and performance
evaluation, the aforementioned filter bank architectures are
designed using the System Generator for DSP in the
Matlab/Simulink environment and are synthesized and implemented
on a Kintex-7 (xc7k325tffg900) FPGA with Vivado v16.2. To
validate the performance of the system, 8-bit ECG records from
the MIT-BIH Arrhythmia database are used. The resource
utilization of both filter banks is presented in Table III.
















Fig. 3. One level analysis filter bank comprised of a lowpass (g(k)) and
highpass (h(k)) time-multiplexed TDL filters as well as the input memory
and the controller.
TABLE III










Design                    AIQ(1) [4]  FP(2) [4]   -           Scheme 1 [8]  Scheme 2 [8]  CFB(3) with GP    Proposed (ReMB)
Architecture              Matrix      Matrix      DA(4)       Lifting       Lifting       Time-multiplexed  Time-multiplexed
Input word length (bits)  8           8           8           8             8             8                 8
Adders                    22          27          -           27            24            1                 4
Multipliers               0           0           0           0             0             1                 0
LUTs                      734         692         614         470           389           309               218
Registers                 180         -           -           133           101           158               158
Max. Frequency (MHz)      69          -           149.3       63            112           160               164
Power (mW)(5)             6.8         8.1         -           -             -             3.026             2.097
Device                    Cyclone II  Cyclone II  Stratix II  Virtex-6      Virtex-6      Kintex-7          Kintex-7
(1) Algebraic Integer Quantization (AIQ) (2) Finite-Precision (FP) (3) Conventional Filter Bank (CFB) (4) Distributed Arithmetic (DA)
(5) Measured at 50 MHz
In addition, Table III presents and compares the resource
utilization, maximum clock frequency and dynamic power
consumption figures (where available) of other multiplier-free
db4 analysis filter bank implementations from the open literature
along with the proposed ReMB implementation. System power
consumption is estimated at a clock speed of 50 MHz in order to
have fair comparisons with the literature, and the Xilinx Power
Estimator tool is used for more accurate analysis. In [4],
Wahid et al. presented a matrix-based AIQ-mapped and a
conventional fixed-point 1-level decomposition db4 filter bank
architecture. The hardware cost was listed as 734 and 692
LUTs for the AIQ- and FP-based implementations, respectively,
for coefficients with 10-bit precision. A more recent study
by Hasan et al. [8] proposed two lifting-based structures of the
db4 wavelet filters. There, the filter coefficients were divided
into lifting steps and shift-add networks were used for
implementing two 1-level decomposition FBs without multipliers.
The resource utilizations of the two FBs were reported as 470
LUTs and 133 registers for Scheme 1, and 389 LUTs and 101
registers for Scheme 2. In this work, two 1-level decomposition
FBs are implemented, where the one employing a GP multiplier
serves as a reference design. The ReMB-based filter bank employs
218 LUTs and 158 registers, whereas the GP multiplier based
FB employs 309 LUTs and 158 registers. When compared to the
literature and the reference design, the proposed ReMB-based FB
uses the fewest LUTs and achieves the highest maximum operating
frequency. In [4], the power consumption figures of the two
implemented architectures were also presented as 6.8 and 8.1 mW,
respectively, whereas the power consumption of the proposed
ReMB-based design and the GP multiplier based reference is 2.097
and 3.026 mW, respectively. The use of the ReMB thus improves
the dynamic power consumption by 44% compared to the GP
multiplier based FB and yields the lowest figure among the
compared designs.
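The headline percentages can be related to the Table III entries with simple arithmetic. The helper names below are hypothetical, and which ratio underlies each reported figure is an assumption: the LUT saving relative to the GP design comes out close to 30%, while the 44% power figure appears to match the GP power expressed relative to the ReMB power.

```python
def percent_reduction(reference, proposed):
    """Saving of `proposed` relative to `reference`, in percent."""
    return 100.0 * (reference - proposed) / reference


def percent_increase(proposed, reference):
    """How much larger `reference` is than `proposed`, in percent."""
    return 100.0 * (reference - proposed) / proposed


GP_LUTS, REMB_LUTS = 309, 218              # LUT counts from Table III
GP_POWER_MW, REMB_POWER_MW = 3.026, 2.097  # dynamic power at 50 MHz, Table III

print(round(percent_reduction(GP_LUTS, REMB_LUTS), 1))          # 29.4
print(round(percent_reduction(GP_POWER_MW, REMB_POWER_MW), 1))  # 30.7
print(round(percent_increase(REMB_POWER_MW, GP_POWER_MW), 1))   # 44.3
```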
IV. CONCLUSION
In this paper an area and power efficient multiplierless
filter architecture to be employed in both decomposition and
reconstruction wavelet filter banks is presented. The proposed
architecture is designed as a time-multiplexed TDL FIR filter
in which the multiplier is replaced with a ReMB that is
implemented via hard-wired shifts, adders and multiplexers. In
order to evaluate resource efficiency and power consumption,
the proposed architecture is implemented on a Kintex-7 FPGA
and compared to a reference design implemented using a
standard parallel multiplier and to the designs existing in the
open literature as detailed in Table III. As the results reported
in Table III demonstrate, the proposed ReMB decreases the
overall hardware cost by 30% and improves the dynamic
power consumption by 44% compared to a GP multiplier
architecture. The low-cost and hardware efficient structure of
the proposed multiplier is suitable for DWT filter banks and
may be used in low-cost embedded platforms for ambulatory
physiological signal monitoring and analysis.
ACKNOWLEDGMENT
The authors wish to thank the University of Westminster
Faculty of Science and Technology for the PhD Studentship.
REFERENCES
[1] A. J. Casson, “Opportunities and challenges for ultra low power signal
processing in wearable healthcare,” in Signal Processing Conference
(EUSIPCO), 2015 23rd European. IEEE, 2015, pp. 424–428.
[2] Y. Eminaga, A. Coskun, and I. Kale, “Multiplier Free Implementation of
8-tap Daubechies Wavelet Filters for Biomedical Applications,” in 2017
New Generation of CAS (NGCAS). IEEE, 2017, pp. 129–132.
[3] P. Longa, A. Miri, and M. Bolic, “A flexible design of filterbank
architectures for discrete wavelet transforms,” in IEEE International
Conference on Acoustics, Speech and Signal Processing, 2007., vol. 3.
IEEE, 2007, pp. III–1441.
[4] K. A. Wahid, M. A. Islam, and S.-B. Ko, “Lossless implementation of
Daubechies 8-tap wavelet transform,” in IEEE International Symposium
on Circuits and Systems (ISCAS), 2011. IEEE, 2011, pp. 2157–2160.
[5] A. M. Al-Haj, “Fast discrete wavelet transformation using FPGAs and
distributed arithmetic,” International Journal of Applied Science and
Engineering, vol. 1, no. 2, pp. 160–171, 2003.
[6] Xilinx, “7 Series FPGAs configurable logic block,” User Guide, San Jose,
CA, vol. 1, 2016.
[7] S. S. Demirsoy, I. Kale, and A. Dempster, “Reconfigurable Multiplier
Blocks: Structures, Algorithm and Applications,” Circuits, Systems, and
Signal Processing, vol. 26, no. 6, pp. 793–827, 2007.
[8] M. M. Hasan and K. A. Wahid, “Low-Cost Lifting Architecture and
Lossless Implementation of Daubechies-8 Wavelets,” IEEE Transactions
on Circuits and Systems I: Regular Papers, 2018.
