Algorithm/Architecture Co-design of Proportionate-type LMS Adaptive
  Filters for Sparse System Identification by Mula, Subrahmanyam et al.
arXiv:1703.10658v1 [cs.OH] 17 Mar 2017
Algorithm/Architecture Co-design of
Proportionate-type LMS Adaptive Filters for Sparse
System Identification
Subrahmanyam Mula, Vinay Chakravarthi Gogineni, Anindya Sundar Dhar, Member, IEEE
Abstract
This paper investigates the problem of implementing the proportionate-type LMS family of algorithms in hardware for sparse
adaptive filtering applications, especially network echo cancelation. We derive a re-formulated proportionate-type algorithm
through an algorithm-architecture co-design methodology that can be pipelined and has an efficient architecture for hardware
implementation. We study the convergence, steady-state and tracking performance of these re-formulated algorithms for white,
colored and speech inputs before implementing them in hardware. To the best of our knowledge, this is the first attempt to implement
proportionate-type algorithms in hardware. We show that the Delayed µ-law Proportionate LMS (DMPLMS) algorithm for white input
and the Delayed Wavelet MPLMS (DWMPLMS) algorithm for colored input are robust VLSI solutions for network echo cancelation, where
the sparsity of the echo paths can vary with time. We implemented all the designs in hardware using a 16-bit fixed-point representation
and synthesized them. The synthesis results show that the DMPLMS algorithm, with a ≈ 25% increase in hardware over
the conventional DLMS architecture, achieves a 3X improvement in convergence rate for white input, and the DWMPLMS algorithm, with
a ≈ 58% increase in hardware, achieves a 15X improvement in convergence rate for correlated input conditions.
Index Terms
Adaptive filters, Proportionate-type algorithms, Wavelet transform, Mean square deviation, Logarithmic number system, Network
Echo Cancelation, Algorithm-Architecture co-design, VLSI architectures
I. INTRODUCTION
Many real-life systems, such as network echo cancelation [1], underwater communication [2] and HDTV terrestrial transmission [3],
exhibit impulse responses that are often sparse. Sparse system identification [4] became an important research
area in the last decade with the invention of proportionate-type algorithms. The first member of the proportionate-type LMS family of
algorithms is the Proportionate Normalized Least Mean Square (PNLMS) algorithm [5], which updates the filter coefficients by
assigning a gain proportional to the magnitude of the current coefficient. The PNLMS algorithm has been shown to outperform
the LMS and NLMS [6] algorithms when operating on a sparse impulse response.
However, the performance of the PNLMS algorithm degrades and becomes worse than that of the NLMS algorithm when the impulse response
is dispersive. Several improved PNLMS algorithms [7]-[9] were proposed in the literature to address this issue and to make the
algorithms more robust against varying sparsity.
Another set of algorithms was designed by seeking a condition to achieve the fastest overall convergence, in which all coefficients
reach the ε-vicinity of their true values simultaneously (where ε is a small positive number). This approach results in the µ-law
PNLMS (MPNLMS) [10] and its variant, the ε-law PNLMS (EPNLMS) [11]. The MPNLMS algorithm addresses the issue of
assigning too much update gain to large coefficients, which occurs in the PNLMS algorithm. Even so, the MPNLMS convergence
rate becomes prohibitively slow for correlated input conditions such as speech. Wavelet MPNLMS (WMPNLMS) [12] is
designed to address this issue by de-correlating the input while preserving the sparseness of the impulse response.
As we can see, significant research effort has been dedicated to the development of high-performance adaptive algorithms
based on proportionate adaptation. However, much less is known about their optimized implementation in dedicated hardware
because of the huge computational penalty. In this paper we try to address this gap. To the best of our knowledge, this is the
first attempt to implement proportionate-type algorithms in hardware. We make several reformulations to the original PNLMS
algorithm to make it VLSI friendly, and the reformulated algorithm is implemented in hardware. Our main contributions include:
1) We study various proportionate-type LMS algorithms and their VLSI implementation aspects in detail.
2) We propose DMPLMS for white input and the DWMPLMS algorithm for colored input through algorithm-architecture co-design,
and we show that the loss in performance entailed by the complexity reduction is negligible.
3) We propose a novel multiplier-less 3-level sliding Haar wavelet transform for the DWMPLMS algorithm, which exploits the
redundancies that exist in the wavelet computation of streaming input samples.
4) As a proof of concept, we provide a synthesis study in 180nm CMOS technology of the proposed DPLMS, DMPLMS
and DWMPLMS architectures and compare the improvement in rate of convergence against hardware efficiency.
The authors are with the Department of Electronics and Electrical Communication Engineering, Indian Institute of Technology (IIT) Kharagpur,
West Bengal 721302, India. Email: svmula@iitkgp.ac.in.
The rest of the paper is organized as follows. In the next section, we explain the motivation and formulate the problem. In
Section III, we propose the algorithmic re-formulations and the corresponding VLSI architecture. Section IV deals with the
transform domain proportionate type algorithms and their VLSI implementation aspects. Simulation results are presented in
Section V and ASIC synthesis results are presented in Section VI and we conclude the paper in Section VII.
II. BACKGROUND AND MOTIVATION
A. Review of Proportionate-type NLMS (Pt-NLMS) Algorithms
Consider the problem of identifying an unknown sparse system modeled by the $L$-tap coefficient vector $\mathbf{w}_{opt}$, which takes a
signal $u(n)$ with variance $\sigma_u^2$ as input and produces the observable output $d(n) = \mathbf{w}_{opt}^T\mathbf{u}(n) + v(n)$, where
$\mathbf{u}(n) = [u(n), u(n-1), \ldots, u(n-L+1)]^T$ is the input regressor and $v(n)$ is an observation noise with variance $\sigma_v^2$, which is assumed to be
white and independent of $u(m)$ for all $m, n$. The Pt-NLMS algorithm then iteratively updates the filter coefficient vector
$\mathbf{w} = [w_0, w_1, \ldots, w_{L-1}]^T$ as follows:
$$\mathbf{w}(n+1) = \mathbf{w}(n) + \mu\,\frac{\mathbf{G}(n)\,\mathbf{u}(n)\,e(n)}{\mathbf{u}^T(n)\,\mathbf{G}(n)\,\mathbf{u}(n) + \delta_p}, \tag{1}$$
where the estimated error is $e(n) = d(n) - y(n)$, the filter output is $y(n) = \mathbf{w}^T(n)\mathbf{u}(n)$, and $\mu$ is the adaptation step size. $\delta_p$ is the
regularization parameter and $\mathbf{G}(n)$ is the gain matrix. This gain matrix is the key factor that distinguishes Pt-NLMS
from conventional NLMS. The gain matrix is diagonal, i.e., $\mathbf{G}(n) = \mathrm{diag}(g_0(n), g_1(n), \ldots, g_{L-1}(n))$, where $g_i(n) \propto |w_i(n)|$
is the gain factor for tap $i$. For Proportionate NLMS, $g_i(n)$ is evaluated as follows:
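$$g_i(n) = \frac{\gamma_i(n)}{\frac{1}{L}\sum_{i=0}^{L-1}\gamma_i(n)}, \quad 0 \le i \le (L-1), \tag{2}$$
where
$$\gamma_i(n) = \max\big[\rho\,\gamma_{\min}(n),\ F[|w_i(n)|]\big], \tag{3}$$
$$\gamma_{\min}(n) = \max\big(\delta,\ F[|w_0(n)|], \ldots, F[|w_{L-1}(n)|]\big), \tag{4}$$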
and $F[|w_i(n)|] = |w_i(n)|$ for PNLMS. Here δ prevents the coefficients from stalling during the initialization stage, when all
the coefficients are reset to 0. The parameter ρ ensures a minimum gain for inactive coefficients. If the current
magnitude of a coefficient is large, a large step size is assigned to it, whereas a small coefficient receives a proportionately
small step size. In this way the algorithm emphasizes the large coefficients to speed up their convergence, and it therefore demonstrates very fast
initial convergence for sparse impulse responses. However, this performance improvement comes at the cost of complexity. In
their 2004 Freescale application note [13], Dybe et al. felt that, because of the significant computational penalty imposed by the
modified adaptation formula combined with the generation of the matrix G(n), the PNLMS algorithm appears to be most suitable
for implementation in an ASIC technology. However, Jie Chen, in his 2008 PhD dissertation [14], stated that 'The high
complexity associated with both PNLMS and IPNLMS algorithms makes them unsuitable for hardware implementation'. In the
light of these two observations, we carry out a detailed complexity analysis of the PNLMS algorithm to show its inapplicability
for VLSI implementation in its native form.
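To make the data flow of Eqs. (1)-(4) concrete, the following is a minimal floating-point sketch of one PNLMS iteration in plain Python. It is only an illustration, not the implementation studied in this paper: the function names `pnlms_gains` and `pnlms_step` and all default parameter values (in particular `delta_p`) are assumptions chosen for the example.

```python
def pnlms_gains(w, rho=0.01, delta=0.01):
    """Gain factors g_i(n) of Eqs. (2)-(4) with F[|w_i|] = |w_i| (PNLMS)."""
    f = [abs(wi) for wi in w]
    gamma_ref = max([delta] + f)                    # Eq. (4): max over delta and all |w_i|
    gamma = [max(rho * gamma_ref, fi) for fi in f]  # Eq. (3): floor for inactive taps
    mean_gamma = sum(gamma) / len(w)
    return [gi / mean_gamma for gi in gamma]        # Eq. (2): gains average to one


def pnlms_step(w, u, d, mu=0.7, rho=0.01, delta=0.01, delta_p=0.01):
    """One Pt-NLMS iteration, Eq. (1). Returns (w(n+1), e(n)).

    delta_p is a hypothetical regularization value picked for illustration.
    """
    e = d - sum(wi * ui for wi, ui in zip(w, u))    # e(n) = d(n) - w^T(n) u(n)
    g = pnlms_gains(w, rho, delta)
    norm = sum(gi * ui * ui for gi, ui in zip(g, u)) + delta_p  # u^T G u + delta_p
    w_next = [wi + mu * gi * ui * e / norm for wi, gi, ui in zip(w, g, u)]
    return w_next, e
```

Every quantity in `pnlms_step` depends on the current weights, so the max search, the normalization and the weight update must all finish within one sample period. This serial dependency is what the complexity analysis below quantifies.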
B. Complexity Analysis
There are four main tasks in each iteration of PNLMS, which are elaborated below. All of these steps must
be completed for one iteration before proceeding to the next iteration.
1) Filter output calculation: This step computes the inner product of the input vector with the weight vector. It
requires L multiplications, which can happen in parallel, and an adder tree with L adders to sum the partial results,
and has a time complexity of $T_{mult} + \log_2(L)\,T_{add}$.
2) Weighted Euclidean norm calculation: This step calculates the vector-matrix-vector product $\mathbf{u}^T(n)\mathbf{G}(n)\mathbf{u}(n) + \delta_p$ in
each iteration, which requires 2L multiplications and L additions and has a time complexity of $2T_{mult} + (1 + \log_2(L))\,T_{add}$.
3) Gain calculation: This step has two max-finding steps, one multiplication, one division, one F(.) evaluation and a summation
of L terms, and has a time complexity of $2\log_2(L)\,T_{cmp} + T_{mult} + T_{div} + T_{Feval}$.
4) Weight update: This step requires L multiplications and L additions, all of which can happen in parallel. Thus the time
complexity is $T_{mult} + T_{add}$.
The complexities of all the steps are summarized in Table I and Table II. The time complexities of all the steps add
up along the critical path, which is given by
$$T_{crit} = 5T_{mult} + (2 + 3\log_2(L))\,T_{add} + 2\log_2(L)\,T_{cmp} + T_{div} + T_{Feval}.$$
This severely limits the applicability of the original PNLMS algorithm for real-time VLSI implementations, which have stringent
throughput requirements. In order to achieve an efficient VLSI implementation, actions must be taken at the algorithm level
to reduce the complexity of PNLMS (without compromising the performance) before going into the
architecture design. We discuss these simplifications in the next section.
Table I: Area Complexity of PNLMS
Step                     Mult   Div   Add   Cmp
Filter output            L      0     L     0
Weighted normalization   2L     0     L     0
Weight update            2L     1     L     0
Gain calculation         2      1     L     2L
Table II: Time Complexity of PNLMS
Step                     Critical path
Filter output            Tmult + log2(L)Tadd
Weighted normalization   2Tmult + (1 + log2(L))Tadd
Weight update            Tmult + Tadd
Gain calculation         2log2(L)Tcmp + Tmult + Tdiv + TFeval
III. ALGORITHMIC RE-FORMULATIONS AND PROPOSED ARCHITECTURE
In this section we make several reformulations to the original PNLMS algorithm to make it VLSI friendly without compro-
mising the performance of the algorithm. To ensure this, we compare the performance of the algorithm after each simplification
with that of the original algorithm and finally we design a low complexity VLSI architecture for the re-formulated algorithm.
A. Simplified Gain Calculation
The first simplification is in the gain calculation. For the simplified PNLMS algorithm, the gain factors
are evaluated as follows:
$$g_i(n) = \frac{\gamma_i(n)}{\sum_{i=0}^{L-1}\gamma_i(n)}, \quad 0 \le i \le (L-1), \tag{5}$$
where
$$\gamma_i(n) = F[(w_i(n) + \rho)], \tag{6}$$
and
$$F[(w_i(n))] = |w_i(n)|. \tag{7}$$
The parameter ρ is a small positive constant added to avoid stalling when all the tap weights are zero at reset and also
to ensure a minimum gain for the inactive coefficients. Even with this simplified gain calculation, the adaptation
gain is proportional to the magnitude of the filter tap at time index n. The simplified gain calculation avoids the
max functions employed in the original PNLMS and thereby reduces the time complexity. We can use an adder tree
structure to add the absolute weights of all the taps and obtain the denominator of Eq. (5); the reciprocal of this sum
can then be fed to all the taps for the proportional gain calculation.
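The simplified gain calculation can be sketched as follows. This is a floating-point illustration only; it assumes that Eq. (6) is read as γ_i(n) = |w_i(n)| + ρ, and the helper names `adder_tree_sum` and `simplified_gains` are hypothetical. The adder tree is modeled as a pairwise reduction so that the number of stages is visible.

```python
def adder_tree_sum(values):
    """Sum a non-empty list by pairwise reduction; returns (total, number of stages)."""
    level = list(values)
    stages = 0
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # odd element is passed to the next stage
            nxt.append(level[-1])
        level = nxt
        stages += 1
    return level[0], stages


def simplified_gains(w, rho=0.001):
    """Gains of Eq. (5), assuming gamma_i = |w_i| + rho (one reading of Eq. (6))."""
    gamma = [abs(wi) + rho for wi in w]
    total, _ = adder_tree_sum(gamma)
    inv_total = 1.0 / total         # single reciprocal shared by all taps
    return [gi * inv_total for gi in gamma]
```

For a power-of-two number of taps the reduction takes log2(L) stages, and no comparison is needed anywhere, in contrast to the two max searches of Eqs. (3) and (4).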
B. Proportionate LMS
Unlike in NLMS, the normalization in the denominator of the PNLMS update is a weighted normalization of the input vector,
and the weight matrix (G) changes in each iteration. We therefore cannot compute the normalization term recursively, and calculating
this vector-matrix-vector product afresh in each iteration requires 2L multiplications and L additions. This adds significant area
and time complexity (the complexity grows with L, and L is generally large for real-time applications) to the already complex
proportionate adaptation. To alleviate this problem, we proposed PLMS [15] (which is similar to LMS) by removing this
weighted normalization and analyzed its convergence performance. It is shown there that the PLMS algorithm is stable for
0 < µ < 2 for a white input with zero mean and unit variance and is able to perform as well as the original PNLMS. The PLMS
update equation is given by
$$\mathbf{w}(n+1) = \mathbf{w}(n) + \mu\,\mathbf{G}(n)\,\mathbf{u}(n)\,e(n), \tag{8}$$
where the gain matrix is calculated as in the simplified PNLMS algorithm. We now compare the performance of the re-formulated
algorithm (with simplified gain calculation and without normalization) with that of the original PNLMS. For
this, we consider a sparse system identification problem and use the sparseness measure
$$S_m = \frac{L}{L - \sqrt{L}}\left(1 - \frac{\|h\|_1}{\sqrt{L}\,\|h\|_2}\right)$$
to characterize the sparse system. The unknown system has length 512, of which only 64 taps are active (Sm = 0.8637).
The learning curves (normalized MSD vs. iterations) shown in Fig. 1 are obtained by averaging over 500 experiments.
All the other simulation parameters are shown in the text box of the figure. As we can see, PNLMS and the reformulated PLMS
significantly outperform the conventional NLMS. We can also see that the performances of the reformulated PLMS and PNLMS are the same, so
there is no penalty for the proposed reformulations.
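As a small illustration of the sparseness measure Sm used above, the sketch below evaluates it in plain Python. The function name `sparseness` is an assumption made for this example, and the input is assumed to be a non-zero vector of length greater than one.

```python
import math


def sparseness(h):
    """Sparseness measure S_m of an impulse response h (non-zero, len(h) > 1)."""
    L = len(h)
    l1 = sum(abs(x) for x in h)                 # ||h||_1
    l2 = math.sqrt(sum(x * x for x in h))       # ||h||_2
    return (L / (L - math.sqrt(L))) * (1.0 - l1 / (math.sqrt(L) * l2))
```

The measure is 1 for a response with a single non-zero tap and 0 for a response whose taps all have equal magnitude, so values close to 1, such as the 0.8637 used here, indicate a highly sparse system.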
[Figure: learning curves of NLMS, PNLMS and the proposed PLMS. x-axis: Iteration (n); y-axis: Normalized MSD (dB). Parameters: L = 512, Sm = 0.8637, SNR = 30 dB, ρPLMS = 0.001, δPNLMS = ρPNLMS = 0.01, µPLMS = 0.7, µPNLMS = 0.7, µNLMS = 0.7.]
Fig. 1: Performance comparison of PNLMS with Reformulated PLMS
C. Delayed Proportionate LMS (DPLMS)
The LMS algorithm in its original form is not suitable for real-time VLSI implementations because of the coefficient update
feedback loop. To overcome this, LMS is modified to a form known as delayed LMS (DLMS) [16]-[17], under the assumption
that the error gradient $e(n)\mathbf{u}(n)$ does not change much over the delay $M$. The weight update equation of DLMS is shown
below:
$$\mathbf{w}(n+1) = \mathbf{w}(n) + \mu\,\mathbf{u}(n-M)\,e(n-M), \tag{9}$$
where $e(n-M) = d(n-M) - y(n-M)$. With the introduction of the M delay registers in the feedback loop, we can apply
re-timing [18] to the LMS circuit, so that the critical path is reduced to either one multiplier delay $T_{mult}$ or one adder delay $T_{add}$, as desired
by the application. The concept of delayed adaptation can also be extended to proportionate LMS, and the corresponding weight
update equation for the delayed PLMS (DPLMS) algorithm is given by
$$\mathbf{w}(n+1) = \mathbf{w}(n) + \mu\,\mathbf{G}(n-M)\,\mathbf{u}(n-M)\,e(n-M), \tag{10}$$
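where
$$\mathbf{G}(n-M) = \mathrm{diag}(g_0(n-M), g_1(n-M), \ldots, g_{L-1}(n-M)), \tag{11}$$
with
$$g_i(n-M) = \frac{\gamma_i(n-M)}{\sum_{i=0}^{L-1}\gamma_i(n-M)}, \quad 0 \le i \le (L-1), \tag{12}$$
and
$$\gamma_i(n-M) = F[(w_i(n-M) + \rho)], \tag{13}$$
$$F[(w_i(n-M))] = |w_i(n-M)|. \tag{14}$$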
We now compare the performance of DPLMS with that of PLMS. We consider the same sparse system identification
problem as in the last subsection; the simulation parameters are shown in the text box of the figure. The learning curves (normalized MSD vs.
iterations) shown in Fig. 2 are obtained by averaging over 500 experiments. We can notice that after the initial phase, the
convergence rate of the PLMS algorithm slows down significantly, even becoming slower than NLMS. The large coefficients
converge very fast at the cost of dramatically slowing down the convergence of the small coefficients. DPLMS suffers even more
because of the directly proportional gain and the delayed adaptation. To address this issue, these re-formulations are extended to the
µ-Law PLMS (MPLMS), which has a more balanced gain distribution.
[Figure: learning curves of NLMS, DPLMS and PLMS. x-axis: Iterations (n); y-axis: Normalized MSD (dB). Parameters: L = 512, Sm = 0.8637, SNR = 30 dB, Delay = 5, ρ = 0.001, δNLMS = 0.01, µDPLMS = 0.5, µNLMS = 0.7, µPLMS = 0.7.]
Fig. 2: Performance comparison of PLMS with Delayed PLMS
D. Delayed µ-Law PLMS (DMPLMS)
µ-Law PNLMS (MPNLMS) [10] offers a more balanced distribution of the adaptation energy among all the coefficients, as
it is based on the optimization criterion that all the coefficients converge simultaneously to the ε-vicinity of their true values,
so that the overall convergence is the fastest. Here the gain is proportional to the natural logarithm of the absolute value of the
current weight instead of the absolute value of the current weight itself. This logarithm in the F[(.)] equation (shown below) is
the only difference between MPNLMS and PNLMS. The parameter ξ is a very small positive number, and its value should be
chosen based on the measurement noise level. For applications such as network echo cancelation, ξ = 0.001 is a good choice
because the echo below −60 dBm is negligible.
$$F[(w_i(n))] = \ln\left(1 + \frac{|w_i(n)|}{\xi}\right). \tag{15}$$
We extended all the aforementioned re-formulations to MPNLMS. For our VLSI implementations we also simplified the F[(.)] equation to the following
form and verified its effectiveness through simulations. Since we changed the natural
logarithm to the base-2 logarithm, k is chosen to be 6.
$$F[(w_i(n))] = \log_2\left(1 + \frac{|w_i(n)|}{2^{-k}}\right). \tag{16}$$
Because of this balanced gain distribution, the delayed adaptation of MPLMS does not suffer much. The performances of MPLMS
and DMPLMS (for the same experimental conditions) are shown in Fig. 3. We can see that delayed MPLMS with the logarithmic
simplifications is a robust algorithm for the white input case and is suitable for VLSI implementation. The formal algorithm is
summarized in Algorithm 1.
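Algorithm 1: Delayed MPLMS algorithm
Initialization: $w_i(1) = 0$, $0 \le i \le (L-1)$
Parameters: $\mu$, $\rho$, $k$
Update:
$$e(n) = d(n) - \mathbf{w}^T(n)\,\mathbf{u}(n)$$
$$F[(w_i(n))] = \log_2\left(1 + \frac{|w_i(n)|}{2^{-k}}\right)$$
$$\gamma_i(n) = F[(w_i(n) + \rho)]$$
$$g_i(n) = \frac{\gamma_i(n)}{\sum_{i=0}^{L-1}\gamma_i(n)}, \quad 0 \le i \le (L-1)$$
$$\mathbf{G}(n) = \mathrm{diag}(g_0(n), g_1(n), \ldots, g_{L-1}(n))$$
$$\Delta W = \mu\,\mathbf{G}(n-M)\,\mathbf{u}(n-M)\,e(n-M)$$
$$\mathbf{w}(n+1) = \mathbf{w}(n) + \Delta W$$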
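A floating-point sketch of Algorithm 1 in plain Python is given below as an illustration. It is not the fixed-point hardware model of this paper. It assumes that γ_i(n) is read as F(|w_i(n)|) + ρ, uses a noise-free white Gaussian input, and picks a small step size suited to the short toy filter in the example; the function names and default values are hypothetical.

```python
import math
import random
from collections import deque


def mu_law(w_abs, k=6):
    """Eq. (16): F = log2(1 + |w| / 2**-k)."""
    return math.log2(1.0 + w_abs * 2.0 ** k)


def gains(w, rho=0.001, k=6):
    """Gain factors, assuming gamma_i = F(|w_i|) + rho; the gains sum to one."""
    gamma = [mu_law(abs(wi), k) + rho for wi in w]
    total = sum(gamma)
    return [g / total for g in gamma]


def dmplms(w_opt, n_iter=5000, mu=0.05, rho=0.001, k=6, delay=5, seed=1):
    """Identify w_opt with delayed MPLMS; returns the final weight estimate."""
    rng = random.Random(seed)
    L = len(w_opt)
    w = [0.0] * L
    u = [0.0] * L
    pending = deque()                      # models the M pipeline registers
    for _ in range(n_iter):
        u = [rng.gauss(0.0, 1.0)] + u[:-1]             # tapped delay line
        d = sum(a * b for a, b in zip(w_opt, u))       # noise-free desired signal
        e = d - sum(a * b for a, b in zip(w, u))       # e(n) = d(n) - w^T(n) u(n)
        pending.append((gains(w, rho, k), u, e))
        if len(pending) > delay:
            # Use G, u and e from time n - M for the update at time n.
            g_old, u_old, e_old = pending.popleft()
            w = [wi + mu * gi * ui * e_old
                 for wi, gi, ui in zip(w, g_old, u_old)]
    return w
```

Calling `dmplms(w_opt)` with a short sparse `w_opt` (for example 16 taps with two non-zero entries) returns an estimate close to `w_opt`. Setting `delay=0` reduces the sketch to the non-delayed MPLMS update.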
[Figure: learning curves of NLMS, DMPLMS and MPLMS. x-axis: Iterations (n); y-axis: Normalized MSD (dB). Parameters: L = 512, Sm = 0.8637, SNR = 30 dB, ρ = 0.001, δNLMS = 0.01, Delay = 5, kMPLMS = 6, µDMPLMS = µMPLMS = µNLMS = 0.7.]
Fig. 3: Performance comparison of MPLMS with Delayed MPLMS
[Figure: block diagram with the following labeled elements: log2(N) stage adder tree producing y(n−M1) and ΣF(n−M1); error computation e(n−M1) from d(n−M1); multiplication by µ; LOG and ALOG converters; F(.) modules; sign conversion; pipeline registers; tap inputs u(n) and E; tap weight Wi(n); tap-out; and the quantity E = log2( µ e(n−M) / Σ_{i=0}^{L−1} F_i(n−M) ).]
Fig. 4: MPLMS Architecture
E. Logarithmic Number System to calculate the gradient
Even the simplified DPLMS gradient calculation involves a division. Fixed-point division is considerably more complicated [19]
than fixed-point multiplication and usually takes many more cycles. We therefore use the logarithmic
number system (LNS), which simplifies the division. The logarithmic conversion of a real number $Q$ can be obtained as
$\log_2(Q) = k + \log_2(1 + x)$ when $Q = 2^k(1 + x)$, where $k$ is the leading-one bit position and $x$ is a fraction. Depending on how
$\log_2(1 + x)$ is approximated, there are multiple schemes in the literature. It can be obtained by using a LUT with
$x$ as the index and $\log_2(1 + x)$ as the value, or by performing a simple conversion on $x$. Mitchell [20] used the value of $x$ directly
to approximate $\log_2(1 + x)$. In this scheme the log converter contains a simple Leading One Detector (LOD) circuit followed
by a barrel shifter, and the antilog converter is a concatenation of 1 and $x$ followed by a barrel shifter. The DMPLMS architecture
employing these log/antilog converters is explained in the next subsection.
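The Mitchell approximation described above can be illustrated with a short Python sketch. It models the leading-one detection with `int.bit_length` and keeps the fraction as a float rather than a fixed-width bit field, so it shows the arithmetic only and not the bit-level hardware behavior. The function names are hypothetical.

```python
import math


def mitchell_log2(q):
    """Approximate log2(q) for a positive integer q as k + x, where q = 2**k * (1 + x)."""
    if q <= 0:
        raise ValueError("q must be a positive integer")
    k = q.bit_length() - 1              # leading-one position (LOD)
    x = (q - (1 << k)) / (1 << k)       # fraction below the leading one (barrel shift)
    return k + x                        # Mitchell: log2(1 + x) is replaced by x


def mitchell_antilog2(v):
    """Approximate 2**v as 2**k * (1 + x), with k = floor(v) and x = v - k."""
    k = math.floor(v)
    x = v - k
    return (1.0 + x) * 2.0 ** k


def approx_divide(a, b):
    """Division of positive integers as a subtraction in the log domain."""
    return mitchell_antilog2(mitchell_log2(a) - mitchell_log2(b))
```

For example, `mitchell_log2(12)` gives 3.5 while the exact value is about 3.585, and `approx_divide(100, 7)` gives 14.5 against an exact quotient of about 14.29. The approximation is exact for powers of two.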
F. Proposed Architecture
The architecture for the proposed DMPLMS and DPLMS is shown in Fig. 4. Note that the only difference between the
DPLMS and DMPLMS architectures is the F(.) module. One of the PLMS/MPLMS taps is zoomed in and shown separately.
Each tap gets two inputs: one is the regressor input from the tapped delay line, and the other is the quantity E, which is defined as
$$E = \log_2\left(\frac{\mu\,e(n-M)}{\sum_{i=0}^{L-1} F_i(n-M)}\right), \tag{17}$$
and it is used in the gradient calculation. As mentioned in previous subsection, to reduce the complexity we use the LNS for
gradient calculation. Since logarithm of a negative number is undefined, we take the logarithm of absolute value of e(n−M)
and propagate the sign of e(n−M) to the sign conversion unit in each tap. Please note that sign of this E is same as sign of
e(n−M) as the denominator of Equation (17) is always positive. Gradient in LNS format can be written as
$$\Delta W = 2^{\log_2\left(\mu\,\mathbf{G}(n-M)\,\mathbf{u}(n-M)\,e(n-M)\right)}. \tag{18}$$
By expanding the gain matrix and separating the terms that are independent of the tap from those that depend on the tap, we
get
$$\Delta W = 2^{\log_2\left(\frac{\mu\,|e(n-M)|}{\sum_{i=0}^{L-1} F[w_i(n-M)]}\right) + \log_2\left(F[|w_i(n-M)|]\,|u(n-M)|\right)}. \tag{19}$$
By replacing the first term with E, which was defined above, we get the final equation for the gradient:
$$\Delta W = 2^{E + \log_2(F[w_i(n-M)]) + \log_2(|u(n-M)|)}. \tag{20}$$
As shown in each tap, the logarithms of F[|wi(n)|] and |ui(n)| are added and delayed by the required amount to match the delay
in the filter output path. The sum is then added to E, and finally the antilogarithm is applied to the result to get the magnitude
of the gradient. The sign of the gradient is decided by an XOR operation between sign(e(n−M)) and sign(u(n−M)). After the
required sign conversion, the result is added to the old weight to get the new weight for each tap. Each tap generates two
outputs, namely Tap out and F[wi(n)]. These two quantities from all the taps need to be added using an adder tree structure
to calculate the final filter output and the denominator of the gain matrix, respectively. We therefore fold the adder tree by a folding
factor of 2, i.e., we use the same adder tree in a time-shared fashion to calculate both quantities. If we use a carry-save adder
tree, the propagation delay is lower and thus we can run the adder tree at twice the clock rate of the FIR filter.
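The per-tap gradient computation of Eq. (20) can be sketched as follows. This illustration uses exact `math.log2` and floating-point exponentiation in place of the hardware log/antilog converters, so it only demonstrates that the log-domain sum with an XOR-style sign rule reproduces the direct product. The function name `lns_gradient` and its argument layout are assumptions made for this example.

```python
import math


def lns_gradient(mu, e, f_all, i, u):
    """Gradient of tap i via Eqs. (17) and (20).

    mu: step size, e: delayed error e(n-M), f_all: list of F values of all taps,
    i: tap index, u: delayed regressor sample seen by tap i.
    """
    if e == 0.0 or u == 0.0 or f_all[i] == 0.0:
        return 0.0                                   # log of zero is undefined
    # Shared term E, computed once from |e| and the adder-tree sum of F.
    E = math.log2(mu * abs(e) / sum(f_all))
    # Per-tap term: two logarithms and one addition, no multiplier or divider.
    exponent = E + math.log2(f_all[i]) + math.log2(abs(u))
    magnitude = 2.0 ** exponent                      # antilog
    negative = (e < 0) != (u < 0)                    # XOR of the two sign bits
    return -magnitude if negative else magnitude
```

In this sketch the only division is the one inside E, which is shared by all taps; in the log domain it becomes a subtraction.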
G. Fixed point implementation
We carried out simulations in MATLAB with 16-bit fixed-point representations and with all the re-formulations suggested above,
and compared the results with those of floating-point simulations to see whether the algorithms are tolerant to the
re-formulations and LNS approximations. Fig. 5 compares the MSD learning curves of both the DPLMS and DMPLMS algorithms,
obtained by averaging over 10 runs. Because of the iterative and stochastic nature of the algorithms, they are
tolerant to the logarithmic approximations, and hence the 16-bit fixed-point and floating-point curves coincide.
IV. TRANSFORM DOMAIN DELAYED WAVELET MPLMS ALGORITHM
A. Motivation
The convergence performance of LMS-type filters is highly dependent on the correlation of the input data and, in particular,
on the eigenvalue spread of the input correlation matrix R. Because of this, the convergence of LMS-type filters becomes
prohibitively slow for correlated input such as speech. We also found that the effect of delayed adaptation on the convergence
is much more pronounced for colored input than for white input, which can be observed from Fig. 7.
It can also be noted that the performance of the DMPLMS algorithm degrades with the adaptation delay M and becomes
worse than that of NLMS when the adaptation delay is increased beyond 10 clocks. This problem can be addressed by
transform domain adaptive filters [21]. The main idea behind transform domain adaptive filters is that a de-correlating unitary
transform is applied to the input, and then, by power normalizing each input component in the transform domain, the
autocorrelation matrix is made approximately an identity matrix. This approximately white input is then fed to the adaptive filter
for adaptation. Note, however, that the estimated filter weights are now also in the transform domain. Hence we need to
select a transform that de-correlates the input while preserving the sparsity of the transformed filter weights. For example, the
Discrete Cosine Transform (DCT) [21] has excellent de-correlating properties but does not preserve the sparsity, which
can be observed from Fig. 6(a) and Fig. 6(b), and thus we would lose the advantage of proportional adaptation. On the other
hand, the Discrete Wavelet Transform (DWT), with its time-frequency localization property, preserves the sparseness of the impulse
response [22], as shown in Fig. 6(c), and also exhibits decent de-correlation properties. Thus Wavelet MPNLMS (WMPNLMS) [12]
is more suitable for sparse adaptive filters under correlated input. Hence, before feeding the input to the proposed DMPLMS
adaptive filter, it is de-correlated using the DWT. The resultant DWMPLMS algorithm with all the proposed re-formulations
is summarized in Algorithm 2.
[Figure: learning curves of DPLMS 16-bit fixed point, DPLMS actual, DMPLMS 16-bit fixed point and DMPLMS actual. x-axis: Iterations (n); y-axis: Normalized MSD (dB). Parameters: µ = 0.25, L = 64, Sm = 0.76, SNR = 30 dB, Delay = 5, ρ = 0.001, kMPLMS = 6.]
Fig. 5: Performance of fixed point implementation
[Figure: three panels plotted against Tap #: (a) Sparse Impulse Response (tap weight), (b) DCT Weights, (c) DWT Weights.]
Fig. 6: Comparison of DCT/DWT coefficients.
[Figure: learning curves of NLMS, DMPLMS with M = 10, M = 5, M = 2 and M = 1, and MPLMS. x-axis: Iterations (n); y-axis: Normalized MSD (dB). Parameters: µNLMS = µMPLMS = 0.7, L = 512, Sm = 0.8637, SNR = 30 dB, ρ = 0.001, δ = 0.01, kMPLMS = 6.]
Fig. 7: Performance degradation of DMPLMS for color input
In the next section we study different wavelet transforms and their convergence performances in the framework of delayed
proportionate adaptation.
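Algorithm 2: Delayed WMPLMS algorithm
Initialization: $w_i(1) = 0$, $0 \le i \le (L-1)$
Parameters: $\mu$, $\rho$, $\beta$, $k$
Update:
$$\mathbf{u}_T(n) = \mathbf{T}\,\mathbf{u}(n)$$
$$\Psi_i(n) = \beta\,\Psi_i(n-1) + (1-\beta)\,|u_{T,i}(n)|, \quad 0 \le i \le (L-1)$$
$$\mathbf{D}(n) = \mathrm{diag}(\Psi_i(n))$$
$$e(n) = d(n) - \mathbf{w}_T^T(n)\,\mathbf{u}_T(n)$$
$$F[(w_{T,i}(n))] = \log_2\left(1 + \frac{|w_{T,i}(n)|}{2^{-k}}\right)$$
$$\gamma_i(n) = F[(w_{T,i}(n)) + \rho]$$
$$g_i(n) = \frac{\gamma_i(n)}{\sum_{i=0}^{L-1}\gamma_i(n)}, \quad 0 \le i \le (L-1)$$
$$\mathbf{G}(n) = \mathrm{diag}(g_0(n), g_1(n), \ldots, g_{L-1}(n))$$
$$\Delta w = \mu\,\mathbf{G}(n-M)\,\mathbf{D}^{-1}(n-M)\,\mathbf{u}(n-M)\,e(n-M)$$
$$\mathbf{w}_T(n+1) = \mathbf{w}_T(n) + \Delta w$$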
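The recursive power estimate Ψ_i(n) in Algorithm 2 is a first-order smoother of the magnitude of each transformed input component. A minimal sketch is shown below; the function name `update_power` and the value β = 0.875 are assumptions chosen for illustration and are not taken from the algorithm listing.

```python
def update_power(psi, u_t, beta=0.875):
    """One step of Psi_i(n) = beta * Psi_i(n-1) + (1 - beta) * |u_T,i(n)| for all i.

    beta = 0.875 is a hypothetical value: 1 - beta = 2**-3, so the update
    could also be written as psi + (|u| - psi) / 8.
    """
    return [beta * p + (1.0 - beta) * abs(x) for p, x in zip(psi, u_t)]
```

For a component of constant magnitude the estimate converges to that magnitude, so dividing the update of each tap by Ψ_i(n) equalizes the adaptation speed across transform-domain components of different power.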
B. Study of various wavelet Transforms and their de-correlating properties
We considered three families of wavelets [23], namely Haar, Symlets (denoted as Sym4) and Daubechies (denoted as db4), for
a comparative study of their convergence performance when used in the DWMPLMS algorithm. The results are shown in
Fig. 8. We see that the DWMPLMS algorithms significantly outperform the DMPLMS and DPLMS algorithms. Among the wavelet
families, Sym4 performs better than the other two. Even though the performance of the Haar wavelet is inferior to that of the other two,
Haar is better than the rest when we consider the performance-complexity tradeoff as the metric. Next we analyzed
the performance with various levels of Haar decomposition and noticed that there is a significant performance improvement
from two to three levels, whereas beyond three levels of decomposition the performance improvement is negligible. Thus we
chose the 3-level Haar wavelet for implementing the DWMPLMS algorithm. In the next section we show how a fast sliding
wavelet transform can be computed by taking the streaming nature of the input data into account.
[Figure: learning curves of NLMS, DPLMS, DMPLMS, DWMPLMS with 2-level HAAR, DWMPLMS with all-level HAAR, DWMPLMS with 3-level HAAR, DWMPLMS with all-level db4 and DWMPLMS with all-level sym4. x-axis: Iterations (n); y-axis: Normalized MSD (dB). Parameters: L = 512, Sm = 0.8960, SNR = 30 dB, ρ = 0.001, δNLMS = 0.01, µNLMS = 0.25, µPLMS = µMPLMS = 0.1, µWMPLMS = 0.18, µWMPLMS-Sym = µWMPLMS-Db4 = 0.36.]
Fig. 8: Performance of 3-level HAAR DWMPLMS
C. Sliding Wavelet transform
The data vector $\mathbf{u}(n)$ (which is defined as $\mathbf{u}(n) = [u(n), u(n-1), \ldots, u(n-N+1)]^T$) is updated at each new iteration by
letting one data sample enter and one leave. The transformed input vector is denoted by $\mathbf{u}_T(n)$, i.e., $\mathbf{u}_T(n) = \mathbf{T}\mathbf{u}(n)$.
The streaming nature of the input can be used to exploit the redundancies that exist between the calculations of the running wavelet
transforms of $\mathbf{u}(n)$ and $\mathbf{u}(n+2)$ (where $\mathbf{u}(n+2) = [u(n+2), u(n+1), \ldots, u(n-N+3)]^T$). Let $\mathbf{T}_8$ be the 8-point Sym2 DWT, which
has four low-frequency coefficients $h_0$, $h_1$, $h_2$ and $h_3$ and four high-frequency coefficients $g_0$, $g_1$, $g_2$ and $g_3$. Now consider
the matrix-vector products $\mathbf{T}_8\mathbf{u}(n)$, which is shown below, and $\mathbf{T}_8\mathbf{u}(n+2)$ for an 8-tap filter:
$$\mathbf{T}_8\mathbf{u}(n) = \begin{bmatrix}
h_0 & h_1 & h_2 & h_3 & 0 & 0 & 0 & 0 \\
0 & 0 & h_0 & h_1 & h_2 & h_3 & 0 & 0 \\
0 & 0 & 0 & 0 & h_0 & h_1 & h_2 & h_3 \\
h_2 & h_3 & 0 & 0 & 0 & 0 & h_0 & h_1 \\
g_0 & g_1 & g_2 & g_3 & 0 & 0 & 0 & 0 \\
0 & 0 & g_0 & g_1 & g_2 & g_3 & 0 & 0 \\
0 & 0 & 0 & 0 & g_0 & g_1 & g_2 & g_3 \\
g_2 & g_3 & 0 & 0 & 0 & 0 & g_0 & g_1
\end{bmatrix}
\begin{bmatrix}
u(n) \\ u(n-1) \\ u(n-2) \\ u(n-3) \\ u(n-4) \\ u(n-5) \\ u(n-6) \\ u(n-7)
\end{bmatrix}.$$
[Figure: (a) DWMPLMS Architecture: sliding 3-level U-HAAR block producing uT,1(n) ... uT,L(n), taps Tap 1 to Tap L each with LOG, ALOG, F(.), ABS, power estimation (PWR EST, with >> 3 shifts), sign conversion unit and pipeline registers, a log2(N) stage muxed adder tree producing y(n−M1) and ΣF(n−M1), and the quantity E = log2( µ e(n−M) / Σ_{i=0}^{L−1} F_i(n−M) ). (b) Sliding HAAR Architecture: U-HAAR even and odd blocks with Level 1 (L/2 outputs), Level 2 (L/4 outputs) and Level 3 (L/8 and L/8 outputs), using adders, subtractors, registers clocked with clk and registers clocked with clk/2.]
Fig. 9: Proposed Architectures.
In [24], it has been shown that by defining basic adder cells and reusing the partial sums, the computational complexity can
be significantly reduced, and the number of multiplications can even become independent of the filter order under certain
conditions. However, this reduction holds only for one level of decomposition. Once we go to a second level of decomposition,
i.e., $\begin{pmatrix} \mathbf{T}_4 & \mathbf{0}_4 \\ \mathbf{0}_4 & \mathbf{I}_4 \end{pmatrix}\mathbf{T}_8\mathbf{u}(n)$, the redundancies cease to exist, since the wavelet coefficients obtained in the first level, i.e., $\mathbf{T}_8\mathbf{u}(n)$,
no longer have the streaming nature.
This complicates the wavelet transform computation and makes it unsuitable for the DWMPLMS algorithm, where we need at least
3 levels of decomposition to get good de-correlating properties. On the other hand, if we consider the un-normalized Haar
wavelet, whose coefficients are simply ±1, the redundancies continue to exist through multiple levels of decomposition and
we can exploit them using a regular structure. For example, consider the 8-point un-normalized Haar wavelet matrix with two
levels of decomposition (i.e., the product matrix $\begin{pmatrix} \mathbf{T}_4 & \mathbf{0}_4 \\ \mathbf{0}_4 & \mathbf{I}_4 \end{pmatrix}\mathbf{T}_8$), applied to $\mathbf{u}(n)$ as shown below:
$$\begin{bmatrix}
1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \\
1 & 1 & -1 & -1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 1 & -1 & -1 \\
1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & -1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & -1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & -1
\end{bmatrix}
\begin{bmatrix}
u(n) \\ u(n-1) \\ u(n-2) \\ u(n-3) \\ u(n-4) \\ u(n-5) \\ u(n-6) \\ u(n-7)
\end{bmatrix}.$$
Now, consider the low-frequency wavelet components at the $n$th time index:
$$\begin{aligned}
u_{T,0}(n) &= u(n) + u(n-1) + u(n-2) + u(n-3), \\
u_{T,1}(n) &= u(n-4) + u(n-5) + u(n-6) + u(n-7), \\
u_{T,2}(n) &= u(n) + u(n-1) - [u(n-2) + u(n-3)], \\
u_{T,3}(n) &= u(n-4) + u(n-5) - [u(n-6) + u(n-7)].
\end{aligned} \tag{21}$$
If we define $a(n) = u(n) + u(n-1)$, then $u_{T,0}(n) = a(n) + a(n-2)$ and $u_{T,2}(n) = a(n) - a(n-2)$. Similarly, at the $(n+2)$th
time index,
$$\begin{aligned}
u_{T,0}(n+2) &= u(n+2) + u(n+1) + u(n) + u(n-1), \\
u_{T,1}(n+2) &= u(n-2) + u(n-3) + u(n-4) + u(n-5), \\
u_{T,2}(n+2) &= u(n+2) + u(n+1) - [u(n) + u(n-1)], \\
u_{T,3}(n+2) &= u(n-2) + u(n-3) - [u(n-4) + u(n-5)].
\end{aligned} \tag{22}$$
Now, $u_{T,0}(n+2) = a(n+2) + a(n)$ and $u_{T,2}(n+2) = a(n+2) - a(n)$. Note that only $a(n+2)$ needs to be evaluated at the
$(n+2)$th time index and all the other partial sums can be reused. Similarly, the high-frequency components at the $n$th time index are
$$\begin{aligned}
u_{T,4}(n) &= u(n) - u(n-1), \\
u_{T,5}(n) &= u(n-2) - u(n-3), \\
u_{T,6}(n) &= u(n-4) - u(n-5), \\
u_{T,7}(n) &= u(n-6) - u(n-7),
\end{aligned} \tag{23}$$
and at the $(n+2)$th time index
$$\begin{aligned}
u_{T,4}(n+2) &= u(n+2) - u(n+1), \\
u_{T,5}(n+2) &= u(n) - u(n-1), \\
u_{T,6}(n+2) &= u(n-2) - u(n-3), \\
u_{T,7}(n+2) &= u(n-4) - u(n-5).
\end{aligned} \tag{24}$$
Once two input samples are subtracted, the result is passed through a tapped delay line and the $L/2$ outputs are taken from $L/2$ registers. Similar redundancies exist between the $(n-1)$th and $(n+1)$th time indices. This scheme can be extended to multiple levels of decomposition using only adders/subtractors and registers. However, please note that the un-normalized Haar wavelet is non-orthogonal, which creates an issue when computing the filter output; this needs to be addressed and is explained later. The architecture of DWMPLMS with the sliding wavelet implementation is explained in the next section.
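The reuse of partial sums described above can be illustrated in software. The sketch below computes each pair sum a(n) and pair difference once for a whole input stream, assembles the two-level wavelet vector from delayed copies, and compares it with direct multiplication by the two-level matrix. It is only an analogy for the data reuse, not a model of the even/odd clocking in hardware; the function names are illustrative.

```python
import numpy as np

# Two-level un-normalized Haar product matrix from the text.
T2 = np.array([
    [1, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [1, 1, -1, -1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, -1, -1],
    [1, -1, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, -1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, -1, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, -1],
])

def sliding_two_level(u):
    """Return {n: wavelet vector at time n} built from shared partial results."""
    u = np.asarray(u)
    a = np.zeros_like(u)            # a[n] = u[n] + u[n-1]
    d = np.zeros_like(u)            # d[n] = u[n] - u[n-1]
    a[1:] = u[1:] + u[:-1]
    d[1:] = u[1:] - u[:-1]
    s = np.zeros_like(u)            # s[n] = a[n] + a[n-2]
    e = np.zeros_like(u)            # e[n] = a[n] - a[n-2]
    s[3:] = a[3:] + a[1:-2]
    e[3:] = a[3:] - a[1:-2]
    # Outputs 1 and 3 are delayed copies of outputs 0 and 2;
    # outputs 5..7 are delayed copies of output 4.
    return {
        n: np.array([s[n], s[n - 4], e[n], e[n - 4],
                     d[n], d[n - 2], d[n - 4], d[n - 6]])
        for n in range(7, len(u))
    }

def direct_two_level(u, n):
    """Reference: multiply the matrix by [u(n), u(n-1), ..., u(n-7)]."""
    window = np.asarray(u)[n - 7:n + 1][::-1]
    return T2 @ window

rng = np.random.default_rng(0)
u = rng.integers(-100, 100, size=64)
fast = sliding_two_level(u)
matches = all(np.array_equal(fast[n], direct_two_level(u, n)) for n in fast)
print(matches)
```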
D. Proposed Architecture
The DWMPLMS architecture is shown in Fig. 9(a) and is similar to the DMPLMS architecture except for the sliding wavelet transform block, the muxed adder tree and the power normalization block in each tap. These changes are explained in detail below. In this subsection we explain the sliding Haar wavelet block, and in subsequent sections we explain the others. The sliding un-normalized Haar wavelet implementation is shown in Fig. 9(b). It contains two identical blocks, U-HAAR even and U-HAAR odd. The U-HAAR even block is zoomed and shown separately. The switches shown in the U-HAAR even block close at the rate of clk/2 so as to sample only non-overlapping even pairs such as $u(n), u(n-1)$ and $u(n+2), u(n+1)$. Similarly, the odd pairs $u(n+1), u(n)$ and $u(n+3), u(n+2)$ are sampled in the U-HAAR odd block. Once the even pair is received, it is added and subtracted to get the low and high frequency components. The high frequency components are passed through a series of registers clocked on clk/2, which form the tapped delay line, and the $L/2$ high frequency outputs come from this delay line. The low frequency components are passed through the second level of decomposition. For the second level difference components, valid outputs are generated once in four clocks. Hence we require two registers which run on clk/2. If we used one register running on clk/4, we would lose the intermediate results and the output would be erroneous. Except for the top latch shown in green color, all other latches run on clk/2. Similarly, in the third level, we need four registers running on clk/2 between valid outputs to capture all the intermediate results. In the first level, we need $(L-2)/2$ registers for the difference. Similarly, at the second stage we need $(L-4)/2$ registers, and at the final stage we need $(L-8)/2$ registers for the detail coefficients and another $(L-8)/2$ registers for the average coefficients. We see that it is a register hungry design, but the critical path is just three adders/subtractors since there are no multipliers. Similarly, the redundancies between $u(n+1)$ and $u(n-1)$ are exploited in the U-HAAR odd block. The U-HAAR odd block has the same structure and also runs on clk/2, but in the complementary phase of the U-HAAR even block clock. At the output there is a switch which touches one of the sides depending on whether the clock count is even or odd, and the wavelet components are fed to the adaptive filter. We can observe that the critical path for both the DMPLMS and DWMPLMS architectures is $T_{mult}$, i.e., the delay of one multiplier.
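Summing the four register counts quoted above gives a rough feel for how the delay-line storage grows with the filter length $L$. This is only the arithmetic of the stated expressions; whether it covers every register in Fig. 9(b), or applies to each of the even and odd blocks separately, is not stated in the text.

```latex
\[
\frac{L-2}{2} + \frac{L-4}{2} + \frac{L-8}{2} + \frac{L-8}{2}
  = \frac{4L - 22}{2} = 2L - 11,
\qquad \text{e.g. } L = 512 \;\Rightarrow\; 2 \cdot 512 - 11 = 1013 .
\]
```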
E. Un-normalized Haar Transform
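The filter output of transform domain adaptive filters is given below. Since both the filter weights and the input are in the transform domain, if the transform is orthogonal, the filter output remains unchanged, as shown:
$$
\begin{aligned}
y(n) &= \mathbf{w}_T^T(n)\,\mathbf{u}_T(n) \\
     &= (\mathbf{T}\mathbf{w}(n))^T\,\mathbf{T}\mathbf{u}(n) \\
     &= \mathbf{w}^T(n)\,(\mathbf{T}^T\mathbf{T})\,\mathbf{u}(n) \\
     &= \mathbf{w}^T(n)\,(\mathbf{I})\,\mathbf{u}(n) \\
     &= \mathbf{w}^T(n)\,\mathbf{u}(n).
\end{aligned} \tag{25}
$$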
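The invariance in (25) can be checked numerically for an orthonormal transform. The sketch below builds a 3-level Haar matrix and row-normalizes it so that it becomes orthonormal; the normalization step and the helper name `unnormalized_haar` are assumptions made for this check only.

```python
import numpy as np

def unnormalized_haar(size, levels):
    """Multi-level Haar matrix with entries in {0, +1, -1} (illustrative helper)."""
    total = np.eye(size)
    active = size
    for _ in range(levels):
        eye = np.eye(active // 2)
        stage = np.eye(size)
        stage[:active, :active] = np.vstack([np.kron(eye, [1, 1]),
                                             np.kron(eye, [1, -1])])
        total = stage @ total
        active //= 2
    return total

T_raw = unnormalized_haar(8, 3)
# Dividing each row by its norm yields an orthonormal transform.
T = T_raw / np.linalg.norm(T_raw, axis=1, keepdims=True)

rng = np.random.default_rng(1)
w = rng.standard_normal(8)      # time-domain weight vector
u = rng.standard_normal(8)      # time-domain input vector

y_time = w @ u                  # w^T u
y_transform = (T @ w) @ (T @ u) # (Tw)^T (Tu)
is_orthonormal = np.allclose(T.T @ T, np.eye(8))
print(is_orthonormal, np.isclose(y_time, y_transform))
```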
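However, since the Haar wavelet used above is not orthogonal, with 3 levels of decomposition $(\mathbf{T}^T\mathbf{T})$ becomes
$$
(\mathbf{T}^T\mathbf{T}) =
\begin{bmatrix}
8 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 8 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 4 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 4 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 2 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 2 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 2 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 2
\end{bmatrix}, \tag{26}
$$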
and hence $y(n)$ is given by
$$
\begin{aligned}
y(n) ={}& 8[w_0 u(n) + w_1 u(n-1)] + 4[w_2 u(n-2) + w_3 u(n-3)] \\
        &+ 2[w_4 u(n-4) + w_5 u(n-5) + w_6 u(n-6) + w_7 u(n-7)].
\end{aligned} \tag{27}
$$
In this result, half of the partial MAC (multiply-accumulate) terms are scaled up by a factor of 2, half of the remaining terms are scaled up by a factor of 4, and the remaining terms by 8. These terms need to be scaled down by the same factors to get the correct output $y(n)$. We can achieve this by right shifting the result of the particular adder tree branch, as shown in Fig. 10.

Fig. 10: Mux Add Tree Architecture. [Figure: adder tree with multiplexers (muxing arrangement) whose branch results are right shifted by 3, 2 and 1 bit positions.]
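The diagonal entries in (26) coincide with the squared row norms of the 3-level un-normalized Haar matrix, and all of them are powers of two, which is why the correction reduces to right shifts. The sketch below writes the 8-point matrix out by the same construction as the two-level example and uses a hypothetical `combine_branches` function, with plain integers standing in for the three adder-tree branch sums of Fig. 10. Note that an integer right shift rounds toward negative infinity when the value is not an exact multiple of the scale factor.

```python
import numpy as np

# 3-level un-normalized Haar matrix for 8 points.
T = np.array([
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, -1, -1, -1, -1],
    [1, 1, -1, -1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, -1, -1],
    [1, -1, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, -1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, -1, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, -1],
])

row_energy = (T * T).sum(axis=1)                        # squared norm of each row
shifts = [int(e).bit_length() - 1 for e in row_energy]  # log2 of each power of two

def combine_branches(sum_x8, sum_x4, sum_x2):
    """Undo gains of 8, 4 and 2 with right shifts, then add the branches."""
    return (sum_x8 >> 3) + (sum_x4 >> 2) + (sum_x2 >> 1)

print(row_energy.tolist(), shifts)
result = combine_branches(8 * 5, 4 * (-3), 2 * 7)
print(result)
```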
Note that the muxing arrangement of Fig. 10 is independent of the filter length. In the adder tree architecture, the MAC terms which are scaled up by a factor of 2 always come from the second branch of the adder tree, and hence this result is right shifted by one bit position to scale it down by 2; this does not require any extra hardware. Similarly, the results of the other two branches are right shifted by 2 and 3 bit positions. The choice of the un-normalized Haar wavelet has three advantages: we can exploit the redundancies in the sliding wavelet transform through multiple levels of decomposition, we can do without multipliers in calculating the DWT, and no special hardware is required for scaling down the output. The multiplexing arrangement is necessary in the adder tree since we are folding the adder tree by a factor of 2 (just like the DMPLMS case), and scaling is not needed during calculation of the denominator of the proportional gain.

Fig. 11: System impulse responses used in simulations (amplitude versus samples n, 512 samples). (a) Highly sparse system (sparse network echo path in ITU-T G.168), Sm = 0.8960. (b) Semi sparse system, Sm = 0.5560. (c) Non-sparse system, Sm = 0.3486.

Fig. 12: Performance comparisons (normalized MSD in dB versus iterations n) for DNLMS, DPLMS, DMPLMS, DWMPLMS Proposed and DWMPLMS-SYM/Sym4. (a) White input sparse system. (b) White input semi sparse system. (c) White input disperse system. (d) Color input sparse system. (e) Color input semi sparse system. (f) Color input disperse system.
F. Absolute power normalization
In this section we explain the third difference between the DMPLMS and DWMPLMS architectures, i.e., power normalization. After applying the orthogonal transform to the input, the input correlation matrix becomes diagonal, but it does not become the identity matrix. Thus we need to normalize the correlation matrix with the power of each tap to make it the identity matrix. The power of each tap is generally estimated using the following equation:
$$
\Psi_i(n) = \beta\,\Psi_i(n-1) + (1-\beta)\,(u_{T,i}(n))^2, \quad 0 \le i \le (L-1). \tag{28}
$$
Fig. 13: Performance comparison for speech input. (a) Speech input (magnitude versus sample number). (b) Tracking performance (normalized MSD in dB versus time in seconds) for DNLMS, DPLMS, DMPLMS, DWMPLMS Proposed and DWMPLMS Sym4.

Table III: Synthesis results for TS18 180nm CMOS technology

Algorithm  # of taps  Clock freq. (MHz)  Leakage Power (µW)  Dynamic Power (mW)  Cell Area (kGE) a  Core Area (µm²)
DLMS       16         100                0.609               38.03               41.1               539290
DLMS       32         100                1.21                66.45               82.04              1085032
DLMS       64         100                2.43                114.39              164.1              2176258
DPLMS      16         100                0.802               33.36               49.57              650031
DPLMS      32         100                1.57                61.95               97.26              1280869
DPLMS      64         100                3.13                119.45              193.47             2558840
DMPLMS     16         100                0.867               47.67               52.88              693356
DMPLMS     32         100                1.70                91.72               103.83             1367065
DMPLMS     64         100                3.39                176.97              206.98             2736847
DWMPLMS    16         100                1.007               58.31               66.18              863061
DWMPLMS    32         100                2.01                115.1656            130.84             1722799
DWMPLMS    64         100                3.39                229.8               260.13             3441479

a One gate equivalent corresponds to the size of a two input NAND gate of size 12.54 µm².

However, the estimate in (28) adds a significant area penalty as it affects all the taps. To simplify this, we use the absolute value of the input instead of the square of the input, and the new equation is shown below. We will show through simulations, which are explained in the next section, that the penalty for this approximation is negligible.
$$
\Psi_i(n) = \beta\,\Psi_i(n-1) + (1-\beta)\,|u_{T,i}(n)|, \quad 0 \le i \le (L-1). \tag{29}
$$
We choose $\beta = 0.125$, i.e., $1/8$, to avoid multiplication, and this power estimation block is zoomed and shown in Fig. 9(a). In the next section we present detailed simulation results for the proposed algorithm.
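With β = 1/8, the update in (29) needs only shifts, one subtraction and one addition. The sketch below shows one possible shift-and-subtract realization on integers; this specific arrangement is an assumption for illustration, since the text only states that the choice of β avoids multiplication. The function names are hypothetical.

```python
def power_update(psi_prev, u_t):
    """One step of (29) with beta = 1/8 on non-negative integers, using shifts only."""
    mag = abs(u_t)
    # beta * psi_prev -> psi_prev >> 3 ; (1 - beta) * mag -> mag - (mag >> 3)
    return (psi_prev >> 3) + (mag - (mag >> 3))

def power_update_exact(psi_prev, u_t, beta=0.125):
    """Floating-point reference for (29)."""
    return beta * psi_prev + (1.0 - beta) * abs(u_t)

psi = 0
for sample in [40, -64, 12, -8, 100]:
    psi = power_update(psi, sample)
print(psi)
```

Because each shift truncates, a single integer step differs from the floating-point reference by less than one least significant bit.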
V. SIMULATION RESULTS
To evaluate the performance of the DWMPLMS algorithm with all the proposed re-formulations and the 3-level un-normalized Haar wavelet, experiments were performed in the context of echo cancelation, which is the main application of sparse adaptive filters. We considered three systems of varying sparsity (Fig. 11). The first system is a network echo path from the ITU-T G.168 Recommendation; its impulse response can be considered very sparse since the associated sparseness measure is 0.8960. The second is a semi sparse system with a sparseness measure of 0.5560, and the third is a disperse system with a sparseness measure of 0.3486. All the impulse responses have 512 coefficients at a sampling rate of 8 kHz. All adaptive filters used in the experiments have the same length, i.e., L = 512. The far-end signal (i.e., the input signal) is a white Gaussian signal, a color input obtained by passing a white Gaussian input through a first-order AR process with the pole at 0.95, or a speech sequence. The output of the echo path is corrupted by an independent white Gaussian noise (i.e., the background noise at the near-end) with 30 dB echo-to-noise ratio (ENR). Adaptation step sizes for the color input case are the same as those mentioned in and for the white case, $\mu_{dnlms} = \mu_{dmplms} = \mu_{dwmplms-sym4} = 0.25$, $\mu_{dplms} = 0.22$, $\mu_{dwmplms-proposed} = 0.15$. The step sizes are adjusted such that the steady-state MSD is equal. We can see from Fig. 12 that, for both color and white inputs, as the sparsity decreases the performance of DPLMS/DMPLMS becomes worse than that of DNLMS, whereas DWMPLMS performs better than or equal to DNLMS. For network echo cancelation, which falls under the color input sparse system case, DWMPLMS significantly outperforms all the other three. We see that even if the sparsity or input correlation varies over a wide range, DWMPLMS performs consistently. We also evaluated the tracking performance of the proposed algorithm (by changing the echo path midway, shifting the impulse response to the right by 12 samples) using the speech input shown in Fig. 13(a) for the sparse system; the result is shown in Fig. 13(b). All other simulation parameters are the same as in the color case, with $\mu_{dnlms} = 0.2$, $\mu_{dmplms} = \mu_{dplms} = 50$, $\mu_{dwmplms-sym4} = 2.5$, $\mu_{dwmplms-proposed} = 1.1$. Please note that we need to adjust the step sizes depending on the input variance, which is very low for the speech input, hence the large value of $\mu$ for DMPLMS and DPLMS. We can see that the performances are similar to the color case, and we conclude that DWMPLMS with the U-HAAR 3-level decomposition is the robust VLSI solution for echo cancelation with varying sparsity.
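The color input and the 30 dB ENR condition described above can be reproduced along the following lines. This is a minimal sketch under stated assumptions: the echo path `h` is a made-up sparse response with 512 taps (not the G.168 path), unit-variance driving noise is assumed for the AR process, and the function names are illustrative.

```python
import numpy as np

def colored_input(n_samples, pole=0.95, seed=0):
    """White Gaussian noise through a first-order AR filter: u(n) = pole*u(n-1) + v(n)."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(n_samples)
    u = np.empty(n_samples)
    state = 0.0
    for k in range(n_samples):
        state = pole * state + v[k]
        u[k] = state
    return u

def add_background_noise(echo, enr_db=30.0, seed=1):
    """Add white Gaussian noise scaled to the requested echo-to-noise ratio."""
    rng = np.random.default_rng(seed)
    noise_power = np.mean(echo ** 2) / (10.0 ** (enr_db / 10.0))
    noise = np.sqrt(noise_power) * rng.standard_normal(len(echo))
    return echo + noise, noise

# Hypothetical sparse echo path with L = 512 taps.
h = np.zeros(512)
h[[30, 31, 32, 40]] = [0.2, -0.1, 0.05, 0.02]

u = colored_input(20000)
echo = np.convolve(u, h)[:len(u)]          # echo path output
d, noise = add_background_noise(echo)      # desired signal seen by the adaptive filter
enr = 10.0 * np.log10(np.mean(echo ** 2) / np.mean(noise ** 2))
lag1 = np.corrcoef(u[1:], u[:-1])[0, 1]    # should be close to the pole value
print(round(enr, 1), round(lag1, 2))
```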
VI. VLSI IMPLEMENTATION RESULTS AND COMPARATIVE STUDY
In this section we provide the synthesis results of the proposed DWMPLMS, DMPLMS and DPLMS architectures. Since no previous architectures have been reported for them in the literature, we compare the results with the standard DLMS to see how much the complexity increases for the given improvement in convergence performance. All the designs are implemented in Verilog and synthesized using Synopsys Design Compiler with the TS18 standard cell library (Tower Semiconductor 180nm CMOS technology). A 16-bit fixed point representation is considered for all the designs. Filter lengths of 16, 32 and 64 are considered for comparing the scalability of time and area complexities, and all the designs are targeted for 100 MHz for fair comparison. The results are summarized in Table III.
Compared to the standard DLMS, the re-formulated DPLMS has 20.53%, 18.04% and 17.57% more area complexity for 16, 32 and 64 taps respectively. We see that the complexity increase is not exactly linear, which is intuitive because of overheads like the adder tree, error calculation and G calculation blocks. Similarly, DMPLMS has 28.56%, 25.99% and 25.75% area complexity increase compared to DLMS. So, with roughly 25% increase in hardware, we get a 3x improvement in convergence performance in the case of white input. Finally, for color input, DWMPLMS has 60.03%, 58.77% and 58.13% area increase compared to DLMS for 16, 32 and 64 taps respectively. So we can conclude that with a 58% increase in hardware we are able to achieve a 15x improvement in convergence for color input.
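The quoted overheads can be recomputed from Table III. The percentages appear to follow from the core-area column (they match it to within the last quoted digit), so the sketch below uses those values; the dictionary layout and function name are illustrative.

```python
# Core area in um^2 from Table III, indexed by algorithm and number of taps.
core_area_um2 = {
    "DLMS":    {16: 539290, 32: 1085032, 64: 2176258},
    "DPLMS":   {16: 650031, 32: 1280869, 64: 2558840},
    "DMPLMS":  {16: 693356, 32: 1367065, 64: 2736847},
    "DWMPLMS": {16: 863061, 32: 1722799, 64: 3441479},
}

def overhead_percent(algorithm, taps):
    """Area increase of `algorithm` over DLMS for the same tap count, in percent."""
    base = core_area_um2["DLMS"][taps]
    return 100.0 * (core_area_um2[algorithm][taps] - base) / base

for name in ("DPLMS", "DMPLMS", "DWMPLMS"):
    print(name, [round(overhead_percent(name, t), 2) for t in (16, 32, 64)])
```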
Finally, we realized one higher order filter with 256 taps for the proposed DWMPLMS; the synthesis results at a 100 MHz target frequency are shown in Table IV and the detailed area break-up is shown in Table V. We see that the filter taps occupy most of the area, followed by the sliding HAAR unit, which occupies around 10% of the area. One can notice that the core area is on the higher side, but that is because of the 180nm process; with more advanced technology nodes it will scale down accordingly.
Table IV: Synthesis Results of 256-tap DWMPLMS

Parameter        Value
Clock freq.      100 MHz
Dynamic Power    417 mW
Leakage Power    16.9 µW
Cell Area        1037 kGE
Core Area        13.78 mm²

Table V: Area Break Down of the Sub-blocks

Sub-block            Area (kGE)   Share (%)
E Calculation        1.62         0.16
Adder Tree           27.98        2.7
256 Taps             898.75       86.62
Sliding HAAR unit    109.1        10.52
Total                1037.45      100
VII. CONCLUSIONS AND FUTURE WORK
Proportionate-type adaptive algorithms can significantly improve the convergence performance of sparse adaptive filters compared to the conventional LMS algorithm. However, the huge computational penalty associated with these algorithms makes their VLSI realization a highly challenging task even with the most advanced technology nodes. By utilizing the fact that stochastic gradient algorithms are tolerant to approximations, we proposed several re-formulations in this paper to simplify the original PNLMS algorithms, compared the performances of these algorithms with those of the original algorithm, and proposed efficient VLSI architectures for the re-formulated algorithms. We also demonstrated that DWMPLMS with the proposed 3-level un-normalized HAAR wavelet is a robust VLSI solution for practical echo cancelers with time varying sparsity. This is the first attempt to implement proportionate-type algorithms in hardware, and this result will motivate other researchers to explore more efficient hardware solutions to further improve sparse adaptive filter architectures.
REFERENCES
[1] S. Weinstein, “Echo Cancellation in the Telephone Network”, in IEEE Communi- cations Magazine, vol. 15, iss. 1, pp. 8-15, Jan. 1977.
[2] M. Stojanovic and J. Preisig, “Underwater Acoustic Communication Channels: Propagation Models and Statistical Characterization,” in IEEE Communi-
cations Magazine, pp. 84-88, January 2009.
[3] L. Fan, C. He, D. Wang, and L. Jiang, “Efficient Robust Adaptive Decision Feedback Equalizer for Large Delay Sparse Channel,” in IEEE Trans. on
Consumer Electronics, vol. 51, No. 2, pp. 449-456, May 2005.
[4] C. Paleologu, J. Benesty, S. Ciochin, Sparse Adaptive Filters for Echo Cancellation, 2010, Morgan & Claypool.
16
[5] D. L . Duttweiler, “Proportionate normalized least-mean-squares adaptation in echo cancelers”, in IEEE Trans. Speech Audio Process., vol. 8, no. 5, pp.
508-518, 2000.
[6] S. Haykin and B. Widrow, Least-Mean-Square Adaptive Filters. Hoboken, NJ, USA: Wiley, 2003.
[7] S. Gay, “An efficient, fast converging adaptive filter for network echo cancellation,” in Proc. 32nd Asilomar Conf. on Signal and System for Computing,
vol. 1, 1998, pp. 394-398.
[8] J. Benesty and S. L. Gay, “An improved PNLMS algorithm,” in IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), vol. 2, pp.1881-1884, 2002.
[9] H. Deng and M. Doroslovacki, “Proportionate Algorithms for Echo Cancellation”, in IEEE Trans. on Signal Processing., vol. 54, no. 5, pp. 1794-1803,
May 2006.
[10] H. Deng and M. Doroslovacki, “Improving convergence of the PNLMS algorithm for sparse impulse response identification,” in IEEE Signal Process.
Letters, vol. 12, no. 3, pp. 181-184, 2005.
[11] K. Wagner, M. Doroslovacki, and H. Deng “Convergecne of proportionate-type NLMS adaptive filters and choice of gain matrix,” Proc. 40th Asilomar
on Sig- nals, Systems, and Computers, Pacific Grove, CA, Oct. 29 - Nov. 1, 2006.
[12] H. Deng and M. Doroslovacki, “Wavelet-Based MPNLMS Adaptive Algorithm for Network Echo Cancellation,” EURASIP Journal on Audio, Speech,
and Music Processing., vol. 2007, Oct. 2007.
[13] Dyba, Roman A and He, Perry P and Pessoa, Lu´cio FC, Network Echo Cancellers and Freescale Solutions Using the StarCore SC140 Core , 2004,
Freescale Application Note AN2598/D.
[14] Chen, Jie, Efficient VLSI Architectures for High-Speed Ethernet Transceivers, , 2008, ProQuest.
[15] V. C. Gogineni, S. Mula, R. L. Das and M. Chakraborty, “Performance analysis of proportionate-type LMS algorithms,“ 2016 Signal Processing:
Algorithms, Architectures, Arrangements, and Applications (SPA), Poznan, Poland, 2016, pp. 177-181.
[16] G. Long, F. Ling, and J. G. Proakis, “The LMS algorithm with delayed coefficient adaptation,“ in IEEE Trans. Acoust., Speech, Signal Process., vol.
37, pp. 1397-1405, Sep. 1989.
[17] P. K. Meher and S. Y. Park, “Area-Delay-Power Efficient Fixed-Point LMS Adaptive Filter With Low Adaptation-Delay,”in IEEE Trans., Very Large
Scale Integr. (VLSI) Syst., vol. 22, no. 2, pp. 362-371, 2014.
[18] T. C. Denk and K. K. Parhi, “Exhaustive scheduling and retiming of digital signal processing systems,“ in IEEE Trans., Circuits Syst. II, Analog Digit.
Signal Process.,, vol. 45, no. 7, pp. 821-838, Jul 1998.
[19] S. F. Oberman and M. Flynn, “Division algorithms and implementations,“ IEEE Trans. Comput., vol. 46, no. 8, pp. 833-854, Aug. 1997.
[20] J. N. Mitchell, Jr., “Computer multiplication and division using binary logarithms,“ in IRE Trans. Electron. Comput., vol. 11, no. 11, pp. 512-517, Aug.
1962.
[21] D. F. Marshall, W. K. Jenkins and J. J. Murphy, “The use of orthogonal transforms for improving performance of adaptive filters,“ in IEEE Trans.,
Circuits Syst.,, vol. 36, no. 4, pp. 474-484, Apr 1989.
[22] K. C. Ho and S. D. Blunt, ”Adaptive sparse system identification using wavelets,” in IEEE Trans., Circuits Syst. II, Analog Digit. Signal Process.,, vol.
49, no. 10, pp. 656-667, Oct 2002.
[23] G. Strang and T. Nguyen,Wavelets and Filter Banks. Cambridge, MA: Wellesley-Cambridge , 1996.
[24] Samir Attallah, “The Wavelet Transform-Domain LMS Algorithm: A More Practical Approach,“ in IEEE Trans., Circuits Syst. II, Analog Digit. Signal
Process.,, vol. 47, no. 3, pp. 209-213, Mar 2000.
