Abstract: Fine-grain pipelined adaptive decision-feedback equalizer (ADFE) architectures are developed using the relaxed look-ahead technique. This technique, which is an approximation to the conventional look-ahead computation, maintains the functionality of the algorithm rather than its input-output behavior. Thus, it results in substantial hardware savings as compared to either parallel processing or look-ahead techniques. Pipelining of the decision feedback loop and the adaptation loop is achieved by the use of delay relaxation and sum relaxation. Both the conventional and the predictor forms of the ADFE have been pipelined. Results of the convergence analysis of the proposed algorithms are also provided. The performance of the pipelined algorithms for the equalization of a magnetic recording channel is studied. It is shown that the conventional ADFE results in an SNR loss of about 0.6 dB per unit increase in the speed-up factor. The predictor form of ADFE is much more robust and results in less than 0.1 dB SNR loss per unit increase in the speed-up factor. Speed-ups of up to 8 and 45 have been demonstrated for the conventional and predictor forms of ADFE, respectively.
I. INTRODUCTION
Design of digital signal processing (DSP) algorithms for high-throughput applications, such as video compression, is currently of great interest. In the area of digital communications, there is a growing need for high-speed equalizers for applications such as high-density magnetic storage systems, subscriber loop applications and mobile radio. The adaptive decision-feedback equalizer (ADFE) has been employed successfully for combating inter-symbol interference (ISI). However, the ADFE has remained difficult to pipeline, and in this paper we propose a novel approach for fine-grain pipelining of the ADFE.
Two popular approaches for achieving high processing speed are pipelining [1] and parallel processing [2]. From a single-chip implementation point of view, the pipelining approach holds a distinct advantage due to its lower hardware cost. Recently, the utility of pipelined systems in low-speed, low-power applications such as speech codecs in mobile phones and other portable applications has also been observed [3]. Conventionally, algorithm transformation techniques [4] such as look-ahead have been employed to introduce concurrency in serial algorithms. Originally proposed for pipelining of recursive fixed-coefficient filters [1], [5], [6], the look-ahead technique has been successfully applied to pipeline two-dimensional recursive filters [7], dynamic programming [8], [9], finite state machines [10], quantizer loops [11], and adaptive digital filters [12], [13]. The implementation of a recursive filter chip capable of 86 million operations per second [14] has clearly demonstrated the feasibility of fine-grained pipelining.

Manuscript received September 8, 1993; revised November 10, 1994. This research was supported by the Army Research Office under Contract DAAL-90-G-0063. The associate editor coordinating the review of this paper and approving it for publication was Dr. J.
The look-ahead technique, however, results in a large hardware overhead as it transforms a serial algorithm into an equivalent (in the sense of input-output behavior) pipelined algorithm. While past research demonstrated the use of look-ahead in the design of pipelined recursive signal processing algorithms, our current research is concerned with the design of inherently pipelinable algorithms. These algorithms have more than one delay element in any feedback loop and therefore can be pipelined without requiring any hardware increase as compared with the sequential ones. To this end we have developed the relaxed look-ahead technique [15] for the pipelining of adaptive digital filters. The relaxed look-ahead sacrifices the equivalence between the serial and pipelined algorithms at the expense of marginally altered convergence characteristics. Therefore, it maintains the functionality of the algorithm and is well suited for adaptive filtering applications.
Relaxed look-ahead involves approximating the algorithms obtained via look-ahead. A number of approximations are possible and each would result in a different algorithm. For example, in this paper we present the delay relaxation and the sum relaxation as two possible approximations, which can be used for pipelining of the ADFE. In the context of adaptive filtering, the approximations can be quite crude and yet result in minimal performance loss. However, in all cases, the resulting pipelined algorithm requires minimal hardware increase and achieves a higher throughput or requires lower power as compared to the serial algorithm. Note that the relaxed look-ahead has already been employed for the pipelining of the least mean-squared (LMS) filter [16], the stochastic gradient lattice filter [17], the adaptive differential vector quantizer [18], and the adaptive differential pulse code modulation codec [16].
As mentioned before, fine-grain pipelining of the ADFE is known to be a difficult problem. This is mainly due to the fact that the ADFE has a nonlinear element (a quantizer) in the decision-feedback loop (DFL). The conventional ADFE (to be referred to as ADFE) in Fig. 1(a) and the predictor form of ADFE (to be referred to as predictor ADFE) in Fig. 1(b) consist of the feedforward filter (FFF), the feedback filter (FBF), the quantizer (Q), and the coefficient update blocks WUC (for FFF) and WUD (for FBF). The delays Δ are employed to adjust the position of the main tap of FFF, which is usually the center-most tap. In addition to the DFL, the presence of the adaptation loop makes it even more difficult to achieve pipelining. Hence, past work [19]-[22] on high-speed ADFE architectures has almost exclusively adopted parallelization. In general, the algorithms in [20]-[22] result in a performance loss due to incorrect initialization of the FFF and a coding loss due to the initialization of the FBF. The ADFE architecture in [21] has the advantage of offering a performance gain as it decomposes the channel into a set of parallel ISI-free channels. However, it requires knowledge of the channel coefficients. In [23], circuit and architectural techniques, such as the use of the transpose form for the FFF and the FBF, have been employed to achieve high speed in a DFE. A VLSI implementation of an ADFE with two delays in the DFL has been proposed [24]. Performance degradation was observed in [24] due to the fact that the FBF cannot cancel the most significant ISI term.
From the discussion above it is clear that a fine-grain pipelined ADFE algorithm with minimal hardware overhead and a negligible performance loss would be desirable. In this paper, we employ relaxed look-ahead to develop such algorithms. To this end we first generalize an existing straightforward pipelining approach to obtain the pipelined ADFE architecture referred to as PIPADFE1 (to be read as "Pipe 1").
Then we propose two new pipelined ADFE architectures referred to as PIPADFE2 and PIPADFE3, which are pipelined versions of the conventional and the predictor ADFE structures, respectively. We conclude that PIPADFE2 can converge much faster than PIPADFE1 and is an attractive alternative for implementation of conventional ADFE algorithms. Furthermore, we also show that PIPADFE3 is a robust pipelined architecture with respect to the level of pipelining or speed-up factor. These algorithms require much smaller hardware overhead and are attractive from a VLSI implementation point of view. In addition, these algorithms do not suffer from any coding loss. Similar to the parallel architectures [20], [22], the pipelined architectures do suffer from performance degradation as the speed-up increases, although for a different reason. This degradation is mainly due to the coarseness of the approximations made in applying the relaxed look-ahead. Therefore, by improving these approximations it is possible to reduce the performance loss. This is a desirable flexibility offered by relaxed look-ahead. The performance of the proposed algorithms is compared with a generalized version of the algorithm proposed in [24].

This paper is organized as follows. In Section II, we present the relaxed look-ahead technique, which is then applied to pipeline the ADFE in Section III. In Section IV, we analyze the performance of the pipelined algorithms and compare them with that of the serial algorithm. Convergence analysis results are presented in Section V. Simulation results are presented for the equalization of a magnetic recording channel in Section VI.
II. THE RELAXED LOOK-AHEAD
In this section, we introduce the relaxed look-ahead as an approximation to the look-ahead. In order to do this, consider the following equations, which describe a linear adaptive estimator with a first-order weight-update recursion
where W(n) is an N x 1 vector of coefficients of the filter FIR (see Fig. 2(a)), μ is the adaptation step-size, e(n) is the estimation error, X(n) is the N x 1 input vector, and s(n) is the desired signal. The first-order recursion (1a) also describes the weight-update recursion of the ADFE. Therefore, the relaxed look-ahead pipelining results discussed in this section in the context of this first-order recursion can be directly applied to pipeline the ADFE (see Section III). From Fig. 2(a) (and (1)), we can see that there are two major feedback paths, which present a bottleneck for high-throughput applications. The first is called the error feedback path, which consists of the filter FIR, the adder, and the weight-update block WUC. The second path is the weight-update recursion defined by (1a). In order to break this bottleneck, we can apply the look-ahead [1] pipelining technique. An M-stage pipelined algorithm can be derived from (1) by the application of an M-step look-ahead to (1). It can be easily checked that the hardware required to do so is quite large because this process involves computing W(n) from W(n - M). Note that the look-ahead transformation results in M latches in the recursive loop. These latches can be redistributed or retimed [25] to pipeline the feedback multiply-add operation by M levels.
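For orientation, the recursion referred to as (1) can be written out under the assumption that it has the standard LMS form, which is consistent with the symbol definitions above and with the gradient estimate e(n)X(n) used in the following subsections. The exact split into (1a) and (1b) and the use of W(n - 1) in the error are assumptions of this sketch. The third line unrolls (1a) M times and shows why an exact M-step look-ahead is costly: every e(n - i) in the sum still depends on W(n - i - 1), which must itself be re-expressed in terms of W(n - M).

```latex
\begin{align}
W(n) &= W(n-1) + \mu\, e(n)\, X(n), \tag{1a}\\
e(n) &= s(n) - W^{T}(n-1)\, X(n), \tag{1b}\\[4pt]
W(n) &= W(n-M) + \mu \sum_{i=0}^{M-1} e(n-i)\, X(n-i)
      \qquad \text{(M-step look-ahead applied to (1a))}.
\end{align}
```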
However, by considering the error feedback path and weight-update recursion separately, we can pipeline the adaptive estimator in a hardware efficient manner. In particular, we pipeline the error feedback path by the delay relaxation, while the weight-update recursion is pipelined by the sum relaxation.
A. Delay Relaxation
The delay relaxation is shown in Fig. 2(b), where the error e(n) and the input X(n) are delayed by D1 samples before being employed in the WUC. This transformation is made on the basis of the assumption that the gradient estimate e(n)X(n) does not change substantially over D1 clock cycles. The delay relaxation has been employed in [26], [27] to develop the "delayed LMS" algorithm. A thorough convergence analysis of this kind of delayed adaptation scheme was also done in [26], [27]. It was concluded that, in a stationary environment and with a small step-size, the degradation in convergence speed and adaptation accuracy is negligible for small delays. As we shall see later, the degradation in convergence behavior is also dependent on the algorithm topology. However, from an architectural point of view the delay relaxation is an effective method for pipelining. This is because the D1 delays can now be employed to pipeline the FIR, and the hardware overhead is just the pipelining latches. Note that the "delayed LMS" algorithm is a special case of the filtered X-LMS algorithm [28].
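The delay relaxation described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard LMS error e(n) = s(n) - W^T(n-1)X(n); the function name `delayed_lms`, the system-identification setup and all parameter values are hypothetical choices for this example and are not taken from the paper.

```python
import random
from collections import deque


def delayed_lms(x, s, num_taps, mu, d1):
    """LMS with delay relaxation: the update at time n uses e(n-d1) and X(n-d1)."""
    w = [0.0] * num_taps
    pending = deque()  # (error, input vector) pairs waiting d1 samples
    errors = []
    for n in range(len(x)):
        # X(n) = [x(n), x(n-1), ..., x(n-N+1)], zero-padded before time 0
        xn = [x[n - k] if n - k >= 0 else 0.0 for k in range(num_taps)]
        e = s[n] - sum(wk * xk for wk, xk in zip(w, xn))
        errors.append(e)
        pending.append((e, xn))
        if len(pending) > d1:
            e_old, x_old = pending.popleft()  # gradient term from d1 samples ago
            w = [wk + mu * e_old * xk for wk, xk in zip(w, x_old)]
    return w, errors


if __name__ == "__main__":
    rng = random.Random(0)
    true_w = [0.5, -0.3, 0.1]  # hypothetical system to identify
    x = [rng.gauss(0.0, 1.0) for _ in range(5000)]
    s = [sum(true_w[k] * x[n - k] for k in range(3) if n - k >= 0)
         for n in range(len(x))]
    for d1 in (0, 4):  # d1 = 0 reduces to the ordinary LMS update
        w, _ = delayed_lms(x, s, 3, 0.01, d1)
        print(d1, [round(v, 3) for v in w])
```

With d1 = 0 the routine is ordinary LMS; a nonzero d1 only postpones each gradient term, which is what frees D1 latches for pipelining the filter.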
Another application of delay relaxation is illustrated in Fig. 2(c), where pipelining by placing latches at the inputs of the system is combined with the delay relaxation. This structure is retimed further to obtain the transformed system. As mentioned before, the structure in Fig. 2(c) can also be obtained from the filtered X-LMS algorithm [28] by replacing the plant P(z) by D1 delays. The delay relaxation can be justified because, after convergence, the weights do not change much. This implies that some degradation in performance is to be expected while the filter is converging. Note that the D1 delays would be redistributed to pipeline the FIR.
Redistribution of D1 delays results in the desired signal s(n) being delayed by a certain amount D', where D' ≤ D1. In Fig. 2, D' equals the number of delays needed to pipeline FIR. In data communications applications, FIR is an equalizer and e(n) is the input to a slicer (or quantizer). In an echo-cancellation scenario, e(n) is the sum of residual echo and the received signal. This leads to an additional end-to-end delay of D' symbol periods (assuming baud-rate processing) for both cases, and this is usually not a problem. However, if the FIR is the FBF in an ADFE, any delay in its output implies that the FBF is unable to cancel D' post-cursor ISI terms and would result in a performance degradation if it is not compensated for.
B. Sum Relaxation
Even though the delay relaxation is sufficient to pipeline the FIR and part of the WUC, the weight-update loop (1a) remains to be pipelined. The computation time of (1a) is lower bounded by a single add time. In order to reduce this lower bound, we apply a D2-step look-ahead to (1a) to obtain

$$W(n) = W(n - D_2) + \mu \sum_{i=0}^{D_2 - 1} e(n - i)\, X(n - i). \tag{2}$$

Note that, due to the inter-dependence of e(n) and the weight vector, applying the look-ahead only would result in a large computational complexity. Hence, we implicitly assume that (2) will be employed along with the delay relaxation.

In (2), the summation term represents the overhead. However, instead of taking the sum of D2 terms in (2), we may retain only LA terms to get

$$W(n) = W(n - D_2) + \mu \sum_{i=0}^{LA - 1} e(n - i)\, X(n - i) \tag{3}$$

where the partial look-ahead factor LA may be either less than or equal to D2. The replacement of D2 sum terms in (2) by LA sum terms in (3) is referred to as the sum relaxation. Note that the summation in (3) can be realized by computing the product e(n)X(n) and then passing it through an FIR filter whose coefficients are all equal to unity. This FIR filter can be realized in an equivalent transpose form. In that case, the computational delay for the summation would be independent of LA. There is, however, an overhead of N(LA - 1) adders for this relaxation. Each of the relaxations can be applied individually or in combination. Therefore, unlike the conventional look-ahead technique [1], the relaxed look-ahead results in a rich variety of architectures. Depending on the application at hand, certain approximations may be more appropriate than others. In addition, by tuning the adaptation parameters (such as the step-size) the degradation in performance due to the approximations can be minimized.

The pipelined architecture resulting from the application of relaxed look-ahead can always be clocked at a higher speed than the original one. This increase in throughput (also referred to as the speed-up), which is in direct proportion to the number of pipeline stages, can be exploited in many ways. If the input sample-rate is increased to match the throughput, then an overall improvement in the absolute convergence time is possible. This will occur if the degradation in the convergence speed due to the relaxations is by a smaller factor than the speed-up. As we shall see later, this is true for all the pipelined architectures presented here. If the input sample-rate does not need to be increased, then the increase in throughput can be traded off with area, by employing systematic folding transformations [29], or with power [3].

In this section, we employ the relaxed look-ahead technique, which was described in the previous section, to develop three pipelined ADFE architectures. For the sake of simplicity and to demonstrate the technique of relaxed look-ahead, we only consider the equalization of channels with linear ISI. This, however, does not preclude the application of relaxed look-ahead to pipeline equalizers such as the RAM-DFE [30], which has been applied successfully to cancel nonlinear ISI. The channel model we consider is shown in Fig. 3, where a(n) is the channel input at time instance n, h(n) is the channel coefficient vector, η(n) is white Gaussian noise and x(n) is the received sample.
The first algorithm, referred to as PIPADFE1 (pipelined ADFE1), is an extension of the algorithm proposed in [24]. This algorithm has been derived simply for the sake of comparison with the other architectures. The second pipelined ADFE algorithm (PIPADFE2) is derived by the application of relaxed look-ahead to the ADFE in Fig. 1(a), while the third algorithm (PIPADFE3) is developed from the predictor ADFE (see Fig. 1(b)). For simplicity, the pipelined algorithms are derived assuming correct quantizer decisions.
First, we introduce some terminology to define the equations which describe the serial ADFE of Fig. 1(a).
where U_F(n) is the output of the FFF, U_B(n) is the output of the FBF, C(n) is the vector of FFF coefficients, D(n) is the vector of FBF coefficients, X(n) is the vector of received samples, $\hat{\mathbf{a}}(n)$ is the vector of detected symbols, ã(n) is the input to the quantizer Q and â(n) is the quantizer decision. In addition, let NF and NB represent the number of taps of the FFF and FBF, respectively. Note that (4f) is of the same form as (1a) and represents the familiar least mean-squared (LMS) algorithm with μ being the adaptation step-size. The vector W(n) is the combined coefficient vector defined as

$$W^{T}(n) = [\,C^{T}(n)\;\; D^{T}(n)\,]. \tag{5}$$

NO. 6, JUNE 1995
The data vector U(n) is given by

$$U^{T}(n) = [\,X^{T}(n)\;\; \hat{\mathbf{a}}^{T}(n-1)\,]. \tag{6}$$

Note that when correct decisions are made by the quantizer, then

$$\hat{a}(n) = a(n - \Delta). \tag{7}$$
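As a concrete reference point for the serial structure, the following is a minimal Python sketch of an LMS-adapted decision-feedback equalizer built from the quantities defined above (C(n), D(n), X(n), the quantizer input and the decision). It is not a transcription of the paper's equations (4a)-(4f); the binary slicer, the two-tap test channel, the training/decision-directed switch and the names `serial_adfe` and `dot` are assumptions made for illustration.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def serial_adfe(x, a_ref, nf, nb, mu, delta=0):
    """LMS-adapted DFE.

    a_ref[n] is the transmitted symbol a(n) while training, or None to run
    decision-directed. Returns (quantizer inputs, decisions, C, D).
    """
    c = [0.0] * nf            # FFF coefficients C(n)
    d = [0.0] * nb            # FBF coefficients D(n)
    past = [0.0] * nb         # FBF input: previous decisions, newest first
    soft, hard = [], []
    for n in range(len(x)):
        xn = [x[n - k] if n - k >= 0 else 0.0 for k in range(nf)]
        a_tilde = dot(c, xn) + dot(d, past)        # quantizer input
        a_hat = 1.0 if a_tilde >= 0.0 else -1.0    # binary slicer (assumed)
        known = a_ref[n - delta] if n - delta >= 0 else None
        target = known if known is not None else a_hat
        e = target - a_tilde
        # One combined LMS update, W(n) = W(n-1) + mu e(n) U(n), split per filter
        c = [ck + mu * e * xk for ck, xk in zip(c, xn)]
        d = [dk + mu * e * pk for dk, pk in zip(d, past)]
        past = ([target] + past)[:nb]
        soft.append(a_tilde)
        hard.append(a_hat)
    return soft, hard, c, d


if __name__ == "__main__":
    rng = random.Random(0)
    a = [rng.choice((-1.0, 1.0)) for _ in range(4000)]
    h = [1.0, 0.5]  # hypothetical channel with a single postcursor term
    x = [sum(h[i] * a[n - i] for i in range(len(h)) if n - i >= 0)
         + rng.gauss(0.0, 0.1) for n in range(len(a))]
    ref = a[:3000] + [None] * 1000  # train first, then decision-directed
    soft, hard, c, d = serial_adfe(x, ref, nf=3, nb=2, mu=0.01)
    print(sum(1 for n in range(3000, 4000) if hard[n] != a[n]))
```

Feeding the reference symbols into the FBF during training mirrors the "correct quantizer decisions" assumption under which the pipelined algorithms are derived.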
A. The PIPADFE1 Algorithm
The FFF and FBF can be pipelined by delaying their inputs by D1 samples and then applying the form of delay transfer relaxation shown in Fig. 2(c). We shall see later that introducing D1 delays in the DFL results in a substantial performance degradation because the FBF cannot employ past decisions to cancel the D1 most-significant ISI terms. This fact has been observed in [24], where D1 = 2 delays were introduced in the DFL. These steps result in (4a) being modified to
while (4f) is modified to
where the new data vector U1(n) is defined as
with â(n) = a(n - D1 - Δ) for correct quantizer decisions. Applying sum relaxation to (9) results in
Finally, we approximate (4b) by employing (11) as follows

with U1(n) defined as in (10). In (12), we have assumed that the step-size μ is sufficiently small. In a similar fashion, we can approximate (4c) as

(13)

These steps result in an algorithm which is a generalization of the algorithm presented in [24]. The equations describing ...

The hardware overhead for PIPADFE1 consists of the pipelining latches and (LA - 1)(NF + NB) adders. In order to demonstrate the increase in throughput due to pipelining, we consider the serial ADFE in Fig. ...

[Figure: Pipelining example: (a) Serial ADFE; (b) PIPADFE1 with a speedup ...]

B. The PIPADFE2 Algorithm

... where di(n)'s are the FBF coefficients and NB is its order. In order to apply look-ahead, we can either linearize the DFL or employ the technique in [11]. In order to minimize the hardware, we choose the former approach. Note that the linearization of the DFL is simply an intermediate step in the process of developing PIPADFE2. We assume 1) di(n)'s and ci(n)'s vary slowly, and 2) the quantization error is small, i.e., â(n) ≈ ã(n). Therefore, (15) can be approximated as

[Figure: Fig. 6. PIPADFE2 architecture.]
Note that we have implicitly assumed that NB - 1 > D1. We will now apply look-ahead to (16) and then approximate the resulting expression. Employing (16) itself to substitute for
In (17), the variable ã(n - 1) has been eliminated. Similarly, we can carry on this process by substituting for ã(n - i - i1 - 2) (in (17)) in the summation enclosed within braces.
Repeating the above substitution process until the variables ã(n - 1) to ã(n - D1) are eliminated, we get
where A represents the computations of the data preprocessing section (PP) defined as shown below

... expressions. In order to guide us in making these approximations, ...

The appearance of PP is in accordance with the well-known fact that the introduction of delays in the DFL necessitates the use of a higher order FFF to compensate for the ISI terms not canceled by the FBF. This is interesting in the light of the fact that only an algorithm transformation technique, i.e., look-ahead, was employed to arrive at this conclusion. This also provides some justification for linearizing the DFL in the first place. The PP derives all its coefficients (except the one which operates on the most recent input) from those of the FBF of the serial ADFE. In addition, the coefficients of the FBF which appear in PP are di, 0 ≤ i < D1. Hence, these D1 coefficients of the FBF are required to cancel the D1 most significant ISI terms as in the serial case.

If the impulse response of the combined channel and FFF is such that the magnitude of all the ISI terms (and therefore the FBF coefficients) is less than unity, then the contribution from the third and higher terms in (19) becomes negligible. If the combined channel and FFF impulse response has terms greater than unity, then the contribution from the third and higher terms in (19) can be accounted for (in the adaptive case) by increasing the number of terms in the second summation.

This implies that D1 delays have been introduced at the input to the FBF.

5) All the FBF coefficients of the serial ADFE appear in B.

In the light of points 2) and 3) above, we approximate A as follows

Even with this seemingly crude approximation, it will be shown via simulations that PIPADFE2 converges twice as fast as PIPADFE1. Clearly, the performance of PIPADFE2 can be enhanced further by having more sophisticated approximations (at the expense of increased hardware) in place of (21) and (22).
The pipelined DFL computation for PIPADFE2 can be written as
where we have replaced the quantizer in the DFL.
Next, we delay the input to the PP (and the training input) by D1 samples and then apply the delay transfer relaxation first to PP and then to FFF. This results in the presence of D1 delays at the output of FFF. In a similar fashion, we apply the delay relaxation of Fig. 2(c) to FBF to transfer D1 to its output. Finally, employing the sum relaxation as in the case of PIPADFE1, we obtain the following equations which describe PIPADFE2

where ... = W(n - D2)

and â(n) = a(n - D1 - Δ) for correct quantizer decisions.

[Figure residue: plot legend "*** serial ADFE, +++ PIPADFE1"; horizontal axis "Distance from cursor".]
The architecture for PIPADFE2 is shown in Fig. 6. The hardware overhead for PIPADFE2 consists of D1 multipliers due to the PP, and (LA - 1)(NF + NB) adders (if the weight-update loop is also pipelined). This, however, does not reduce the clock frequency because the D1 latches can be employed to pipeline PP as well. Note that it is possible to apply the delay relaxation to PP and FFF with a higher value of D1, say D1', while keeping the number of pipelining latches in the DFL at a constant value of D1.
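The (LA - 1)(NF + NB) adder count can be traced back to the sum relaxation of Section II-B, where the LA-term sum is realized as a unity-coefficient FIR filter in transpose form. The sketch below, with hypothetical function names, models that filter for one coefficient and checks it against a direct sliding sum; the value LA = 2 in the usage line is an arbitrary illustration, while NF = 13 and NB = 10 are the tap counts used for PIPADFE1 and PIPADFE2 in Section VI.

```python
def moving_sum_transpose(samples, la):
    """Sum of the last `la` samples via a transpose-form FIR with unity taps.

    Uses la-1 registers and la-1 adders; each output needs only one addition
    after the registers, so the critical path does not grow with la.
    """
    regs = [0] * (la - 1)
    out = []
    for g in samples:
        out.append(g + (regs[0] if regs else 0))
        for k in range(len(regs)):
            # regs[k+1] still holds its previous-cycle value at this point
            upstream = regs[k + 1] if k + 1 < len(regs) else 0
            regs[k] = g + upstream
    return out


def moving_sum_direct(samples, la):
    """Reference: sum over i = 0..la-1 of samples[n - i], truncated at the start."""
    return [sum(samples[max(0, n - la + 1):n + 1]) for n in range(len(samples))]


def adder_overhead(la, nf, nb):
    """Extra adders when every one of the nf + nb coefficients gets such a filter."""
    return (la - 1) * (nf + nb)


if __name__ == "__main__":
    g = list(range(1, 11))  # stand-in for one component of e(n) X(n)
    print(moving_sum_transpose(g, 3) == moving_sum_direct(g, 3))
    print(adder_overhead(2, 13, 10))
```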
C. The PIPADFE3 Algorithm
In this section, we pipeline the serial predictor ADFE in Fig. 1(b), which is described by the following expressions
where E^T(n) = [e(n), e(n - 1), ..., e(n - NB + 1)] and â(n) = a(n - Δ) for correct quantizer decisions. From (26a), (26c), and (26f), we see that the FFF adapts independently of the FBF. This feature of the predictor ADFE allows us to pipeline it to much higher levels than the conventional ADFE. First, we transform (26b) into
which can be derived by substituting for ã(n) (from (26c)) and â(n) (from (26d)) in (26b). We shall now pipeline the predictor ADFE by applying the delay relaxation and the sum relaxation. First, we apply the delay relaxation (see Fig. 2(b)) to the FFF weight-update (26a) to get
where ak(n) = D^T(n - 1)E(n - D1 - 1). Note that this step has already introduced D1 latches in the DFL because e(n) is present in it. In addition, (26b) has also been modified to (28b) as e(n) is an input to the FBF. Next, we apply the sum relaxation to (28) to obtain
Finally, we delay the input to the FFF by D3 delays, where

[Equations describing PIPADFE3: only a summation limit of LA - 1 and a term of the form D(n - D2) are legible.]

with â(n) = a(n - D3 - Δ) for correct quantizer decisions. In Fig. 1(b), we can easily confirm that the critical path consists of the FBF, adder, quantizer, adder and WUC. From ...
IV. PERFORMANCE ANALYSIS
In this section, we will analyze and compare, in qualitative terms, the performance of the serial ADFE, PIPADFE1, PIPADFE2, and PIPADFE3. To do this we have simulated the performance of the equalizers for a magnetic recording channel with a channel SNR of 20 dB. The channel coefficients ([0.2, 0.6, 1.0, -1.0, -0.6, -0.2]) were obtained from a Lorentzian pulse model [31] with the symbol period being one-half of the width of the channel step response pulse at a height of 50% of the maximum.
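A test signal for this channel can be generated along the following lines. This is a sketch under stated assumptions: the input symbols are taken to be independent and binary (+1/-1), which the text does not specify, and the channel SNR is taken as the ratio of channel output power to noise power, as defined in Section VI. The function name `simulate_channel` and the seed are hypothetical.

```python
import math
import random

# Channel coefficients quoted in the text (Lorentzian pulse model)
H = [0.2, 0.6, 1.0, -1.0, -0.6, -0.2]


def simulate_channel(num_symbols, snr_db, h=H, seed=0):
    """Return (symbols a, received samples x, noise standard deviation)."""
    rng = random.Random(seed)
    a = [rng.choice((-1.0, 1.0)) for _ in range(num_symbols)]  # assumed binary input
    # For independent unit-power symbols, the channel output power is sum(h_i^2)
    signal_power = sum(c * c for c in h)
    noise_std = math.sqrt(signal_power / 10 ** (snr_db / 10.0))
    x = []
    for n in range(num_symbols):
        clean = sum(h[i] * a[n - i] for i in range(len(h)) if n - i >= 0)
        x.append(clean + rng.gauss(0.0, noise_std))
    return a, x, noise_std


if __name__ == "__main__":
    a, x, noise_std = simulate_channel(10000, 20.0)
    print(round(noise_std, 4))
```

Under these assumptions the channel output power is 2.8, so a 20 dB channel SNR corresponds to a noise variance of 0.028.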
The FFF in all the equalizers attempts to cancel the precursor ISI. To see this, we plot (see Fig. 8(a)) the pulse response of the combined channel and FFF for a serial ADFE. The nonzero postcursor ISI is canceled by the FBF coefficients.
In the case of PIPADFE1 (and PIPADFE2), the FBF output is delayed by D1 samples. Therefore, the FBF cannot cancel the first D1 postcursor ISI terms and the burden of canceling them falls on the FFF. This is also indicated in Fig. 8(a), where the combined channel and FFF pulse response for PIPADFE1 with D1 = 2 is shown. Clearly, as D1 increases the performance of PIPADFE1 degrades and approaches that of a linear equalizer.
From the discussion above, it is clear that the performance of a pipelined ADFE algorithm can be improved substantially (especially for large D1) if the FFF does not have to cancel the postcursor ISI. This is exactly what PIPADFE2 achieves. In Fig. 8(b), we show the combined channel and FFF response for PIPADFE2 with D1 = 2. Just as in the case of the serial ADFE, the FFF in PIPADFE2 only cancels the precursor ISI. However, in this case the first D1 postcursor ISI terms are canceled by the PP, whose coefficients are derived from those of the FBF. This can be confirmed by plotting the pulse response of the system consisting of the channel, PP and FFF. Finally, the remaining postcursor ISI terms are canceled by the FBF.
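The division of labor described above can be made concrete with a small bookkeeping sketch: convolve the channel with the FFF, pick a cursor, and see which postcursor terms an FBF whose output is delayed by D1 samples can still reach. The helper names are hypothetical, and the identity FFF (a single unit tap) and the choice of the 1.0 tap as cursor are simplifying assumptions for illustration only; a trained FFF would reshape the response.

```python
def convolve(h, c):
    """Combined pulse response of channel h followed by FFF taps c."""
    out = [0.0] * (len(h) + len(c) - 1)
    for i, hv in enumerate(h):
        for j, cv in enumerate(c):
            out[i + j] += hv * cv
    return out


def split_isi(p, cursor):
    """Split a pulse response into (precursor terms, cursor, postcursor terms)."""
    return p[:cursor], p[cursor], p[cursor + 1:]


def fbf_coverage(postcursor, d1, nb):
    """Postcursor terms the FBF cannot reach (first d1) and those it can cancel.

    Models an nb-tap FBF whose output is delayed by d1 samples.
    """
    return postcursor[:d1], postcursor[d1:d1 + nb]


if __name__ == "__main__":
    h = [0.2, 0.6, 1.0, -1.0, -0.6, -0.2]  # channel from the text
    p = convolve(h, [1.0])                  # identity FFF (assumption)
    pre, cur, post = split_isi(p, 2)        # cursor at the 1.0 tap (assumption)
    for d1 in (0, 2):
        missed, cancelled = fbf_coverage(post, d1, nb=10)
        print(d1, missed, cancelled)
```

With D1 = 0 (the serial case) every postcursor term is within reach of the FBF; with D1 = 2 the two largest postcursor terms are not, which is the burden that falls on the FFF in PIPADFE1 and on the PP in PIPADFE2.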
Even though it can be shown [32] that the predictor ADFE (see Fig. 1(b)) is equivalent to the conventional ADFE (see Fig. 1(a)) when the number of taps is infinite, in actual practice the predictor ADFE may perform worse than the conventional form. This is because in the predictor form the FFF and the FBF minimize two different error signals. Our interest, in this paper, is in comparing the performance of the serial predictor ADFE and PIPADFE3. However, it has been shown in [33] that for applications such as 1.544 Mbps asymmetric digital subscriber loop (ADSL), the predictor form does, in fact, perform better than the conventional ADFE. In the predictor form, the FFF adapts independently of the FBF and therefore we were able to pipeline it by employing the delay relaxations and the sum relaxation. Thus, we should expect any performance degradation due to pipelining to be similar to that of the pipelined LMS [16]. In [16], it was found that these relaxations result in a minimal loss in performance even for very high speed-ups. As we shall see later, via convergence analysis and simulations, the same holds true for PIPADFE3.
In Section VI, we will further confirm the conclusions of this section by comparing the performance of PIPADFE1, PIPADFE2 and PIPADFE3 as D1 increases.
V. CONVERGENCE ANALYSIS
In this section, we will analyze the convergence behavior of PIPADFE1, PIPADFE2, and PIPADFE3. In particular, the analytical expressions for the bounds on μ for convergence in the mean-squared sense and for the adaptation accuracy (in terms of the misadjustment M) are provided. The misadjustment M [34] is defined as

$$M = \frac{\varepsilon(\infty) - \varepsilon_{\min}}{\varepsilon_{\min}} \tag{31}$$

where ε(n) is the average value of the mean-squared error at time instance n and ε_min is the minimum mean-squared error.
The analysis in this section is based on the results of the convergence analysis of the pipelined LMS algorithm in [16]. For mathematical tractability, we assume that the quantizer decisions are correct, and that the assumptions in the independence theory [35] are applicable. In addition, we only consider special cases of PIPADFE1 ...
The details of the derivation are provided in the Appendix, and only the results are presented in this section.

A. PIPADFE1
The bound on μ for convergence in the mean-squared sense is given by

The denominator and the numerator in (32) are quadratic and linear functions, respectively, of K. This clearly indicates that as K or D1 (with fixed D2) increases, the upper bound on μ for convergence in the mean-squared sense is lowered. This is not a drawback of PIPADFE1 because practical values of μ are much smaller than this upper bound.

The misadjustment for PIPADFE1 is given by

Note that for very small values of μ, the third term in the denominator is negligible as compared to the first two terms. Hence, as K increases, the misadjustment of PIPADFE1 would increase. However, in practice, this increase in the misadjustment is negligible as seen in [16].

B. PIPADFE2

The bounds on μ for PIPADFE2 and the misadjustment are also given by (32) and (34), respectively, with

where δ(n) is a time-varying function due to the presence of PP and R is the autocorrelation matrix of the data vector U4(n) defined in (A20).

Further analysis of PIPADFE2 is made difficult by the fact that it is not easy to characterize δ(n), which depends on the FBF coefficients and X(n).

C. PIPADFE3

Analysis of PIPADFE3 is made convenient by the fact that the FFF adapts independently of the FBF (see (30)). ... (36a) and ... is given by

where N is the number of channel coefficients.
The misadjustment for FFF is given by
The bounds on μ for the FBF are also given by (32) with

where R' is the autocorrelation matrix of the vector E'^T(n) = [e(n - D1 + D2 - 1), e(n - D1 + D2 - 2), ..., e(n - D1 + D2 - NB)] and the misadjustment is given by
Note the dependence of (39a) and (39b) on the misadjustment of the FFF. This is to be expected because the input to the FBF is the error signal generated by the FFF.
VI. SIMULATION RESULTS
In this section, we present extensive simulation results in order to compare the performance of the pipelined algorithms with each other and with the serial ADFE. All simulations for a magnetic recording channel have been performed with a Lorentzian model, whose coefficients are defined in Section IV. For PIPADFE1 and PIPADFE2, values of NF = 13 and NB = 10 were chosen, while the corresponding values for PIPADFE3 were NF = 20 and NB = 10. This choice of NF and NB was made in order to get a positive noise margin at the slicer for a nominal channel SNR, where the channel SNR is defined as the ratio of channel output power to the noise power. The nominal channel SNR for storage channels was found to be about 22 dB [36] and an output SNR (i.e., SNR at the slicer) of 16 dB was required for a byte error rate of 10^-7 or less. Hence, we take this value of output SNR as the lower limit of acceptable performance.
The first four simulations study the effect of pipelining on the convergence speed (Experiment A), output SNR (Experiment B), performance in the presence of channel nonstationarity (Experiment C) and the effectiveness of sum relaxation (Experiment D). In the fifth simulation, we consider an infinite impulse response (IIR) channel [22] and study the performance of PIPADFE1, PIPADFE2 and PIPADFE3.
A. Convergence Speed
In order to study the effect of pipelining on the convergence speed, the step-sizes were chosen such that an output SNR of 20 dB is achieved. In Fig. 9(a), we plot the mean-squared error (MSE) curve for PIPADFE1. Note that as D1 increases, the algorithm takes longer to converge. For D1 = 7, PIPADFE1 takes about 3.5 times longer than the serial ADFE. In the case of PIPADFE2 (see Fig. 9(b)), the degradation in convergence speed is reduced. In fact, PIPADFE2 achieves the same output SNR as PIPADFE1 in about half the number of samples.
With PIPADFE3 (see Fig. 9(c)), the degradation in convergence speed is negligible even with D1 = D3 = 22. Note that this value of D1 and D3 corresponds to a speed-up of 45. As mentioned before (in Section IV), the performance degradation of PIPADFE3 is expected to be similar to that of the pipelined LMS [16] algorithm. This is now verified. Finally, in Fig. 10, we plot the convergence time-constant (i.e., the number of samples required to converge) for different speed-ups. It is clear that PIPADFE2 is twice as fast as PIPADFE1 for speed-ups greater than 3. In addition, PIPADFE3 can achieve very high speed-ups with negligible degradation in convergence speed.
B. Output SNR
The purpose of this experiment is to determine how the performance of PIPADFE1, PIPADFE2 and PIPADFE3 degrades as the speed-up is increased. The step-sizes for the simulations in this sub-section were chosen such that the convergence time-constant at different speed-ups is the same. In the case of PIPADFE1, this time-constant was 1700 samples, while that for PIPADFE2 was 900. From Fig. 11, it can be seen that both PIPADFE1 and PIPADFE2 have an SNR loss of about 0.6 dB per unit increase in the speed-up. The maximum achievable speed-up depends upon the convergence time that is acceptable. For the time-constants mentioned above, both PIPADFE1 and PIPADFE2 can achieve speed-ups of up to 8 with a 4 dB noise margin. From Fig. 11, we can also see that the degradation in output SNR for PIPADFE3 is less than 0.1 dB for speed-ups of at least 45. The convergence time-constant for PIPADFE3 was kept fixed at 800. As PIPADFE3 can be pipelined at very high speeds, we can afford to have a higher number of taps for the FBF in order to improve its SNR performance and yet achieve significant speed-ups.
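As a back-of-the-envelope aid, the "0.6 dB per unit increase in the speed-up" trend can be treated as a linear model. The sketch below is only that: it assumes the loss grows linearly from a speed-up of 1, which is an extrapolation of the quoted slope rather than a result from Fig. 11, and the function name is hypothetical.

```python
def snr_loss_db(speed_up, loss_per_unit_db):
    """Linear SNR-loss model: loss accumulated relative to a speed-up of 1.

    Assumption: the loss is exactly proportional to (speed_up - 1).
    """
    if speed_up < 1:
        raise ValueError("speed-up must be at least 1")
    return loss_per_unit_db * (speed_up - 1)


# Conventional ADFE (PIPADFE1/PIPADFE2): about 0.6 dB per unit speed-up.
loss_at_8 = snr_loss_db(8, 0.6)   # roughly 4.2 dB relative to the serial case
loss_at_1 = snr_loss_db(1, 0.6)   # no loss at unity speed-up
```

Under this model, going from the serial algorithm to a speed-up of 8 costs roughly 4 dB, which is of the same order as the noise margin discussed above.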
C. Nonstationarity
To study the performance of the algorithms in the presence of channel nonstationarity, we model a time-varying channel [20] as follows
where h(n) is the vector of channel coefficients at time index n and I(n) is a vector of white Gaussian variables with standard deviation q.
In Fig. 12(a), we show the performance of PIPADFE1 and PIPADFE2 with D1 = 4 as a function of q. The step-sizes were chosen to be the same as those in Experiment B. It is clear that as the channel nonstationarity increases, the output SNR of PIPADFE2 and PIPADFE1 degrades in the same fashion as that of the serial ADFE. This is also true for PIPADFE3 (see Fig. 12(b)). However, the drop in SNR for PIPADFE3 is about half that of PIPADFE1 and PIPADFE2 for the same degree of channel nonstationarity.
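One common way to realize such a time-varying channel in simulation is a random walk on the coefficient vector. The sketch below assumes that form, h(n) = h(n-1) + (white Gaussian vector with standard deviation q). This is an assumption made for illustration and may differ from the model of [20]. The function name, the initial coefficients and the seed are likewise hypothetical.

```python
import numpy as np


def random_walk_channel(h0, q, n_steps, seed=0):
    """Return an (n_steps + 1, len(h0)) array of channel vectors.

    Assumed model (not confirmed by the text): each coefficient performs an
    independent random walk whose increments are zero-mean white Gaussian
    with standard deviation q.
    """
    rng = np.random.default_rng(seed)
    h0 = np.asarray(h0, dtype=float)
    steps = rng.normal(0.0, q, size=(n_steps, h0.size))
    # The cumulative sum of the increments gives the drift away from h0.
    drift = np.vstack([np.zeros(h0.size), np.cumsum(steps, axis=0)])
    return h0 + drift


# q = 0 reproduces a stationary channel; larger q means faster variation.
h_static = random_walk_channel([1.0, 0.5, 0.2], q=0.0, n_steps=100)
h_moving = random_walk_channel([1.0, 0.5, 0.2], q=0.01, n_steps=100)
```

Sweeping q in such a model is one way to generate curves of output SNR against the degree of nonstationarity.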
D. Sum Relaxation
The sum relaxation, defined in (3), was employed to pipeline the weight-update equations for PIPADFE1 (14d), PIPADFE2 (24d), and PIPADFE3 (30). In this experiment, we show the effect of sum relaxation in improving the convergence speed for a given value of D2 (the pipelining level of the weight-update loop) and a given output SNR. All simulations were done with the conventional ADFE with D1 = 0.
In Fig. 13, the MSE convergence plot for D2 = 1 and D2 = 4 is shown. With D2 = 1 and LA = 0, which is essentially the serial ADFE, a convergence time-constant of about 600 samples is obtained. When D2 is increased to 4 and LA = 0, the convergence time-constant is also increased by a factor of 4 to about 2400. In general, we find that (with the same step-size) the convergence time-constant is D2 times that of the serial ADFE. It is clear that any attempt to improve the convergence speed by increasing the step-size would result in a worse output SNR. However, by employing sum relaxation with LA = 3, the convergence time-constant is made equal to that of the serial ADFE without any loss in the output SNR. This clearly implies that very fine pipelining of the weight-update loop is possible.
Recall that D1 = 0 implies the equivalence of the serial ADFE, PIPADFE1 and PIPADFE2. Hence, the conclusions of
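The qualitative effect described above can be reproduced on a much simpler problem. The sketch below is not the ADFE: it runs a noise-free LMS system identification with a weight-update loop delayed by `d2` samples, and sums the `n_terms` most recent gradient terms in each update. The update form W(n) = W(n - d2) + mu * (sum of recent e X terms) is an assumed reading of sum relaxation, since (3) is defined outside this section. The parameter `n_terms`, the channel `h`, the step-size and the threshold are all hypothetical, and no mapping between `n_terms` and LA is implied.

```python
import numpy as np


def pipelined_lms(x, d, n_taps, mu, d2=1, n_terms=1):
    """LMS with a d2-sample delay in the weight-update loop.

    Assumed update: W(k) = W(k - d2) + mu * sum_{i < n_terms} e(k-i) X(k-i),
    with e(k) = d(k) - W(k - d2)^T X(k). d2 = 1, n_terms = 1 is serial LMS.
    """
    n = len(x)
    # Row k of X holds the tap vector [x(k), x(k-1), ..., x(k-n_taps+1)].
    X = np.zeros((n, n_taps))
    for j in range(n_taps):
        X[j:, j] = x[:n - j]
    W = np.zeros((n, n_taps))
    e = np.zeros(n)
    for k in range(n):
        w_prev = W[k - d2] if k >= d2 else np.zeros(n_taps)
        e[k] = d[k] - w_prev @ X[k]
        lo = max(0, k - n_terms + 1)
        # Sum of the most recent gradient terms e(k-i) X(k-i).
        W[k] = w_prev + mu * (e[lo:k + 1] @ X[lo:k + 1])
    return W, e


def samples_to_converge(e, threshold=1e-3):
    """Index after which |e| stays below the threshold."""
    above = np.nonzero(np.abs(e) >= threshold)[0]
    return 0 if above.size == 0 else int(above[-1]) + 1


rng = np.random.default_rng(1)
x = rng.choice([-1.0, 1.0], size=8000)      # uncorrelated +/-1 input
h = np.array([1.0, 0.5, -0.3, 0.1])         # illustrative unknown system
d = np.convolve(x, h)[:len(x)]              # noise-free desired signal

W_serial, e_serial = pipelined_lms(x, d, 4, 0.01, d2=1, n_terms=1)
W_delay, e_delay = pipelined_lms(x, d, 4, 0.01, d2=4, n_terms=1)
W_relax, e_relax = pipelined_lms(x, d, 4, 0.01, d2=4, n_terms=4)

t_serial = samples_to_converge(e_serial)
t_delay = samples_to_converge(e_delay)
t_relax = samples_to_converge(e_relax)
```

In this toy setting, the delayed loop without summation needs several times more samples than the serial filter, while summing d2 gradient terms brings the sample count back to the same order as the serial case, in line with the trend reported for Fig. 13.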
E. IIR Channel
In this experiment we consider the channel to be of the form [22],
where H(z) is the z-transform of h(n). The values of NF = 10 and NB = 7 were chosen for PIPADFE1 and PIPADFE2. For PIPADFE3, the corresponding values were NF = 13 and NB = 10. The additive noise power was scaled such that the channel SNR was identical to that in Experiments A and B.
In Fig. 14, we have plotted the convergence time-constants for PIPADFE1, PIPADFE2, and PIPADFE3 with respect to the speed-up for a constant output SNR of 20 dB. As in the case of the magnetic recording channel (see Fig. 10), we find that PIPADFE2 converges faster than PIPADFE1 as the speed-up increases. In addition, the convergence time-constant for PIPADFE3 increases linearly by approximately 54 samples per unit increase in speed-up. However, unlike in the case of a magnetic recording channel, PIPADFE3 always converges faster than either PIPADFE1 or PIPADFE2, except in the case where the speed-up is unity.
Next, we plot (see Fig. 15) the output SNR with respect to speed-up for a given convergence time-constant (3000 samples in this case). As can be seen in Fig. 15, PIPADFE3 has less than 1 dB loss in performance for speed-ups of up to 45. The performance of PIPADFE1 and PIPADFE2 is more or less identical, as was the case for the magnetic recording channel. The loss in performance is less than 3 dB for speed-ups of up to 8. From Figs. 14 and 15, we conclude that not only does PIPADFE3 converge faster than PIPADFE1 and PIPADFE2, but that it also has a better output SNR for the same speed-up.
VII. CONCLUSIONS
The technique of relaxed look-ahead [15] was employed to develop two fine-grain pipelined ADFE algorithms for high-speed equalization applications. These algorithms are attractive from an implementation point of view due to their low hardware requirements as compared to the existing parallel processing schemes. Speed-ups of up to 8 (for the conventional ADFE) and 45 (for the predictor ADFE) have been demonstrated for the equalization of a magnetic storage channel.
This work is a part of our research on the development of inherently pipelinable DSP algorithms, which are hardware efficient. Work in fixed-coefficient, inherently concurrent direct-form and lattice recursive digital filters [37] has also been successful. The pipelined algorithms presented in this paper can be improved further by incorporating improved relaxations, especially in the case of PIPADFE2. In addition, combining coding with equalization, as was done in [38], could also be exploited for the development of better pipelined equalization algorithms.
APPENDIX CONVERGENCE ANALYSIS
We employ the following results from The bounds on μ for convergence in the mean-squared sense are given by
and the misadjustment is given by
In the analysis to follow, we assume that the quantizer decisions are always correct and the channel input a(·) is an uncorrelated ±1 sequence. Therefore, the results of the analysis are applicable to the training mode and also to the decision-directed mode with correct quantizer decisions. In all cases, we will first recast the pipelined algorithm in the form of (A1) and then analyze the autocorrelation matrix of the resulting data vector. In particular, we will attempt to derive an analytical expression for the sum of the eigenvalues of the data correlation matrix. In order to do this we employ the well-known fact that $\sum_i \lambda_i = \mathrm{tr}[R]$, where the $\lambda_i$ are the eigenvalues of R and tr[·] is the matrix trace operator.
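The trace identity used throughout the analysis is easy to check numerically. The sketch below estimates the autocorrelation matrix of a data vector formed from an uncorrelated ±1 sequence passed through a short channel, and compares the sum of its eigenvalues with its trace. The three-tap channel `h`, the vector length and the sample count are illustrative assumptions and are not the Lorentzian model of Section IV.

```python
import numpy as np

rng = np.random.default_rng(0)
n_taps, n_samples = 5, 20000

# Uncorrelated +/-1 channel input passed through a hypothetical channel.
a = rng.choice([-1.0, 1.0], size=n_samples)
h = np.array([0.2, 1.0, 0.4])
x = np.convolve(a, h)[:n_samples]

# Rows are data vectors X(n) = [x(n), x(n-1), ..., x(n - n_taps + 1)].
X = np.stack(
    [x[n_taps - 1 - j: n_samples - j] for j in range(n_taps)], axis=1
)
R = X.T @ X / X.shape[0]              # sample autocorrelation matrix

eigenvalues = np.linalg.eigvalsh(R)   # R is symmetric, so eigvalsh applies
trace_R = np.trace(R)
eigenvalue_sum = eigenvalues.sum()
```

Here `eigenvalue_sum` and `trace_R` agree to rounding error. For this illustrative channel, both should also be close to `n_taps` times the channel energy, since the input is uncorrelated with unit power. The practical point is that the trace can be evaluated directly from the diagonal of R without an eigendecomposition.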
A. PIPADFE1
It is easy to show that with D1 = KD2, LA = 1 and
Thus, the bounds on μ for convergence in the mean-squared sense are given by where S(n) is a time-varying term due to the presence of FBF coefficients in (A20). Note that we have employed (A13) to substitute for tr[R11] in (A22
C. PIPADFE3
The PIPADFE3 equations can be written as follows
. E'(n - D3 - 1)
where D13 = D1 + D3 = KD2 and E'^T(·) = [e'(n - D1 - 1), e'(n - D1 - 2), ..., e'(n - D1 - NB)].
From (A26a) and (A26b), it can be seen that the FFF adapts independently of the FBF and that this adaptation is of the same form as the pipelined LMS filter (A1). Hence, we first analyze the FFF and then the FBF.
As the input to the FFF is X(n), the trace of the autocorrelation matrix of the data vector for the FFF is known where S = tr[HH^T HH^T]. Note that (A14) has been employed to substitute for the second term in (A27). Observing that HH^T HH^T is a symmetric matrix, we compute the sum of the squares of all the entries in HH^T to obtain S. Furthermore, HH^T itself is a symmetric matrix and from the definition of H in (A12), we get the following expression for the elements of HH^T along the ith diagonal
The trace of HH^T HH^T can now be written as where R11 is the autocorrelation of the data in the FFF and P1 is the crosscorrelation between the desired signal a(n - Δ) and the data X(n). Employing (A11) to substitute for X(n), we may evaluate P1 as follows
and the misadjustment is given by which are the desired equations.
