Abstract-Among existing high-speed pipelined adaptive decision feedback equalizers (ADFEs), the pipelined ADFE based on the relaxed look-ahead technique yields substantial hardware savings compared with the parallel-processing or look-ahead approaches. However, it suffers from both signal-to-noise ratio (SNR) degradation and a slow convergence rate. In this paper, we employ the predictive parallel branch slicer (PPBS) to eliminate the dependencies between the present and past decisions so as to reduce the iteration bound of the decision feedback loop of the ADFE. With negligible additional hardware complexity, the proposed architecture improves the output mean-square error (MSE) of the ADFE compared with the relaxed look-ahead ADFE architecture. Moreover, we show the superior performance of the proposed pipelined ADFE through theoretical derivations and computer simulation results. A VLSI design example using the Avant! 0.35-μm CMOS standard cell library is also presented. The post-layout simulation results show that the PPBS scheme requires only a 38.4% gate-count overhead while reducing the critical path from 7.06 to 4.69 ns, which suits very high-speed data transmission systems.
PPBS: Predictive parallel branch slicer.
PPBS-ADFE: ADFE based on the PPBS scheme.
SNR: Signal-to-noise ratio.
UTP-CAT5: Unshielded twisted pair category 5.
WUB: Weight-update block for the FBF.
WUF: Weight-update block for the FFF.
I. INTRODUCTION
ADAPTIVE decision feedback equalizer (ADFE) using the least mean-square (LMS) algorithm is a well-known equalization technique for magnetic storage and digital communication. The basic block diagram of the traditional ADFE is depicted in Fig. 1, where the ADFE is composed of two main FIR filters: the feedforward filter (FFF) and the feedback filter (FBF). The outputs of both filters are added together and fed into a slicer, whose output is the final equalized data. The FFF and FBF cancel the precursor and postcursor intersymbol interference (ISI), respectively. The WUC and WUD blocks in the figure are the weight-update blocks for the FFF and FBF, respectively. The detailed ADFE algorithm and structure are discussed extensively in [1]-[3].
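The structure described above can be sketched in a few lines of Python. This is a minimal decision-directed LMS ADFE for BPSK under stated assumptions: the sign convention for the error, the FFF initialization, the two-tap test channel, the noise level, and all function names are chosen for illustration and are not taken from Fig. 1.

```python
import random

def slicer(x):
    """BPSK decision device: map a real value to +1 or -1."""
    return 1.0 if x >= 0.0 else -1.0

def lms_adfe(received, n_fff, n_fbf, mu):
    """Run a decision-directed LMS ADFE and return (decisions, errors)."""
    w = [0.0] * n_fff            # FFF coefficients
    w[0] = 1.0                   # simple initialization (assumption)
    b = [0.0] * n_fbf            # FBF coefficients
    y_buf = [0.0] * n_fff        # most recent received samples
    d_buf = [0.0] * n_fbf        # most recent past decisions
    decisions, errors = [], []
    for y in received:
        y_buf = [y] + y_buf[:-1]
        # Slicer input: FFF output minus FBF output
        z = sum(wi * yi for wi, yi in zip(w, y_buf)) \
            - sum(bi * di for bi, di in zip(b, d_buf))
        d = slicer(z)
        e = d - z                # estimation error (assumed sign convention)
        # LMS weight updates (the two weight-update blocks)
        w = [wi + mu * e * yi for wi, yi in zip(w, y_buf)]
        b = [bi - mu * e * di for bi, di in zip(b, d_buf)]
        d_buf = [d] + d_buf[:-1]
        decisions.append(d)
        errors.append(e)
    return decisions, errors

def bpsk_through_channel(n, channel, noise_std, seed=0):
    """Generate BPSK symbols and pass them through an FIR channel plus noise."""
    rng = random.Random(seed)
    symbols = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    received = []
    for k in range(n):
        s = sum(h * symbols[k - i] for i, h in enumerate(channel) if k - i >= 0)
        received.append(s + rng.gauss(0.0, noise_std))
    return symbols, received

# Hypothetical channel with one postcursor ISI term
symbols, received = bpsk_through_channel(3000, [1.0, 0.4], 0.05)
decisions, errors = lms_adfe(received, n_fff=3, n_fbf=2, mu=0.01)
tail_mse = sum(e * e for e in errors[-500:]) / 500
print(f"MSE over the last 500 symbols: {tail_mse:.4f}")
```

Note that the slicer input at each step depends on the decision made one step earlier; this dependence is the decision feedback loop that limits pipelining.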
Fine-grain pipelining of the ADFE is known to be a difficult problem for high-speed applications because of the decision feedback loop (DFL). According to the iteration bound [3], the smallest clock period of the ADFE is bounded by the DFL. Several approaches have been proposed to solve this problem. For example, the ADFE can be pipelined by precomputing all possible values in the DFL so as to open the loop [4]. However, this parallel approach results in a large hardware overhead, as it transforms a serial algorithm into an equivalent (in the sense of input-output behavior) pipelined algorithm. Another algorithm, referred to as PIPEADFE1, is proposed in [5]. It maintains the functionality rather than the input-output behavior by using the relaxed look-ahead technique. Although the hardware overhead of this algorithm is small, it suffers from performance degradation in output signal-to-noise ratio (SNR) and convergence rate. Nevertheless, from a VLSI implementation point of view, the second approach is suitable for low-cost VLSI designs.
PIPEADFE1 uses the feedforward filter (FFF) to cancel the first postcursor ISI terms. However, we observe that it is not necessary to force the first taps of the feedback filter (FBF) to zero. In this paper, we employ the FFF to force the first coefficients of the FBF to more appropriate fixed coefficients instead of zeros. Through our derivation, we propose a predictive parallel branch slicer (PPBS) scheme. It eliminates the dependencies between the present and past decisions in order to reduce the iteration bound of the DFL. By doing so, we can significantly improve the output MSE at the slicer. In addition, the MSE performance of the proposed PPBS architecture is analyzed mathematically. The theoretical and computer simulation results show that the output MSE of the proposed architecture is superior to that of PIPEADFE1. The VLSI implementation shows that, in a 0.35-μm CMOS technology, the proposed PPBS scheme reduces the critical path from 7.06 to 4.69 ns compared with the conventional ADFE, at the expense of a reasonable hardware overhead.
The rest of this paper is organized as follows. In Section II, we first review the relaxed look-ahead pipelined ADFE architecture. In Section III, we derive the proposed PPBS-based ADFE architecture. The performance analysis and computer simulation results are presented in Sections IV and V, respectively. In Section VI, we present a design example of the PPBS scheme for the UTP-CAT5 channel, as well as its VLSI implementation. Finally, we conclude our work in Section VII.
II. REVIEW OF PIPELINED ADFE ARCHITECTURE (PIPEADFE1)
In [5], the delayed LMS [6] and the technique of transfer delay relaxation [7] are employed to develop PIPEADFE1. Then, sum relaxation is applied to pipeline the updating circuit of the ADFE. The equations describing PIPEADFE1 (see Fig. 2) are given in [5].
III. PROPOSED PPBS-BASED ADFE ARCHITECTURE
In this section, we introduce the new pipelined ADFE architecture, which is referred to as PPBS-ADFE. For clarity, we demonstrate the basic concept of PPBS-ADFE with a simple example. In this example, we assume that the transmitted binary phase-shift keying (BPSK) signals pass through the ISI channel, and that the numbers of taps in the FFF and FBF are and , respectively. In most wireline communication systems, the channel impulse response can be roughly estimated by laboratory or field measurement. However, the practical channel impulse response still differs from the estimated one. In this paper, we exploit this feature to develop the PPBS-ADFE. From the roughly estimated channel response, we calculate the optimal coefficients, , for the first taps of the FBF in the PPBS-ADFE, and fix these coefficients. Then, the decision of the slicer at time instance , , can be expressed as
where input of the slicer at time instance ; decision of the slicer at time instance , ; vector of received samples at the input of the FFF; input vector for the first coefficients in the FBF;
input vector for the remaining coefficients in the FBF; vector of FFF coefficients; vector of the remaining coefficients in the FBF; vector of the first coefficients in the FBF. The algorithm can be considered as a constrained optimization problem. Mathematically, it can be expressed as (4), where denotes the expectation operator. With the above settings, we can derive the predictive parallel branch slicer (PPBS) scheme. It is similar to the look-ahead computation [4], and can be employed to eliminate the dependencies between and in order to pipeline the DFL. First, the inputs of the PPBS, , can be written as
It can also be expressed as (6), where denotes the residual ISI and noise components. According to (6), we still need to remove the first ISI terms in . There are branches inside the PPBS, as shown in Fig. 3. In the PPBS, the past decisions, , are not available. In branch , we assume that the past decisions are , which are called the tentative past decisions. For convenience, we define a mapping function, , as follows.
This mapping function can be interpreted as a one-to-one mapping from the index of branch to its corresponding tentative past decisions. Hence, the tentative decisions in the branch , , are as follows:
The tentative estimation error of each branch in the PPBS, , is as follows: (9). Since and are fixed, can be precomputed. The function in (8) and (9) can be implemented using a fixed-coefficient adder. Hence, the hardware overhead and the critical path of the PPBS are quite small. Finally, the actual decision of the PPBS, , is one of these tentative decisions, and depends on the past decisions of the PPBS, . Thus, can be expressed as
The above mapping function can be implemented using a -to-one multiplexer. Moreover, the PPBS also needs to output the estimation error corresponding to the final decision of the PPBS, , in order to update the FFF and the remaining FBF coefficients. The estimation error is computed as
This operation can also be implemented by the -to-one multiplexer.
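The branch-and-select operation described above can be sketched for BPSK as follows. This is an illustrative sketch rather than the exact circuit of Fig. 3: the function names, the error sign convention (decision minus slicer input), and the ordering of past decisions (most recent first) are assumptions.

```python
from itertools import product

def slicer(x):
    """BPSK decision device."""
    return 1 if x >= 0 else -1

def branch_patterns(num_fixed_taps, symbols=(-1, 1)):
    """One-to-one mapping from branch index to tentative past decisions."""
    return list(product(symbols, repeat=num_fixed_taps))

def ppbs(x, fixed_coeffs, past_decisions):
    """Return (decision, error) for one PPBS input sample.

    x              -- PPBS input, still containing the first postcursor ISI terms
    fixed_coeffs   -- fixed coefficients of the first FBF taps
    past_decisions -- actual past decisions, most recent first
    """
    patterns = branch_patterns(len(fixed_coeffs))
    # These offsets depend only on fixed quantities, so hardware can precompute them
    offsets = [sum(c * a for c, a in zip(fixed_coeffs, p)) for p in patterns]
    tentative = []
    for off in offsets:
        u = x - off                  # fixed-coefficient adder
        d = slicer(u)                # tentative decision of this branch
        tentative.append((d, d - u))  # tentative decision and tentative error
    # Multiplexer: the actual past decisions select one branch
    select = patterns.index(tuple(past_decisions))
    return tentative[select]

# Hypothetical example with two fixed taps
print(ppbs(0.3, [0.5, -0.2], (1, -1)))
```

All branches are evaluated without waiting for the past decisions; only the final selection depends on them, which is what removes the slicer and adders from the feedback loop.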
The architecture of PPBS-ADFE with is shown in Fig. 3. In general, the updating circuits in PPBS-ADFE, WUF and WUB, are similar to those of PIPEADFE1 [5]. Since there are extra delay elements in the DFL, the iteration bound of PPBS-ADFE can be reduced to , where is the symbol multiplier delay, is the adder delay, and is the slicer delay. Then, the conventional retiming technique [1] can be applied to pipeline the DFL. Compared with PIPEADFE1, the total hardware overhead of PPBS-ADFE is merely adders and two -to-one multiplexers. Because the roughly estimated channel impulse response is inaccurate, the output MSE of PPBS-ADFE degrades as the difference between the estimated and practical channel impulse responses increases. The relation between the output MSE and the inaccuracy of the estimated channel impulse response is analyzed mathematically in the next section. Moreover, PPBS-ADFE clearly reduces to PIPEADFE1 when the coefficients of the first taps in the FBF are zeros.
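The effect of extra delay elements on the iteration bound can be illustrated numerically. The sketch below uses the standard definition of the bound for a single loop (total computation time in the loop divided by the number of delay elements in the loop). The delay values, and the assumption that the loop contains one multiplier, one adder, and one slicer, are hypothetical and are not taken from the expression above.

```python
def iteration_bound(loop_delays_ns, num_delay_elements):
    """Single-loop iteration bound: loop computation time per delay element."""
    return sum(loop_delays_ns) / num_delay_elements

# Hypothetical unit delays in nanoseconds (multiplier, adder, slicer)
loop = [2.0, 1.5, 1.0]

# One delay element in the loop (serial case)
print(iteration_bound(loop, 1))

# Two extra delay elements in the loop (assumed total of three)
print(iteration_bound(loop, 3))
```

Once the bound is lowered in this way, retiming can redistribute the delay elements so that the clock period approaches it.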
IV. PERFORMANCE ANALYSIS OF PPBS-ADFE
In this section, we show the relationship between the output MSE of PPBS-ADFE and the inaccuracy of the roughly estimated channel impulse response through mathematical analysis and computer simulations. In the following analysis, the optimal coefficients (Wiener solution) for the roughly estimated channel impulse response are denoted as , where represents the optimal coefficients of the FFF, and and denote the optimal coefficients of the first taps and of the remaining taps in the FBF, respectively. Similarly, the optimal coefficients for the practical channel impulse response are denoted as , where represents the optimal coefficients of the FFF, and and denote the optimal coefficients of the first taps and of the remaining taps in the FBF, respectively.
Here, we also define the inaccuracy index,
to describe the inaccuracy of the first FBF coefficients in PPBS-ADFE. Clearly, the minimum mean-square error (MMSE) of PPBS-ADFE is the same as that of the conventional (serial) ADFE when the fixed coefficients, , are equal to . Assume that the autocorrelation of the transmitted data is , where is the transmitted data variance. Following the analytical method in [1], the MSE of PPBS-ADFE, given the first FBF coefficients, , can be expressed as (13), where is the autocorrelation of the PPBS-ADFE; is the autocorrelation of the FFF input samples;
is the crosscorrelation of and ; is the crosscorrelation of and ; . Since is a positive-definite matrix, the Cholesky factorization can be applied to factorize . That is, , where is a lower triangular matrix. Then, we can reformulate (13) as (14), where (15). Since is fixed, the MMSE of the PPBS-ADFE, , can be expressed as (16)
Since , we have . Basically, can be interpreted as the MSE degradation of PPBS-ADFE. Here, we define in order to describe how the inaccuracy index of the estimated channel impulse response, , affects . Since is a Hermitian matrix, can be reformulated into the following form: (17), where is a diagonal matrix, and . Equation (17) implies that the PPBS-ADFE is most sensitive to the inaccuracy of the estimated channel impulse response when , where is the eigenvector corresponding to the maximum eigenvalue of , . Therefore, we have the bound of as (18). Here, can also be interpreted as the sensitivity index of the PPBS-ADFE. As long as is within the tolerated range, the output MSE of PPBS-ADFE can be lower than that of PIPEADFE1. This tolerated range is explained in the following simulations.
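The eigenvalue argument above rests on a general property of quadratic forms: for a real symmetric matrix, the quadratic form of a vector never exceeds the largest eigenvalue times the squared norm of that vector, with equality when the vector lies along the corresponding eigenvector. The sketch below checks this for a hypothetical 2 x 2 matrix; the matrix, the coefficient-error vectors, and the function names are illustrative assumptions and are not the quantities of (17) and (18).

```python
import math

def quad_form(a, v):
    """Return v^T A v for a symmetric matrix A given as nested lists."""
    n = len(v)
    return sum(v[i] * a[i][j] * v[j] for i in range(n) for j in range(n))

def max_eig_sym2(a):
    """Largest eigenvalue of a real symmetric 2x2 matrix (closed form)."""
    p, q, r = a[0][0], a[0][1], a[1][1]
    return 0.5 * (p + r) + math.sqrt(0.25 * (p - r) ** 2 + q * q)

def norm_sq(v):
    """Squared Euclidean norm."""
    return sum(x * x for x in v)

# Hypothetical sensitivity matrix and coefficient errors
A = [[2.0, 1.0], [1.0, 2.0]]
lam_max = max_eig_sym2(A)

worst = [0.1, 0.1]     # along the eigenvector of the largest eigenvalue
mild = [0.1, -0.1]     # along the other eigenvector, same norm

for delta in (worst, mild):
    degradation = quad_form(A, delta)
    bound = lam_max * norm_sq(delta)
    print(degradation, bound)
```

Two coefficient errors of equal size can therefore cause different degradations, and the bound is tight only in the worst-case direction.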
V. COMPUTER SIMULATIONS AND HARDWARE COMPARISON
In this section, we show by simulation that the output MSE of the proposed PPBS-ADFE can be lower than that of PIPEADFE1 when the estimated channel impulse response is accurate enough. In addition, we show the requirement on that guarantees that the output MSE of PPBS-ADFE is lower than that of PIPEADFE1.
A. Simulation I
In Simulation I, the practical channel impulse response employed is , which is the typical response of a good-quality telephone channel [2]. Assume that the numbers of taps in the FFF and FBF are and , respectively, and that the input SNR is 18 dB. Using the method in [1], we can calculate the optimal values of the first FBF coefficients for the conventional ADFE, . We then assume that the first FBF coefficients of the PPBS-ADFE are , which are obtained in advance from the optimal solution for the roughly estimated channel impulse response. From (12), the inaccuracy of , , is equal to 0.25. The learning curves of the conventional ADFE, PIPEADFE1, and PPBS-ADFE are depicted in Fig. 4(a), where the three horizontal lines show the theoretical MSE bounds of the conventional ADFE, PIPEADFE1, and PPBS-ADFE, respectively. Next, we change the input SNR to 24 dB and repeat the simulation; the results are shown in Fig. 4(b). The results in Fig. 4(a) and (b) illustrate that the output MSE of the proposed PPBS-ADFE is lower than that of PIPEADFE1 when in this example. The FFF of PIPEADFE1 attempts to cancel all precursor ISI terms and the first postcursor ISI terms. This can be seen in Fig. 5, which shows the combined channel and FFF impulse response for PIPEADFE1 with . In PPBS-ADFE, the FFF does not attempt to eliminate all of the first postcursor ISI terms. Hence, the burden on the FFF in PPBS-ADFE is alleviated. This can be seen in Fig. 6, which shows the combined channel and FFF impulse response for PPBS-ADFE with . The FFF clearly forces the first postcursor ISI terms seen by the FBF to instead of canceling them.
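A combined channel and FFF impulse response of the kind plotted in Figs. 5 and 6 is the convolution of the two responses. The sketch below computes it and splits it into precursor, cursor, and postcursor parts. The channel taps, FFF taps, and cursor position are hypothetical values chosen for illustration and are not the ones used in the simulations.

```python
def convolve(h, w):
    """Linear convolution of two finite impulse responses."""
    out = [0.0] * (len(h) + len(w) - 1)
    for i, hi in enumerate(h):
        for j, wj in enumerate(w):
            out[i + j] += hi * wj
    return out

def split_isi(combined, cursor):
    """Split a combined response into (precursor, cursor tap, postcursor)."""
    return combined[:cursor], combined[cursor], combined[cursor + 1:]

# Hypothetical channel and FFF coefficients
channel = [0.1, 1.0, 0.5, 0.2]
fff = [1.0, -0.3]

combined = convolve(channel, fff)
pre, main, post = split_isi(combined, cursor=1)
print("precursor :", pre)
print("cursor    :", main)
print("postcursor:", post)
```

In such a plot, the leading postcursor samples are the ones that PIPEADFE1 drives toward zero and that PPBS-ADFE instead drives toward its fixed FBF coefficients; the remaining postcursor samples are left to the adaptive FBF taps.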
B. Simulation II
In Simulation II, we show the requirement on that guarantees that the output MSE of PPBS-ADFE is lower than that of PIPEADFE1. From (17), we can see that, as long as , the output MSE of PPBS-ADFE is always superior to that of PIPEADFE1. We illustrate this property with a simple example, using the same parameters and channel environment as in Simulation I except for . In this example, we consider three types of , namely , , and . ( , , and are , , and , respectively, where , , and are scalars.) We then observe their minimum achievable MSE at different by means of the simulation results and the theoretical results. By the method in [1], the minimum achievable MSE of the PPBS-ADFE is (19), where is the step size and is the th eigenvalue of . The simulation results (dashed line) and the theoretical lower bounds (solid line) for each case are shown in Fig. 7. PIPEADFE1 corresponds to the case when . Assume that the MSE degradation of the PPBS-ADFE, , is always lower than the MSE degradation of PIPEADFE1, . Based on the simulation results shown in Fig. 7 and on (17), the maximum inaccuracy index, , must be smaller than in order to guarantee that the output MSE of PPBS-ADFE is lower than that of PIPEADFE1. In other words, must be smaller than . However, this requirement may be loosened in practical situations, since is not always in the direction of . In practice, since is often a combination of all the orthogonal eigenvectors of , the designer can store more than one in the ROM and select one value for the application, thereby achieving a lower output MSE.
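The dependence of the steady-state MSE on the step size and on the eigenvalues can be illustrated with the standard small-step-size LMS approximation, in which the excess MSE is roughly half the step size times the sum of the eigenvalues of the input correlation matrix, scaled by the minimum MSE. The sketch below uses this textbook approximation, which is not necessarily the exact form of (19); the function name and all numerical values are assumptions.

```python
def lms_steady_state_mse(j_min, mu, eigenvalues):
    """Small-step-size approximation: J_ss ~= J_min * (1 + (mu / 2) * sum(lambda_i))."""
    misadjustment = 0.5 * mu * sum(eigenvalues)
    return j_min * (1.0 + misadjustment)

# Hypothetical values, for illustration only
j_min = 0.01               # minimum MSE for a given choice of fixed FBF taps
eigs = [1.0, 0.5, 0.5]     # eigenvalues of the input correlation matrix

for mu in (0.005, 0.01, 0.02):
    print(mu, lms_steady_state_mse(j_min, mu, eigs))
```

Under this approximation, a larger step size raises the steady-state floor, which is one reason a simulated learning curve settles slightly above the theoretical bound.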
C. Comparisons of Hardware Complexity
Next, we compare the hardware complexity of PIPEADFE1 and PPBS-ADFE under the same speedup factor, . Since the fixed-coefficient adders and slicers in the PPBS can be implemented using an adder, we treat these two components as one slicer. The comparisons between PIPEADFE1 and PPBS-ADFE with speedup are listed in Table I. Another ADFE structure, PIPEADFE2, is also proposed in [5]. We refer to it as the pre-processing ADFE (PP-ADFE) in our discussion. In the above simulations, we compare our PPBS-ADFE only with PIPEADFE1, and not directly with PP-ADFE, for the following reasons:
• From a hardware point of view, PIPEADFE1 is the best and easiest way to relax the critical path of the DFL, since it merely inserts delays. The proposed PPBS-ADFE is almost the same as PIPEADFE1 except for the modification of the slicer.
• The PP-ADFE is a totally different architecture from PIPEADFE1, and its better convergence rate comes from the use of an extra filter, the pre-processing unit (PP). Strictly speaking, it is not a direct modification of PIPEADFE1, since it employs approximations to derive the pre-processing scheme. Moreover, the output SNR of PP-ADFE is the same as that of PIPEADFE1, since both force the first taps of the FBF to zero.
Compared with PIPEADFE1 and PP-ADFE, our PPBS-ADFE achieves a better output SNR when the channel characteristic can be estimated in advance, as discussed in Section IV. In Table I, we compare the hardware complexities of the aforementioned four types of ADFE: serial ADFE, PIPEADFE1, PP-ADFE, and PPBS-ADFE. For our design, two issues result in hardware overhead: the large D1 problem and the large constellation problem. We explain both issues as follows:
• Large D1 problem: The hardware complexity becomes very high when D1 is large. However, for most applications, D1 is small (typically 1 to 4), because the speedup factor is , and the circuits usually need not be sped up too much.
• Large constellation problem: The PPBS-ADFE is suitable for low-order constellations such as BPSK, MLT-3, PAM-3, PAM-5, and QPSK, because the PPBS-ADFE with a high-order constellation requires a large hardware overhead. Although the PPBS-ADFE requires additional slicers, the BPSK slicer is just a constant adder, which is very small and poses no timing issue.
In summary, the overhead of PPBS-ADFE is acceptable in low-constellation applications. On the other hand, the complexity of the PP-ADFE is almost the same as that of the serial ADFE. However, we have to highlight the increased hardware in the PP scheme. The multipliers in the FBF are very simple, because their input signals are in the "symbol space". For example, only 3-bit multipliers are needed for the PAM-5 constellation. In contrast, the multipliers of the FFF and PP are full multipliers. For example, if the output of the A/D converter is 10 bits wide and the word length of the weights is 15 bits, the size of the multiplier is as large as 10 × 15. Therefore, the hardware complexity of the PP is very high, and the PP-ADFE has a much higher overhead when implemented in real chips. To sum up, the PPBS-ADFE is a modified version of PIPEADFE1 that keeps the advantages of PIPEADFE1.
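The two overhead issues can be made concrete by counting PPBS branches. The sketch below assumes one branch per combination of tentative past decisions, so that the count is the constellation size raised to the power D1; this is an inference from the one-to-one mapping between branches and tentative past decisions, and it is consistent with the 25 PAM5 slicers reported for the implementation in Section VI. The function name is hypothetical.

```python
def ppbs_branches(constellation_size, d1):
    """Number of PPBS branches, assuming one branch per tentative decision pattern."""
    return constellation_size ** d1

# Branch count for a few one-dimensional constellations and values of D1
for name, size in (("BPSK", 2), ("PAM-3", 3), ("PAM-5", 5)):
    counts = [ppbs_branches(size, d1) for d1 in range(1, 5)]
    print(name, counts)
```

The exponential growth in both arguments is why the scheme is attractive mainly for small D1 and low-order constellations.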
VI. DESIGN EXAMPLE AND VLSI IMPLEMENTATION OF PPBS-ADFE

A. UTP-CAT5 Design Example
In the design example, we apply the PPBS-ADFE to a Gigabit Ethernet application in which the channel is a typical unshielded twisted pair category 5 (UTP-CAT5) channel. We assume that the receiver operates at the optimal sampling phase. The channel model considered in this example is illustrated in Fig. 8. Before presenting the design example of the PPBS-ADFE, we introduce some characteristics of UTP-CAT5 cables.
• The channel impulse responses of UTP-CAT5 cables of different lengths are quite different. The typical channel impulse responses, , with and m are shown in Fig. 9. From Fig. 9, we observe that the impulse responses of UTP-CAT5 cables of different lengths nevertheless have similar shapes.
• The signal-to-near-end-crosstalk (NEXT) ratio (SNR) of a longer cable is much lower than that of a shorter one, since the longer cable introduces larger attenuation.
Assume that we transmit 5-level pulse-amplitude-modulation (PAM5) signals over UTP-CAT5 cables whose lengths range from 0 to 100 m. The worst output MSE of the ADFE clearly occurs when the PAM5 signal is transmitted over the 100-m UTP-CAT5 cable. Hence, we can use the estimated channel impulse response of the 100-m UTP-CAT5 cable to obtain in order to minimize the worst-case output MSE (SNR) of the PPBS-ADFE. Here, we assume , , and . Clearly, is equal to zero when the cable length is 100 m. The input SNR and the output SNR of the conventional ADFE, PIPEADFE1, and PPBS-ADFE for different cable lengths are shown in Fig. 10. From Fig. 10, we can make the following observations:
• The output SNRs of the conventional ADFE, PIPEADFE1, and PPBS-ADFE increase as the cable length decreases. In addition, all approaches suffer less SNR degradation when the cable length is less than 100 m.
• The output SNR of PPBS-ADFE is superior to that of PIPEADFE1 for UTP-CAT5 cable lengths ranging from 100 down to 10 m. Since the channel impulse response of the 0-m UTP-CAT5 cable is close to a delta function, the optimal first taps for the 0-m cable are almost zero. Hence, for the 0-m cable, the output SNR of PIPEADFE1 is higher than that of PPBS-ADFE. However, PIPEADFE1 cannot perform well when the first postcursor ISI terms are significantly large, and the PPBS-ADFE can be employed to resolve this problem.
• The design example demonstrates that the PPBS-ADFE design is very effective for UTP-CAT5 cables. Using the same methodology, we can apply the PPBS-ADFE to other receiver designs for wireline communication systems.
B. VLSI Implementation of PPBS-ADFE
In this section, we implement the PPBS-ADFE of Section VI-A using the Avant! 0.35-μm standard cell library (see Fig. 11), where the zero-forcing LMS is used to decrease the dynamic range of the FFF weights [8]. For comparison purposes, we also apply the same wordlength assignment and the same arithmetic units to implement the conventional ADFE. The synthesis results of the PPBS-ADFE and the conventional ADFE, obtained from the Synopsys Design Compiler, are listed in Table II. We summarize the VLSI implementation results of the conventional ADFE and PPBS-ADFE as follows.
• The synthesis results of the conventional ADFE and PPBS-ADFE are shown in Table II. In our VLSI implementation, the total overhead of PPBS-ADFE is about 70%. The overhead from the PPBS, which contains 25 PAM5 slicers, is about 38.4%. The remaining overhead comes almost entirely from the pipeline registers.
• Because the physical delay of a flip-flop is not zero, the critical path of PPBS-ADFE must be re-expressed as , where is the physical delay of the flip-flop. In the VLSI implementations of the conventional ADFE and PPBS-ADFE, the TimeMill simulation shows that the critical paths of the conventional ADFE and PPBS-ADFE are 7.06 ns and 4.69 ns, respectively. Hence, the throughput rate of PPBS-ADFE can be about 1.5 times that of the conventional ADFE at a reasonable hardware overhead. This demonstrates the effectiveness of the proposed design.
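The throughput figure follows directly from the two reported critical paths. The short calculation below converts each critical path into the corresponding maximum clock rate and takes their ratio; the function name is hypothetical, and the only inputs are the 7.06-ns and 4.69-ns values quoted above.

```python
def max_clock_mhz(critical_path_ns):
    """Maximum clock rate in MHz implied by a critical path given in ns."""
    return 1e3 / critical_path_ns

conventional_ns = 7.06
ppbs_ns = 4.69

f_conventional = max_clock_mhz(conventional_ns)
f_ppbs = max_clock_mhz(ppbs_ns)
speedup = conventional_ns / ppbs_ns

print(f"conventional ADFE: {f_conventional:.1f} MHz")
print(f"PPBS-ADFE        : {f_ppbs:.1f} MHz")
print(f"speedup          : {speedup:.2f}x")
```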
VII. CONCLUSIONS
In this paper, a new pipelined PPBS-based ADFE using the roughly estimated channel impulse response is presented. To ensure that the PPBS-based ADFE can be applied when the pre-estimated channel is not exactly the real channel, we employ the concepts of the sensitivity and inaccuracy indexes to provide a systematic analysis. Compared with the relaxed look-ahead ADFE algorithm in [5] through simulations and analyses, we show that the output MSE of the proposed algorithm can be improved by adding negligible hardware overhead. Therefore, the proposed PPBS-based ADFE can achieve both high SNR and high operating speed. Moreover, we verify the proposed pipelined PPBS-based ADFE architecture by implementing a prototype chip based on the Avant! standard cell library. The target application is Gigabit Ethernet, and the corresponding channel is the UTP-CAT5 cable. Without loss of SNR in the long-cable case, the critical path delay is improved from 7.06 to 4.69 ns compared with the conventional ADFE. To sum up, we provide a novel alternative approach for the design of high-speed pipelined ADFEs when the output MSE is critical.
