Abstract. Echo cancellers play an important role in full-duplex communication systems. Conventional echo cancellers are usually implemented as adaptive transversal filters because of their simplicity and their robust stability and convergence. However, conventional echo cancellers are costly, especially when the echo response is long. In this paper, a new cost-efficient echo canceller architecture targeting the 10GBase-T Ethernet system is presented. The proposed scheme inherits the concept of channel shortening, which is widely employed in DSL systems. A shortened impulse response filter is implemented at the receiver to shorten the impulse response of the echo signal, so the overall cost of the echo cancellers is reduced. We generalize the channel shortening architecture to a joint multi-channel shortening scheme, which can be applied to multiple-input multiple-output wireline communication systems to further reduce the cost of both echo and near-end crosstalk (NEXT) cancellers. We apply the proposed scheme to the 10GBase-T Ethernet system. Simulation results show that the proposed echo and NEXT cancellers save up to 35% of the hardware cost compared with conventional transversal implementations.
Introduction
Full-duplex data transmission over two-wire circuits can be achieved by using hybrid couplers. However, hybrid circuits provide imperfect isolation between the transmitter and the receiver at either end of the link. Part of the transmitted signal leaks directly into the local receiver because of imbalance within the hybrid circuit. Another part of the transmitted signal is reflected by the impedance mismatch of the far-end hybrid and also ends up in the local receiver. Unfortunately, the energy of the echo signal is often larger than that of the received far-end signal. Hence, a high level of echo cancellation is required to prevent the echo from interfering with the far-end signal. In practice, in addition to the 10-20 dB attenuation provided by the analog hybrid circuit, digital echo cancellation is performed to reduce the echo to an acceptable level.
The digital echo canceller is usually implemented as a transversal filter. As shown in Fig. 1a, a replica of the echo is synthesized at the receiver side. By subtracting the echo replica from the received signal, the echo interference is eliminated. The complexity of the echo canceller is determined by the ratio D/T, where D denotes the duration of the echo impulse response and T is the symbol period. When the symbol rate is high and the echo impulse response is long, the complexity of the echo canceller becomes prohibitive. For high-speed applications such as 10GBase-T Ethernet [1], hundreds of taps are needed to achieve the required performance.
The cost of a conventional echo canceller is proportional to the length of the echo impulse response. If we can reduce the length of the echo impulse response, we can reduce the cost of the echo canceller. The channel shortening technique was first used as a pre-filter for maximum likelihood sequence detection receivers [2, 3], in which a Viterbi decoder operates on $M^n$ states, where M is the number of levels of the PAM system and n is the channel memory. Channel shortening reduces the channel memory and thus the cost of the Viterbi decoder. More recently, channel shortening has been widely used in discrete multi-tone (DMT) systems [4, 5, 17-19], in which a cyclic prefix, also called a guard interval, is used to avoid inter-symbol interference (ISI). The cyclic prefix must be as long as the channel memory. Hence, channel shortening is applied to reduce the length of the cyclic prefix and increase the transmission efficiency.
Based on the principle of channel shortening, we propose a low-cost echo canceller architecture targeting the 10GBase-T Ethernet system, as shown in Fig. 1b. A shortened impulse response filter (SIRF) is used to shorten the echo impulse response and thereby reduce the cost of the echo canceller. As shown in Fig. 2, if we can shorten the echo response from the original solid line to the shortened dashed line, the required tap length of the echo canceller is reduced, and so is its cost. Based on this principle, we derive a new cost-efficient echo canceller for copper-based 10G Ethernet systems. The channel shortening technique can also be applied to near-end crosstalk (NEXT) cancellers for cost reduction, since the architecture of NEXT cancellers is similar to that of echo cancellers. In addition, we generalize the concept of channel shortening to joint multi-channel shortening. The SIRF can jointly shorten the echo and NEXT responses to reduce the cost of both the echo and NEXT cancellers. The joint multi-channel shortening architecture can be applied to multiple-input multiple-output (MIMO) systems, in which cross interference is a difficult problem. We apply the joint multi-channel shortening architecture to the 10GBase-T Ethernet system and show that the proposed architecture reduces the cost of the echo and NEXT cancellers.
The rest of this paper is organized as follows. In Section 2, we first review existing work on echo cancellers and then derive the new architecture and algorithm. In Section 3, we generalize the results of Section 2 to joint multi-channel shortening. For modern MIMO wireline communication systems, we provide a low-cost architecture of joint echo and NEXT cancellers, together with a design procedure for determining its design parameters. In Section 4, we apply the proposed scheme to 10GBase-T Ethernet and present computer simulations and a cost comparison. Finally, Section 5 concludes the paper.
Channel Shortening-Based Echo Canceller

Review of Existing Echo Canceller Works
The architecture of a conventional echo canceller is shown in Fig. 1a [6, 7]. The echo signal is modeled as the result of an unintended transmission path between the local transmitter and the local receiver. In a baseband data transmission system, the echo path is essentially linear and varies very slowly with time, e.g., as a result of temperature changes. Therefore, echo cancellation can be achieved by providing a parallel path between transmitter and receiver in which $\hat{e}(n)$, a replica of the received echo e(n), is formed automatically by means of an adaptive finite impulse response (FIR) filter. By subtracting the replica from the incoming signal, the echo is eliminated. To model the echo response correctly, the product of the tap length of the adaptive FIR filter and the system clock period must cover the duration of the echo impulse response. The cost of the echo canceller is therefore proportional to the length of the echo impulse response. When the echo impulse response is long, the cost of the adaptive FIR filter increases significantly because many multipliers and registers are required.
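As an illustration of the conventional structure described above, the following minimal sketch trains an adaptive transversal filter with the LMS rule and subtracts the echo replica from the received signal. The function name `lms_echo_canceller`, the 4-tap echo path, the step size, and the PAM2 training symbols are all hypothetical choices for this toy example and are not taken from the IEEE channel models.

```python
import random

def lms_echo_canceller(tx, rx, num_taps, mu):
    """Adaptive transversal (FIR) echo canceller trained with LMS.

    tx: transmitted symbols (input of the canceller)
    rx: received samples that contain the echo
    Returns (weights, residual), where residual[n] = rx[n] - replica[n].
    """
    w = [0.0] * num_taps
    delay_line = [0.0] * num_taps
    residual = []
    for x, r in zip(tx, rx):
        delay_line = [x] + delay_line[:-1]            # shift in the new symbol
        replica = sum(wi * xi for wi, xi in zip(w, delay_line))
        err = r - replica                             # echo left after cancellation
        w = [wi + mu * err * xi for wi, xi in zip(w, delay_line)]
        residual.append(err)
    return w, residual

# Toy setup (assumed values): short echo path, binary training symbols.
random.seed(0)
echo_path = [0.5, -0.3, 0.2, 0.1]
tx = [random.choice((-1.0, 1.0)) for _ in range(2000)]
rx = [sum(echo_path[k] * tx[n - k] for k in range(len(echo_path)) if n - k >= 0)
      for n in range(len(tx))]
w, residual = lms_echo_canceller(tx, rx, num_taps=4, mu=0.05)
```

Each tap costs one multiplier in the filter and one in the weight update, which is why the tap count dominates the cost when the echo response is long.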
In Fan and Jenkins [8], adaptive infinite impulse response (IIR) filter architectures are proposed as echo cancellers. Adaptive IIR filters can match the poles as well as the zeros of the channel, whereas adaptive FIR filters only approximate the channel response. IIR filter architectures usually cost less than FIR ones because they often require fewer taps. However, adaptive IIR filters suffer from stability and slow-convergence problems because their error surfaces may not be unimodal [9]. In high-speed applications such as 10 Gigabit Ethernet, IIR filter architectures may not converge within the specified time.
We adopt adaptive FIR filter architectures because of their simplicity and their robust stability and convergence. To solve the cost problem of adaptive FIR filter architectures, we apply the channel shortening technique. Although an additional adaptive filter is required to shorten the channel response, the total cost of the proposed architecture is lower than that of the conventional transversal architecture because the cost saved in the echo canceller exceeds the overhead of the additional adaptive filter.
Architecture of the Proposed Channel-Shortening-Based Echo Cancellers
Figure 3a shows the block diagram of channel shortening, in which a SIRF is implemented at the receiver. The purpose of the SIRF is to shorten the impulse response of the effective channel, which can be modeled as the linear convolution of the channel h(n) and the SIRF w(n). That is,
$$h_{\mathrm{eff}}(n) = h(n) * w(n). \tag{1}$$
Generally speaking, the effective channel is longer than the original channel, since its length is about the sum of the lengths of the original channel and the SIRF. However, if we confine the energy of the effective channel to n consecutive taps, with n smaller than the length of the original channel, channel shortening is achieved. As shown in Fig. 3b, the goal of the SIRF is to confine the energy of the effective channel to an n-tap window. Moreover, because the n largest consecutive samples do not necessarily begin with the first sample, a delay parameter d is added, and the receiver compensates by delaying the start of the received symbols.
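The window and delay described above can be made concrete with a small sketch. It convolves a channel with a SIRF, then slides an n-tap window over the effective channel to find the delay d that captures the most energy. The helper names `convolve` and `best_window` and the numerical values of `h` and `w` are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def convolve(a, b):
    """Linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def best_window(h_eff, n):
    """Return (d, in_window_energy, total_energy) for the n consecutive taps
    of h_eff that hold the largest energy; d is the window start."""
    energy = [v * v for v in h_eff]
    total = sum(energy)
    window = sum(energy[:n])
    best_d, best_e = 0, window
    for d in range(1, len(h_eff) - n + 1):
        window += energy[d + n - 1] - energy[d - 1]   # slide by one tap
        if window > best_e:
            best_d, best_e = d, window
    return best_d, best_e, total

# Assumed toy channel and SIRF; the effective channel has 6 + 2 - 1 = 7 taps.
h = [0.05, 0.2, 1.0, 0.8, 0.64, 0.512]
w = [1.0, -0.8]
h_eff = convolve(h, w)
d, e_in, e_total = best_window(h_eff, n=2)
```

In this toy case the effective channel is longer than the original one, yet most of its energy falls inside a 2-tap window that starts at a nonzero delay, which is the situation the delay parameter d is meant to handle.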
Various approaches have been proposed to find the optimal SIRF coefficients, w(n), with different design goals [10-18]. In Melsa et al. [12] and Milosevic et al. [13], the design goal is to maximize the shortening signal-to-noise ratio (SNR), which is the ratio of the energy in the largest n consecutive samples to the energy in the remaining samples. In Arslan et al. [10] and Vanbleu et al. [11], the design goal is to maximize the bit rate of the DMT system. In addition to these approaches, minimum mean-squared error (MMSE)-based algorithms have also been proposed for channel shortening. We adopt the MMSE-based algorithms because the design goal of an echo canceller is to reduce the echo signal power. The architecture of MMSE-based channel shortening is shown in Fig. 4. It introduces another adaptive FIR filter, b(n), which represents the response within the n-tap window, i.e., the shortened response.
The error is defined as the difference between the output of the effective channel and the output of the desired shortened channel. By minimizing this error, the response of the effective channel approaches the desired shortened channel. A delay d is inserted in the lower path to control the start position of the window. Hence, the cascade response of h(n) and w(n) equals b(n) with delay d, i.e.,
$$h(n) * w(n) = b(n - d). \tag{2}$$
In Bladel and Moeneclaey [16], a least squares (LS) algorithm is proposed. The LS algorithm collects statistics during transmission and then performs complex operations such as matrix inversion and eigenvalue computation. The auto-regressive moving average algorithm [17] divides the N-dimensional matrix inversion of the LS algorithm into N iterations, each of which requires only a 2-dimensional matrix inversion. In this paper, the least mean square (LMS) algorithm [18] is adopted because its simple architecture is the most suitable for hardware implementation.
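In Fig. 4, we have shown the block diagram of MMSE-based channel shortening. Equation 2 implies that the computed solution, w(n) and b(n), minimizes the mean square error, $E\{|e(n)|^2\}$, where e(n) denotes the error between z(n) and d(n):
$$e(n) = d(n) - z(n) = \mathbf{b}^T \mathbf{x}_n - \mathbf{w}^T \mathbf{y}_n. \tag{3}$$
Here $\mathbf{x}_n$ and $\mathbf{y}_n$ represent the tap-input vectors of the target channel b(n) and of the SIRF w(n), which is fed by the output of the channel h(n), respectively. We define the cost function J as the mean square error:
$$J = E\{|e(n)|^2\} = \mathbf{b}^T R_{xx} \mathbf{b} - 2\,\mathbf{w}^T R_{yx} \mathbf{b} + \mathbf{w}^T R_{yy} \mathbf{w}, \tag{5}$$
where $R_{xx} = E\{\mathbf{x}\mathbf{x}^T\}$, $R_{yy} = E\{\mathbf{y}\mathbf{y}^T\}$ and $R_{yx} = E\{\mathbf{y}\mathbf{x}^T\}$. In Chang [18], the LMS approach is proposed to compute the coefficients based on the steepest descent algorithm, in which the coefficients are updated iteratively. The successive adjustments to the tap weights $\mathbf{w}_n$ and $\mathbf{b}_n$ at iteration n are made in the direction of steepest descent of the error surface, that is, in the direction opposite to the gradient vector:
$$\mathbf{b}_{n+1} = \mathbf{b}_n - \tfrac{1}{2}\mu_b \nabla J_b, \qquad \mathbf{w}_{n+1} = \mathbf{w}_n - \tfrac{1}{2}\mu_w \nabla J_w, \tag{6}$$
where $\mu_b$ and $\mu_w$ are the step sizes, and $\nabla J_b$ and $\nabla J_w$ are the gradient vectors with respect to b and w, respectively. J is the mean-square-error cost function defined in Eq. 5. Taking partial derivatives of the mean square error J, i.e., $E\{|e(n)|^2\}$, we obtain the gradient vectors $\nabla J_b$ and $\nabla J_w$:
$$\nabla J_b = 2E\{e(n)\,\mathbf{x}_n\}, \qquad \nabla J_w = -2E\{e(n)\,\mathbf{y}_n\}. \tag{7}$$
The LMS algorithm uses instantaneous values in place of ensemble statistics. That is, substituting Eq. 7 into Eq. 6 and dropping the expectation operators in Eq. 7, the updated tap weights $\mathbf{b}_{n+1}$ and $\mathbf{w}_{n+1}$ at iteration n+1 are
$$\mathbf{b}_{n+1} = \mathbf{b}_n - \mu_b\, e(n)\,\mathbf{x}_n, \qquad \mathbf{w}_{n+1} = \mathbf{w}_n + \mu_w\, e(n)\,\mathbf{y}_n. \tag{8}$$
From Eq. 8, we can derive the architecture of LMS-based channel shortening, as shown in Fig. 5. The error is fed back to the SIRF w(n) and the target channel b(n) to adjust their weight coefficients. The step sizes $\mu_b$ and $\mu_w$ are important parameters of the algorithm: their choice affects the convergence speed, the stability, and the shortening performance. In addition, an energy constraint, i.e., $\mathbf{b}^T \mathbf{b} = 1$, is imposed to avoid the trivial solution. With these settings, the mean square error, $E\{|e(n)|^2\}$, converges. This implies that the cascade of h(n) and w(n) approaches b(n), which has only n taps. In other words, the impulse response h(n) is shortened to b(n).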
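The LMS channel shortening loop can be sketched in a few lines. The sketch below follows the update directions of the derivation (the target response b is moved against the error, the SIRF w along it) with the error taken as e(n) = d(n) - z(n). Renormalizing b after every step is one assumed way of keeping the energy constraint; the source does not state how the constraint is enforced. The function name, the toy channel `h`, and all step sizes are hypothetical.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def lms_channel_shortening(x, y, n_b, n_w, d, mu_b, mu_w):
    """LMS channel shortening sketch.

    x: transmitted symbols, y: channel output (input of the SIRF w)
    n_b: window size (taps of the target response b), n_w: taps of the SIRF
    d: synchronization delay applied to x before the target filter
    """
    b = [1.0] + [0.0] * (n_b - 1)        # unit-energy start avoids b = w = 0
    w = [0.0] * n_w
    x_line = [0.0] * (d + n_b)
    y_line = [0.0] * n_w
    errors = []
    for x_n, y_n in zip(x, y):
        x_line = [x_n] + x_line[:-1]
        y_line = [y_n] + y_line[:-1]
        x_vec = x_line[d:]               # x(n-d), ..., x(n-d-n_b+1)
        e = dot(b, x_vec) - dot(w, y_line)            # e(n) = d(n) - z(n)
        b = [bi - mu_b * e * xi for bi, xi in zip(b, x_vec)]
        w = [wi + mu_w * e * yi for wi, yi in zip(w, y_line)]
        norm = math.sqrt(dot(b, b))
        b = [bi / norm for bi in b]      # assumed way to keep b^T b = 1
        errors.append(e)
    return b, w, errors

# Toy example (assumed values): a slowly decaying 16-tap channel, PAM2 symbols.
random.seed(1)
h = [0.8 ** k for k in range(16)]
x = [random.choice((-1.0, 1.0)) for _ in range(20000)]
y = [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
     for n in range(len(x))]
b, w, errors = lms_channel_shortening(x, y, n_b=2, n_w=4, d=0,
                                      mu_b=0.01, mu_w=0.01)
tail_mse = sum(e * e for e in errors[-1000:]) / 1000
```

With white unit-power symbols, the mean square error equals the energy of the difference between the effective channel and the delayed target response, so a small `tail_mse` indicates that the cascade of `h` and `w` has been confined to the 2-tap window.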
We apply the channel shortening technique to the echo interference. The channel response in Fig. 5 is the echo response specified in IEEE [1], and the target channel response is the shortened echo response. The impulse response of the echo is shown in Fig. 6a; its duration is 500 samples. The channel shortening algorithm discussed above requires a window location on the impulse response to be chosen before the SIRF coefficients are calculated. We therefore use the optimal shortening algorithm [12] with a sliding-window method to find the optimal window delay, because the optimal shortening algorithm does not require any other parameters that affect the shortening performance. After finding the optimal window delay, we apply LMS-based channel shortening to the echo with a window size of 300. The shortened response is shown in Fig. 6b. The energy of the shortened response is concentrated in samples 170 to 470. Compared with the original echo response, the ripples at the start and end of the response are removed.
We have thus proposed a low-cost channel-shortening echo canceller, in which a SIRF is added at the receiver to reduce the length of the echo impulse response. However, in MIMO systems such as 1000Base-T and 10GBase-T, the channel impairments include not only echo but also NEXT, and both echo and NEXT cancellers are implemented to eliminate these unintended interferences. If the SIRF can shorten not only the echo impulse response but also the three NEXT impulse responses, the cost of both the echo and NEXT cancellers can be reduced. Therefore, we generalize the channel shortening architecture to a joint multi-channel shortening scheme in the next section.
Generalized Joint Shortening Echo and NEXT Cancellers
Architectures of the Joint Shortening Echo and NEXT Cancellers
In Melsa et al. [12], Al-Dhahir [14], and Milosevic et al. [15], a joint shortening approach is applied to DMT systems, in which the SIRF jointly shortens the channel and the echo. However, these algorithms require large matrix operations and are therefore not suitable for hardware implementation. In this section, we generalize the two-channel shortening concept of Al-Dhahir [14] to multi-channel shortening, so that the SIRF jointly shortens the echo and the three NEXT responses. In addition, we derive the corresponding LMS algorithm and hardware architecture. The concept of joint shortening is very similar to that of channel shortening. We first take two-channel joint shortening as an example and then generalize the results to multi-channel shortening.
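In Section 2, we discussed the LMS approach to channel shortening. In this section, we derive the LMS joint two-channel shortening approach from that result. The SIRF is now used to jointly shorten two channels. Thus, we first rewrite the error in Eq. 3 as
$$e(n) = d_1(n) + d_2(n) - z(n), \tag{9}$$
where $d_1(n)$ and $d_2(n)$ are the outputs of the target response filters $\mathbf{b}_1$ and $\mathbf{b}_2$, respectively, and z(n) is the output of the SIRF w. We use the LMS approach to compute the coefficients of $\mathbf{b}_1$, $\mathbf{b}_2$ and $\mathbf{w}$, based on the steepest descent algorithm, in which the coefficients are updated iteratively. The successive adjustments of $\mathbf{b}_1$, $\mathbf{b}_2$ and $\mathbf{w}$ at iteration n are made in the direction opposite to the gradient vector:
$$\mathbf{b}_{1,n+1} = \mathbf{b}_{1,n} - \tfrac{1}{2}\mu_{b1} \nabla J_{b1}, \qquad \mathbf{b}_{2,n+1} = \mathbf{b}_{2,n} - \tfrac{1}{2}\mu_{b2} \nabla J_{b2}, \qquad \mathbf{w}_{n+1} = \mathbf{w}_n - \tfrac{1}{2}\mu_w \nabla J_w, \tag{10}$$
where $\mu_{b1}$, $\mu_{b2}$ and $\mu_w$ are the step sizes, and $\nabla J_{b1}$, $\nabla J_{b2}$ and $\nabla J_w$ are the gradient vectors with respect to $\mathbf{b}_1$, $\mathbf{b}_2$ and $\mathbf{w}$, respectively. J is the mean-square-error cost function, that is,
$$J = E\{|e(n)|^2\}. \tag{11}$$
Taking partial derivatives of the mean square error J, we obtain the gradient vectors $\nabla J_{b1}$, $\nabla J_{b2}$ and $\nabla J_w$:
$$\nabla J_{b1} = 2E\{e(n)\,\mathbf{x}_{1,n}\}, \qquad \nabla J_{b2} = 2E\{e(n)\,\mathbf{x}_{2,n}\}, \qquad \nabla J_w = -2E\{e(n)\,\mathbf{y}_n\}. \tag{12}$$
Substituting Eq. 12 into Eq. 10 and dropping the expectation operators, the updated tap weights $\mathbf{b}_{1,n+1}$, $\mathbf{b}_{2,n+1}$ and $\mathbf{w}_{n+1}$ at iteration n+1 become
$$\begin{aligned} e(n) &= \mathbf{b}_{1,n}^T \mathbf{x}_{1,n} + \mathbf{b}_{2,n}^T \mathbf{x}_{2,n} - \mathbf{w}_n^T \mathbf{y}_n, \\ \mathbf{b}_{1,n+1} &= \mathbf{b}_{1,n} - \mu_{b1}\, e(n)\,\mathbf{x}_{1,n}, \\ \mathbf{b}_{2,n+1} &= \mathbf{b}_{2,n} - \mu_{b2}\, e(n)\,\mathbf{x}_{2,n}, \\ \mathbf{w}_{n+1} &= \mathbf{w}_n + \mu_w\, e(n)\,\mathbf{y}_n. \end{aligned} \tag{13}$$
The corresponding architecture is shown in Fig. 7.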
Comparing the LMS-based channel shortening architecture shown in Fig. 5 with the LMS-based joint two-channel shortening architecture shown in Fig. 7, we find that the only difference is an extra target response filter b and a delay. Therefore, we can generalize the results above to multi-channel joint shortening with the LMS approach. Suppose we want to jointly shorten N channels. The architecture is shown in Fig. 8. The update mechanism of joint multi-channel shortening can be deduced directly from Eq. 13:
$$\begin{aligned} e(n) &= \sum_{i=1}^{N} \mathbf{b}_{i,n}^T \mathbf{x}_{i,n} - \mathbf{w}_n^T \mathbf{y}_n, \\ \mathbf{b}_{i,n+1} &= \mathbf{b}_{i,n} - \mu_{bi}\, e(n)\,\mathbf{x}_{i,n}, \quad i = 1, \dots, N, \\ \mathbf{w}_{n+1} &= \mathbf{w}_n + \mu_w\, e(n)\,\mathbf{y}_n. \end{aligned} \tag{14}$$
The energy constraints are set as $\mathbf{b}_{i,n}^T \mathbf{b}_{i,n} = 1$ to avoid the trivial solution.
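The joint update can be sketched as a direct extension of the single-channel loop: one SIRF w is shared, and each of the N interference paths has its own target response b_i and delay. As before, per-step renormalization of each b_i is an assumed way of enforcing the unit-energy constraints, and the function name, the two toy paths `h1` and `h2`, and the step sizes are hypothetical. A single step size `mu_b` is used for all target filters to keep the sketch short.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def fir(h, x):
    """Output of FIR filter h driven by x, truncated to len(x)."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]

def joint_lms_shortening(xs, y, n_b, n_w, delays, mu_b, mu_w):
    """Jointly shorten N channels with one SIRF w and N target responses b_i.

    xs: list of N transmitted sequences, y: received interference (SIRF input)
    """
    num = len(xs)
    bs = [[1.0] + [0.0] * (n_b - 1) for _ in range(num)]
    w = [0.0] * n_w
    x_lines = [[0.0] * (delays[i] + n_b) for i in range(num)]
    y_line = [0.0] * n_w
    errors = []
    for n in range(len(y)):
        y_line = [y[n]] + y_line[:-1]
        x_vecs = []
        for i in range(num):
            x_lines[i] = [xs[i][n]] + x_lines[i][:-1]
            x_vecs.append(x_lines[i][delays[i]:])
        # e(n) = sum_i b_i^T x_i - w^T y
        e = sum(dot(bs[i], x_vecs[i]) for i in range(num)) - dot(w, y_line)
        for i in range(num):
            b = [bi - mu_b * e * xi for bi, xi in zip(bs[i], x_vecs[i])]
            norm = math.sqrt(dot(b, b))
            bs[i] = [bi / norm for bi in b]   # assumed: keep b_i^T b_i = 1
        w = [wi + mu_w * e * yi for wi, yi in zip(w, y_line)]
        errors.append(e)
    return bs, w, errors

# Toy example (assumed values): two interference paths sharing a slow decay.
random.seed(2)
h1 = [0.8 ** k for k in range(16)]
h2 = [0.6] + [1.28 * 0.8 ** (k - 1) for k in range(1, 16)]
x1 = [random.choice((-1.0, 1.0)) for _ in range(20000)]
x2 = [random.choice((-1.0, 1.0)) for _ in range(20000)]
y = [a + c for a, c in zip(fir(h1, x1), fir(h2, x2))]
bs, w, errors = joint_lms_shortening([x1, x2], y, n_b=2, n_w=4,
                                     delays=[0, 0], mu_b=0.005, mu_w=0.005)
# With w = 0 and b_i = [1, 0] at the start, the initial mean square error is 2.
tail_mse = sum(e * e for e in errors[-1000:]) / 1000
```

Because one filter must serve all paths, the achievable residual depends on how similar the paths are, which matches the observation that joint shortening trades some per-channel performance for a shared SIRF.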
Analysis of the Echo and NEXT Loss Enhancement Performance
Although the proposed echo and NEXT cancellers provide a low-cost design approach, their design parameters are not easy to determine because of the large number of degrees of freedom. Moreover, many adaptive FIR filters are trained simultaneously, and they may converge to local minima rather than the global minimum. It is recognized that finding the optimal parameters is difficult [19]. In this paper, we suggest a heuristic design procedure. Instead of exhaustively searching the whole design space, we search for the optimal parameters of each channel step by step. Before discussing the design procedure, we first define the performance measure, the echo and NEXT loss enhancement (ENLE) [20], which is the ratio of the root-mean-square (RMS) value of the input echo and NEXT, $R_i$, to the RMS value of the output residual echo and NEXT, $R_o$. It is often expressed in dB.
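A minimal sketch of the ENLE measure is given below. It assumes the usual amplitude-ratio convention for decibels, 20 log10(R_i / R_o), which the text does not spell out; the function names and the sample values are hypothetical.

```python
import math

def rms(samples):
    """Root-mean-square value of a sequence."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def enle_db(interference_in, residual_out):
    """ENLE in dB: RMS of the input echo and NEXT over RMS of the residual.

    Assumes the amplitude-ratio convention 20 * log10(R_i / R_o).
    """
    return 20.0 * math.log10(rms(interference_in) / rms(residual_out))

# Assumed toy data: the residual is 100 times smaller in amplitude.
value = enle_db([1.0, -1.0, 1.0, -1.0], [0.01, -0.01, 0.01, -0.01])
```

Under this convention, a residual that is 100 times smaller in RMS amplitude than the input interference corresponds to an ENLE of 40 dB.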
We analyze the ENLE performance in qualitative terms. From Fig. 5, it is clear that the synchronization delay and the window size of the channel jointly affect the ENLE performance. Theoretically, for a given window size, there exists an optimal synchronization delay that achieves the best ENLE performance. However, the search space becomes very large because every window size has its own optimal synchronization delay. To reduce the search space, we examine the ENLE performance further. We focus first on the echo channel model specified in IEEE [1]; the NEXT channel models are treated in the same way. Figure 9 shows a three-dimensional plot of the ENLE performance of the echo versus the synchronization delay and the window size. For a fixed synchronization delay, the ENLE performance improves as the window size increases. Figure 10 shows the relationship between the ENLE performance and the synchronization delay for several fixed window sizes. For each window size, the best ENLE performance is achieved at almost the same optimal synchronization delay. This means that the optimal synchronization delay found for one specific window size also holds for all other window sizes. We can therefore find the optimal synchronization delay first and then choose the window size to meet the target ENLE. Hence, we can greatly reduce the search space for the optimal synchronization delay of each channel, even though there is no direct relationship between the synchronization delay and the ENLE performance [19]. Finally, this observation also suggests that our design flow, which is discussed later, comes very close to the optimal solution for LMS-based channel shortening.
Design Flow of the Joint Shortening Echo and NEXT Cancellers
Basically, there are three design steps. The first two steps determine the optimal parameters of each channel. The last step combines the parameters and adjusts them to meet the target ENLE. The detailed procedure is as follows:
Step 1 Determine the optimal synchronization delay of each channel. The synchronization delay d in Eq. 2 plays an important role in the design of channel shortening. Unfortunately, there is no direct relationship between the synchronization delay and the ENLE performance [19]. However, because the optimal synchronization delay is almost the same for every window size, we can greatly reduce the search space for the optimal synchronization delay.
Step 2 Determine the window size of each channel. After the optimal synchronization delay is determined, we search for the optimal window size n of each channel. A larger window size yields better performance. For the target ENLE, we first try a large window size. By shrinking the window size while monitoring the ENLE, we obtain the optimal window size.
Step 3 Join the parameters and adjust them together. When the SIRF is used to jointly shorten multiple channels, the performance degrades because the shortening is no longer concentrated on one channel. Hence, we adjust the delays or increase the window sizes to achieve the target ENLE.
Figure 11 illustrates the flowchart of the proposed design methodology and optimization procedure.
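The first two steps of the procedure can be sketched as a pair of one-dimensional searches. The sketch assumes a callable `evaluate_enle(delay, window)` that runs a shortening simulation for one channel and returns its ENLE in dB; that callable, the function names, and the `toy_enle` stand-in are hypothetical. Step 3, the joint adjustment, is not covered here.

```python
def find_optimal_delay(evaluate_enle, delays, reference_window):
    """Step 1: pick the delay with the best ENLE at one reference window size."""
    return max(delays, key=lambda d: evaluate_enle(d, reference_window))

def find_window_size(evaluate_enle, delay, max_window, target_db, step):
    """Step 2: shrink the window from max_window while the target is still met."""
    if evaluate_enle(delay, max_window) < target_db:
        return None  # target not reachable even with the largest window
    window = max_window
    while window - step > 0 and evaluate_enle(delay, window - step) >= target_db:
        window -= step
    return window

# Hypothetical stand-in for a simulation: ENLE grows with the window size
# and peaks at delay 30 regardless of the window size.
def toy_enle(delay, window):
    return window / 5 - abs(delay - 30) / 2

delay = find_optimal_delay(toy_enle, range(0, 61), reference_window=300)
window = find_window_size(toy_enle, delay, max_window=300,
                          target_db=45.0, step=5)
```

Searching the delay once and the window once replaces a two-dimensional sweep with two one-dimensional ones, which is the saving the procedure relies on.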
Computer Simulations and Hardware Comparison
In the simulations, the proposed scheme is applied to 10 Gigabit Ethernet. 10GBase-T achieves 10 Gbps full-duplex transmission over four unshielded twisted pair (UTP) copper lines. The line coding is assumed to be 12-level pulse-amplitude modulation (PAM12) over Cat. 6 UTP, and the cable length is 55 m. The equalization scheme is Tomlinson-Harashima precoding (THP), in which the post-cursor ISI is handled at the transmitter to avoid the error propagation problem of the conventional decision feedback equalizer (DFE). The channel impairments considered here are insertion loss, echo, and NEXT; their models are available from the IEEE 802.3an website. We assume that the receiver operates at the correct sampling phase. The transceiver architecture of 10GBase-T is shown in Fig. 12a. For each line, there are one echo canceller and three NEXT cancellers to eliminate the interference. Since the transmitted symbols are filtered by the THP before echo and NEXT cancellation is performed, the input wordlength of the echo and NEXT cancellers is very large. The target performance of the echo and NEXT cancellers is set to ENLE = 45 dB.
We apply the proposed scheme to the 10GBase-T Ethernet system, as shown in Fig. 12b. A SIRF is implemented just after the ADC block. For the proposed architecture, both the N=1 and N=4 schemes are simulated. In the N=1 case, the SIRF is applied to shorten only the echo response, because the echo is usually stronger and longer than the NEXT. In the N=4 case, the SIRF is applied to shorten the echo and the three NEXT responses simultaneously in order to further reduce the total hardware cost of the 10GBase-T Ethernet system.
The eye diagram and the corresponding flow chart are shown in Fig. 13. We divide the system operation into three stages. The first stage is the training mode of the echo and NEXT cancellers. At the end of this stage, the eye converges to the zero level because the interference is eliminated. The second stage is the equalizer training mode, in which the interference is eliminated by the echo and NEXT cancellers and the DFE is turned on to compensate for the ISI. At the end of this stage, the equalizer coefficients have converged and the eye is PAM2. Finally, once all the channel impairments are compensated by the receiver, PAM12 data symbols can be transmitted, and the eye has 12 levels. The learning curve of the equalizer is shown in Fig. 14. We perform 200 independent random runs and average the results. All three architectures meet the SNR requirement of 23.8 dB. When the SNR is greater than 23.8 dB, a BER of $10^{-12}$ can be achieved with a low-density parity-check code [1, 21]. The performance gap between the proposed architectures and the conventional one is less than 1 dB, which indicates that a large amount of hardware can be saved with negligible performance degradation. The performance and cost comparison of the joint echo and NEXT cancellers is summarized in Table 1. In the N=1 case, the tap number of the echo canceller is reduced by 40%. In the N=4 case, the tap number of the echo canceller is reduced by 35% and that of the NEXT cancellers by 50%. The reduction for the echo canceller is smaller in the N=4 case than in the N=1 case because the shortening in the N=4 case is not concentrated on the echo signal.
Although the proposed schemes effectively reduce the cost of the echo and NEXT cancellers, the SIRF at the receiver also affects the received signal: the effective channel becomes the cascade of the channel and the SIRF. Because the SIRF shortens only the echo and NEXT impulse responses, the effective channel impulse response is longer than the original one, which increases the cost of the equalizer. To analyze this overhead, we perform fixed-point system simulations and estimate the cost. The optimal wordlength of each DSP component is summarized in Table 2, where $W_d$ is the wordlength of the data input, $W_c$ is the wordlength of the coefficient input, and $W_o$ is the wordlength of the output. In Fig. 12, most blocks are composed of adaptive filters. Once the wordlengths are determined, we can estimate the hardware cost. Figure 15 shows the architecture of the adaptive FIR filter. An N-tap adaptive FIR filter consists of two main blocks. One is the transversal filter, which consists of N-1 delay elements, N-1 adders, and N multipliers. The other is the weight update block, in which each weight update unit requires a delay element, a multiplier, and an adder, as depicted on the right side of Fig. 15. Hence, the hardware cost of an adaptive filter can be divided into two parts, storage and arithmetic units. Here we assume Baugh-Wooley multipliers and ripple-carry adders. The cost comparison is summarized in Table 3. The equalizer overhead introduced is smaller than the hardware saved in the echo and NEXT cancellers. Therefore, the proposed architecture reduces the cost of the 10GBase-T transceiver by about 12% and 35% for N=1 and N=4, respectively.
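The resource count of an N-tap adaptive FIR filter stated above can be tallied with a short sketch. It assumes one weight update unit per tap and counts components only, ignoring the wordlengths of Table 2, so it is a rough unit count rather than the cost model behind Table 3. The function name is hypothetical, and the tap counts 500 and 300 are borrowed from the echo duration and window size of the Section 2 example purely for illustration.

```python
def adaptive_fir_resources(num_taps):
    """Component count of an N-tap adaptive FIR filter.

    Transversal filter: N-1 delays, N-1 adders, N multipliers.
    Weight update: one delay, one multiplier and one adder per unit,
    assuming one update unit per tap.
    """
    filter_part = {"delays": num_taps - 1,
                   "adders": num_taps - 1,
                   "multipliers": num_taps}
    update_part = {"delays": num_taps,
                   "adders": num_taps,
                   "multipliers": num_taps}
    return {key: filter_part[key] + update_part[key] for key in filter_part}

# Illustrative tap counts only (assumed, not the values of Table 1).
full = adaptive_fir_resources(500)
short = adaptive_fir_resources(300)
multiplier_saving = 1 - short["multipliers"] / full["multipliers"]
```

Since every component count scales linearly with the number of taps, a given percentage reduction in taps translates into roughly the same percentage reduction in multipliers, adders, and registers for that filter.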
Conclusions
In this paper, we propose a new cost-efficient architecture of echo and NEXT cancellers. The cost of the echo and NEXT cancellers is reduced by shortening the echo and NEXT impulse responses. We generalize the concept of LMS channel shortening to MIMO models and propose a design procedure to determine the design parameters. We apply the proposed architecture to the 10GBase-T Ethernet system. Simulation results show that the proposed schemes for N=1 and N=4 reduce the cost by about 12% and 35%, respectively.
Yen-Liang Chen (S'07) was born in Taiwan 
