Presented in this paper are low-power and high-speed algorithms and architectures for complex adaptive filters. These architectures have been derived via the application of algebraic and algorithm transformations. The strength reduction transformation is applied at the algorithmic level as opposed to the traditional application at the architectural level. This results in a power reduction of 21% as compared to the traditional cross-coupled structure. A fine-grain pipelined architecture for the strength-reduced algorithm is then developed via the relaxed look-ahead transformation. This technique, which is an approximation to the conventional look-ahead computation, maintains the functionality of the algorithm rather than the input-output behavior. Convergence analysis of the proposed architecture is presented and supported via simulation results. The pipelined architecture allows high-speed operation with negligible hardware overhead. It also enables an additional power savings of 39% to 69% when combined with power-supply reduction. Thus, an overall power reduction ranging from 60% to 90% over the traditional cross-coupled architecture is achieved. The proposed architecture is then employed as a receive equalizer in a communication system for a data rate of 51.84 Mb/s over 100 m of UTP-3 wiring in an ATM-LAN environment. Simulation results indicate that speed-ups of up to 156 can be achieved with about 0.8 dB loss in performance.
I. INTRODUCTION
Digital communications systems are currently being developed for high-bit rate transmission over bandlimited channels. These applications include asymmetric digital subscriber loop (ADSL) [8, 22], high-speed digital subscriber loop (HDSL) [20, 26, 45], very high-speed digital subscriber loop (VHDSL) [7, 17], ATM-LAN [18] and interactive multimedia television (IMTV) [19], high-density magnetic recording [9], wireless systems [1], and digital High-Definition TV (HDTV) transmission [29, 30]. In each of these applications, the bandlimited nature of the channel and the required performance levels necessitate the use of highly complex digital communications algorithms.
In addition to the increasing computational complexity, the requirements on a silicon implementation have also become more stringent. A cost-effective silicon implementation is critical for the successful deployment of any new technology. Therefore, constraints from a VLSI implementation perspective such as power dissipation, area, speed and reliability also come into the picture. High-speed and low-power algorithms and architectures are in great demand for all the above-mentioned applications. In particular, the advent of mobile applications has generated a great amount of interest in the design of low-power VLSI communications systems. Even in tethered applications, a low-power solution provides the added benefits of increased reliability and reduced packaging costs.
Design of low-power VLSI systems is presently an active area of research [5], [6], [16]. Power-reduction techniques have been proposed at all levels of the design hierarchy, beginning with algorithms and architectures and ending with circuits and technological innovations. Existing techniques include those at the algorithmic level (such as reduced-complexity algorithms [5]) and the architectural level (such as pipelining [25, 32]), among others [11]. It is now well recognized that an astute algorithmic and architectural design can have a large impact on the final power dissipation characteristics of the fabricated VLSI solution. In this paper, we will investigate algorithms and architectures for low-power and high-speed adaptive filters.
Adaptive equalizers are a major component of receivers in modern-day communications systems, accounting for up to 90% of the gate-count [38]. They are employed to combat various channel impairments such as intersymbol interference (ISI), channel variations, crosstalk, timing jitter, etc. With the drive towards increasingly higher transmission rates, there is a corresponding increase in the complexity of the adaptive receivers. Hence, there is a tremendous need for power, area and speed optimized adaptive equalizer architectures.
Traditionally, the focus in algorithm design has been to obtain performance in terms of better signal-to-noise ratios (SNR) and/or bit-error rates (BER). The present trend is to trade off a small amount of performance via algorithm transformation techniques [31] for a much superior VLSI architecture. Algorithm transformation techniques [6, 31] such as look-ahead [32], relaxed look-ahead [37], block-processing [33], associativity [36], unfolding [15, 34], folding [35], and retiming [21] have all been employed to design high-speed algorithms and architectures. Low-power operation was then achieved by trading off excess speed for power.
Of particular interest is a class of transformations known as algebraic transformations [36]. These transformations have been proposed to achieve arbitrarily high speed-ups in recursive algorithms. Strength reduction [5] is an algebraic transformation, which has been applied at the architectural level to trade off multiplications with additions. This results in an overall savings in area and power as multipliers are more expensive (both in terms of area and power) than adders. A key contribution of this paper is the application of the strength reduction transformation at the algorithmic level (instead of the architectural level) to obtain low-power adaptive filter algorithms. An algorithmic level application of strength reduction is shown to be much more effective in achieving power reduction as compared to an architectural level application.
for 51.84 Mb/s ATM-LAN environment.
II. PRELIMINARIES
In this section, we will review algebraic transformations and relaxed look-ahead pipelining. We start with the description of the strength reduction transformation and its relation to low-power operation. Next, we will illustrate the relaxed look-ahead form of pipelining and present the pipelined LMS algorithm [37].
A. Algebraic Transformations
Algebraic transformations are an important class of architectural level transformations, which have been proposed for high-speed [36] and low-power [5] operation. These transformations rely on the fact that most linear DSP algorithms can be expressed in terms of multiply-add operations. Hence, algebraic transformations such as associativity [6, 36], distributivity [36], common subexpression replication, common subexpression elimination, manifest expression elimination, and commutativity can be employed either to improve the throughput or to reduce the complexity of the algorithm under consideration. In particular, the strength reduction transformation trades off high-complexity multiply operations with low-complexity add operations, thus achieving low-power operation. In this paper, we will consider the strength reduction transformation and its role in achieving low-power operation.
Consider the problem of computing the product of two complex numbers $(a + jb)$ and $(c + jd)$ as shown below:

$$(a + jb)(c + jd) = (ac - bd) + j(ad + bc). \tag{2.1}$$
From (2.1), a direct-mapped architectural implementation would require a total of four real multiplications and two real additions to compute the complex product. However, it is possible to reduce this complexity via strength reduction [4, 5]. Application of strength reduction involves reformulating (2.1) as follows:

$$\begin{aligned}(a - b)d + a(c - d) &= ac - bd,\\ (a - b)d + b(c + d) &= ad + bc.\end{aligned} \tag{2.2}$$

As can be seen from (2.2), the number of real multiplications is three and the number of additions is five. Therefore, this form of strength reduction reduces the number of multipliers by one at the expense of three additional adders. Typically, multiplications are more expensive than additions and hence we achieve an overall savings in hardware.
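The operation counts quoted above can be checked with a minimal Python sketch. The function names are illustrative only, and the three-multiplication form used here, which shares the product (a - b)d between the real and imaginary parts, is one common way of writing the strength-reduced product; it is assumed rather than taken verbatim from the text.

```python
def complex_mult_direct(a, b, c, d):
    """(a + jb)(c + jd) with four real multiplications and two additions."""
    return a * c - b * d, a * d + b * c


def complex_mult_strength_reduced(a, b, c, d):
    """Same product with three real multiplications and five additions."""
    shared = (a - b) * d           # multiplication 1, addition 1
    real = shared + a * (c - d)    # multiplication 2, additions 2 and 3
    imag = shared + b * (c + d)    # multiplication 3, additions 4 and 5
    return real, imag


print(complex_mult_direct(3, 2, 5, 7))            # (1, 31)
print(complex_mult_strength_reduced(3, 2, 5, 7))  # (1, 31)
```

Both functions return the same pair for any inputs; only the mix of multiplications and additions differs.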
Comparing (2.1) and (2.2), we find that the strength reduction transformation also increases the critical path length, where the critical path is defined as the longest path from the input to the output. The critical path computation time of the original system ($T_{c,o}$) and that of the strength-reduced system ($T_{c,sr}$) are given by

$$T_{c,o} = T_m + T_a, \qquad T_{c,sr} = T_m + 2T_a, \tag{2.3}$$

where $T_m$ and $T_a$ are two-operand multiply and add times, respectively. This is a drawback of the strength reduction transformation, which makes it undesirable in the high-speed applications of interest in this paper. However, this problem can be easily solved by employing throughput enhancing techniques such as pipelining. The dynamic power dissipation $P_D$ in CMOS technology is given by

$$P_D = C_L V_{dd}^2 f, \tag{2.4}$$
where $C_L$ is the average capacitance being switched, $V_{dd}$ is the supply voltage and $f$ is the frequency of operation. Most of the existing power reduction techniques involve reducing one or more of the three quantities $C_L$, $V_{dd}$ and $f$. The strength reduction transformation achieves low-power operation by reducing the number of arithmetic operations, which corresponds to a reduction of $C_L$ in (2.4). In order to estimate the power savings due to this transformation, we assume that the effective capacitance of a two-operand multiplier is a factor $K_C$ times that of a two-operand adder. The factor $K_C$ depends upon the relative precisions of the multiplier and the adder and their respective implementation styles. It can be seen from (2.1) and (2.2) that the power savings $PS$ is given by

$$PS = \frac{P_D(\text{original}) - P_D(\text{strength-reduced})}{P_D(\text{original})} \tag{2.5(a)}$$

$$PS = \frac{(4K_C + 2) - (3K_C + 5)}{4K_C + 2} = \frac{K_C - 3}{4K_C + 2}, \tag{2.5(b)}$$

where $P_D(\text{original})$ and $P_D(\text{strength-reduced})$ are the dynamic power dissipation of the original (see (2.1)) and strength-reduced (see (2.2)) algorithms. From (2.5(b)), it is clear that the strength-reduced architecture will achieve power savings as long as $K_C > 3$. Furthermore, it is clear from (2.5) that the power savings approach an asymptotic value of 25% as $K_C$ increases. If we assume array-based multiplier structures, then $K_C$ is approximately equal to $N_B$, where $N_B$ is the number of bits required to represent one input operand. Hence, it would be beneficial to employ the proposed transformation as long as the adders and multipliers have inputs with four or more bits. This is typically the case in the applications of interest, where the required SNR dictates 7-8 bits of input precision. It can be easily checked that the power savings initially increase rapidly as a function of $K_C$, with more than 15% savings obtained with $K_C = 10$.
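The break-even point and the asymptote quoted above follow directly from the operation counts (four multipliers and two adders versus three multipliers and five adders). The short Python sketch below evaluates the savings under the stated assumption that a multiplier switches K_C times the capacitance of an adder; the function name is illustrative.

```python
def power_savings(k_c):
    """Fractional power savings of the strength-reduced complex multiplier."""
    original = 4 * k_c + 2   # four multipliers and two adders
    reduced = 3 * k_c + 5    # three multipliers and five adders
    return (original - reduced) / original


for k_c in (3, 4, 8, 10, 1000):
    print(k_c, round(100 * power_savings(k_c), 2))
```

The savings are zero at K_C = 3, exceed 15% at K_C = 10, and approach 25% for large K_C.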
B. Pipelining with Relaxed Look-ahead
The relaxed look-ahead pipelining technique [37] allows very high sampling rates to be achieved with minimal hardware overhead. As mentioned before, the relaxed look-ahead technique is an approximation to the look-ahead technique [32]. Many approximations (also referred to as relaxations) can be formulated. However, we will consider only the delay and sum relaxations, which have proved to be very effective in pipelining the LMS [39] algorithm.
Consider the first-order recursion

$$w(n) = w(n - 1) + a(n)x(n). \tag{2.6}$$
The computation time of (2.6) is lower bounded by a single add time. Next, we apply an M-step look-ahead to (2.6) to obtain

$$w(n) = w(n - M) + \sum_{i=0}^{M-1} a(n - i)x(n - i). \tag{2.7}$$

This transformation introduces M latches into the recursive loop, which can be retimed [21] to attain M-level pipelining of the add operation. Note that this transformation has not altered the input-output behavior. This invariance with respect to the input-output behavior has been achieved at the expense of the look-ahead overhead term (the second term in (2.7)), which can be expensive. The relaxed look-ahead technique involves approximating architectures such as those described by (2.7), which have been derived via the look-ahead technique. Delay and sum relaxations are two possible approximations, which will be described next. This relaxation can be justified if the average value of the product $a(n)x(n)$ is slowly varying, and simulations for LMS filters indicate this to be a good approximation. In addition to the two relaxations presented above, other relaxations can be defined by approximating the algorithm obtained via application of look-ahead. The application of these relaxations, individually or in different combinations, results in a rich variety of architectures. However, these architectures will have different convergence properties and it is necessary to analyze their convergence behavior.
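The claim that exact look-ahead leaves the input-output behavior unchanged can be illustrated with a small Python sketch. It compares the first-order recursion with an M-step look-ahead form in which w(n) is computed from w(n - M) plus the last M products. The function names and the initial conditions (w equal to w0 and all products equal to zero before n = 0) are assumptions made for this illustration; the relaxations themselves are not modeled here.

```python
def serial_recursion(a, x, w0=0.0):
    """w(n) = w(n-1) + a(n) x(n), computed one step at a time."""
    w, out = w0, []
    for an, xn in zip(a, x):
        w = w + an * xn
        out.append(w)
    return out


def lookahead_recursion(a, x, M, w0=0.0):
    """w(n) = w(n-M) + sum of the last M products a(n-i) x(n-i)."""
    products = [an * xn for an, xn in zip(a, x)]
    out = []
    for n in range(len(products)):
        previous = out[n - M] if n >= M else w0           # M-delay feedback
        overhead = sum(products[n - i] for i in range(M) if n - i >= 0)
        out.append(previous + overhead)
    return out


a = [1, 2, 3, 4, 5, 6, 7]
x = [2, -1, 0, 3, 1, -2, 4]
print(serial_recursion(a, x))
print(lookahead_recursion(a, x, M=3))
```

The two printed sequences are identical. The price is the overhead sum of M products, which is the term that the relaxations approximate.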
The delay and sum relaxations have been employed to pipeline the LMS algorithm [37]. Consider the serial LMS (SLMS) filter described by the following equations:

$$W(n) = W(n - 1) + \mu e(n)X(n), \qquad e(n) = d(n) - W^T(n - 1)X(n), \tag{2.10}$$

where $W(n)$ is the weight vector, $X(n)$ is the input vector, $e(n)$ is the adaptation error, $\mu$ is the step-size, and $d(n)$ is the desired signal. The critical path for the serial LMS [39] is given by

$$T_{c,SLMS} = 2T_m + (N + 1)T_a, \tag{2.11}$$

where N is equal to the number of taps in the filter block (or F-block). The application of relaxed look-ahead requires a subsequent convergence analysis of the pipelined filter. This analysis has been done in [39] and the interested reader is referred to [37, 39] for details. In this paper, we will employ the relaxed look-ahead pipelined LMS filter to obtain pipelined filter architectures. Note that the increased throughput due to pipelining can be employed to 1) meet the speed requirements, 2) reduce power (in combination with power-supply scaling), and 3) reduce area (in combination with the folding transformation [35]).
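For reference, the serial LMS recursion can be sketched in a few lines of Python. The sketch is real-valued and noise-free, and the test channel, step-size, and function names are hypothetical values chosen only to make the example self-contained.

```python
import random


def serial_lms(x, d, num_taps, mu):
    """Serial LMS: e(n) = d(n) - W^T(n-1) X(n); W(n) = W(n-1) + mu e(n) X(n)."""
    w = [0.0] * num_taps
    window = [0.0] * num_taps            # X(n) = [x(n), x(n-1), ..., x(n-N+1)]
    errors = []
    for xn, dn in zip(x, d):
        window = [xn] + window[:-1]
        y = sum(wi * xi for wi, xi in zip(w, window))        # F-block output
        e = dn - y                                           # adaptation error
        w = [wi + mu * e * xi for wi, xi in zip(w, window)]  # WUD-block update
        errors.append(e)
    return w, errors


random.seed(0)
true_w = [0.5, -0.3, 0.2]                # hypothetical system to identify
x = [random.uniform(-1, 1) for _ in range(2000)]
d = [sum(true_w[k] * x[n - k] for k in range(len(true_w)) if n - k >= 0)
     for n in range(len(x))]
w, errors = serial_lms(x, d, num_taps=3, mu=0.1)
print([round(wi, 3) for wi in w])
```

The error e(n) depends on W(n - 1), and W(n) depends on e(n). This feedback loop is what limits the sample rate of the serial filter and motivates pipelining.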
III. ALGEBRAIC TRANSFORMATIONS FOR LOW-POWER
Algebraic transformations have been applied at the architectural level in the past [6, 36]. While the strength reduction transformation in (2.2) can also be applied at the architectural level for increased hardware savings, we propose to apply it at the algorithmic level. It will be seen that the impact on the hardware requirements is much greater when the proposed strength reduction transformation is applied at the algorithmic level. In particular, we will assume that a passband digital communication system such as quadrature amplitude modulation (QAM) [14] or carrierless amplitude/phase (CAP) modulation [45] is being employed. In this situation, the receiver processes a two-dimensional signal with a two-dimensional filter. This results in the traditional cross-coupled structure, which will be the starting point of our work.
A. Traditional Cross-Coupled Equalizer Architecture
The output of the filtering block in an LMS algorithm can be written as

$$y(n) = W^T(n - 1)X(n). \tag{3.1}$$
Clearly, if the input $X(n)$ and the filter $W(n)$ are complex quantities, then we can apply the strength reduction transformation (2.2) to the polynomial multiplication in (3.1) to obtain a low-power architecture. Modulation schemes such as quadrature amplitude modulation (QAM) [14] and CAP [45] employ a two-dimensional signal constellation, which can be represented as a complex signal. If a complex filter is to be implemented, then we can represent its output as a complex polynomial product. Furthermore, if the transformation in (2.2) is employed, then we would need only three real filters (instead of four as in (2.1)).
Each real filter requires N multiplications and N - 1 additions. Therefore, the application of the proposed transformation in (2.2) would save a substantial amount of hardware.
Let the filter input be a complex signal $\tilde{X}(n)$ defined as

$$\tilde{X}(n) = X_r(n) + jX_i(n), \tag{3.2}$$

where $X_r(n)$ and $X_i(n)$ are the real and imaginary parts, respectively. Furthermore, if the filter is also complex, i.e., $\tilde{W}(n) = c(n) + jd(n)$, then its output $\tilde{y}(n)$ can be obtained as follows:

$$\tilde{y}(n) = \tilde{W}^H(n - 1)\tilde{X}(n) = [c^T(n - 1)X_r(n) + d^T(n - 1)X_i(n)] + j[c^T(n - 1)X_i(n) - d^T(n - 1)X_r(n)], \tag{3.3}$$

where $\tilde{W}^H$ represents the Hermitian (transpose and complex conjugate) of the matrix $\tilde{W}$. A direct implementation of (3.3) results in the traditional cross-coupled structure shown in Fig. 1. This structure requires four FIR filters and two output adders, which amounts to 4N - 2 adders and 4N multipliers. If the channel impairments include severe ISI and/or multipath, then the number of taps necessary can be quite large, resulting in high complexity and high power dissipation.
In the adaptive case, a weight-update block (or WUD-block) would be needed to automatically compute the coefficients of the filter. This can be done by implementing a complex version of (2.10) as follows:

$$\tilde{W}(n) = \tilde{W}(n - 1) + \mu \tilde{e}^*(n)\tilde{X}(n), \tag{3.4}$$

where $\tilde{e}(n) = e_r(n) + je_i(n)$, $e_r(n) = Q[y_r(n)] - y_r(n)$, $e_i(n) = Q[y_i(n)] - y_i(n)$, $Q[\cdot]$ is the output of the slicer, and $\tilde{e}^*$ represents the complex conjugate of $\tilde{e}$. Next, we substitute these definitions of $\tilde{W}(n)$, $\tilde{e}(n)$, and $\tilde{X}(n)$ into (3.4) to obtain the following two real update equations:

$$c(n) = c(n - 1) + \mu[e_r(n)X_r(n) + e_i(n)X_i(n)] \tag{3.5(a)}$$

$$d(n) = d(n - 1) + \mu[e_r(n)X_i(n) - e_i(n)X_r(n)] \tag{3.5(b)}$$

The WUD-block architecture for computing (3.5) is shown in Fig. 2. It is clear that the hardware requirements are 4N + 2 adders and 4N multipliers for an N-tap two-dimensional filter. In the next subsection, we will present a low-power adaptive filter architecture using strength reduction.
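The pair of real update equations can be checked against a complex update written with Python's built-in complex type. The sketch assumes the complex update has the form W(n) = W(n-1) + mu * conj(e(n)) * X(n), which is the form consistent with the update given for c(n); the function names and test values are illustrative.

```python
def cross_coupled_update(c, d, e_r, e_i, x_r, x_i, mu):
    """Real-arithmetic weight update: four real products per tap."""
    new_c = [ck + mu * (e_r * xr + e_i * xi) for ck, xr, xi in zip(c, x_r, x_i)]
    new_d = [dk + mu * (e_r * xi - e_i * xr) for dk, xr, xi in zip(d, x_r, x_i)]
    return new_c, new_d


def complex_update(w, e, x, mu):
    """Assumed complex form: W(n) = W(n-1) + mu * conj(e(n)) * X(n)."""
    return [wk + mu * e.conjugate() * xk for wk, xk in zip(w, x)]


c, d = [1.0, -2.0, 0.0], [0.0, 3.0, 1.0]
x_r, x_i = [2.0, 1.0, -1.0], [4.0, -3.0, 2.0]
e_r, e_i, mu = 1.0, -2.0, 0.5

new_c, new_d = cross_coupled_update(c, d, e_r, e_i, x_r, x_i, mu)
w = complex_update([complex(ck, dk) for ck, dk in zip(c, d)],
                   complex(e_r, e_i),
                   [complex(xr, xi) for xr, xi in zip(x_r, x_i)], mu)
print(new_c, new_d)
print(w)
```

The real and imaginary parts of the complex result match new_c and new_d tap by tap.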
B. Low-power Equalizer Architecture via Strength-Reduction
Observing (3.3)-(3.4), it is clear that the strength reduction transformation (2.2) can be applied to the two complex multiplications present in them. We will see that this application of the transformation at the algorithmic level is much more effective in reducing power than an architectural level application. Applying the proposed transformation to (3.3) first, we obtain

$$\tilde{y}(n) = [y_1(n) + y_3(n)] + j[y_2(n) + y_3(n)], \tag{3.6}$$

where

$$y_1(n) = c_1^T(n - 1)X_r(n), \tag{3.7(a)}$$

$$y_2(n) = d_1^T(n - 1)X_i(n), \tag{3.7(b)}$$

$$y_3(n) = -d^T(n - 1)X_1(n), \tag{3.7(c)}$$

where $X_1(n) = X_r(n) - X_i(n)$, $c_1(n) = c(n) + d(n)$, and $d_1(n) = c(n) - d(n)$. The proposed architecture (see Fig. 3) requires three filters and two output adders. This corresponds to 4N adders and 3N multipliers, which is approximately a 25% reduction in the hardware as compared with the traditional structure (see Fig. 1). It therefore represents an attractive alternative from a VLSI perspective.
We now consider the adaptive version and specifically analyze the WUD-block. From (3.7) and the definitions of $c_1(n)$ and $d_1(n)$, adding (3.5(a)) to (3.5(b)) gives the update equation for $c_1(n)$ as follows:

$$c_1(n) = c_1(n - 1) + \mu[e_r(n)(X_r(n) + X_i(n)) - e_i(n)(X_r(n) - X_i(n))]. \tag{3.8}$$

Similarly, subtracting (3.5(b)) from (3.5(a)) gives the update equation for $d_1(n)$:

$$d_1(n) = d_1(n - 1) + \mu[e_r(n)(X_r(n) - X_i(n)) + e_i(n)(X_r(n) + X_i(n))]. \tag{3.9}$$

It is now easy to show that (3.8) and (3.9) can be written in the following complex form:

$$\tilde{W}_1(n) = \tilde{W}_1(n - 1) + \mu \tilde{e}(n)[(X_r(n) + X_i(n)) + j(X_r(n) - X_i(n))], \tag{3.10}$$

where $\tilde{W}_1(n) = c_1(n) + jd_1(n)$. We can now apply the strength reduction transformation to the complex product in (3.10) to obtain a low-power WUD architecture. Doing so results in the following set of equations, which describe the strength-reduced WUD block:

$$\tilde{W}_1(n) = \tilde{W}_1(n - 1) + \mu[eX_1(n) + eX_3(n) + j(eX_2(n) + eX_3(n))], \tag{3.11}$$

where

$$eX_1(n) = 2e_r(n)X_i(n) \tag{3.12(a)}$$

$$eX_2(n) = 2e_i(n)X_r(n) \tag{3.12(b)}$$

$$eX_3(n) = [e_r(n) - e_i(n)][X_r(n) - X_i(n)] = e_1(n)X_1(n), \tag{3.12(c)}$$

where $e_1(n) = e_r(n) - e_i(n)$ and $X_1(n) = X_r(n) - X_i(n)$. The architecture corresponding to (3.11) and (3.12) is shown in Fig. 4. It can be seen that this WUD architecture requires only 3N multipliers and 4N + 3 adders. Thus, the number of multipliers is reduced by one fourth at the expense of one additional adder as compared to the traditional WUD architecture (see Fig. 2).
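A short Python sketch can confirm that the three-product update reproduces the updates of c_1(n) and d_1(n) obtained by adding and subtracting the cross-coupled equations. The reference update for d_1 is derived here as the difference of the c and d updates, and the factor of 2 is treated as a shift rather than a multiplication, which is an assumption consistent with the 3N multiplier count. Function names and test values are illustrative.

```python
def reference_wud(c1, d1, e_r, e_i, x_r, x_i, mu):
    """Updates of c1 = c + d and d1 = c - d written out directly."""
    new_c1 = [c1k + mu * (e_r * (xr + xi) - e_i * (xr - xi))
              for c1k, xr, xi in zip(c1, x_r, x_i)]
    new_d1 = [d1k + mu * (e_r * (xr - xi) + e_i * (xr + xi))
              for d1k, xr, xi in zip(d1, x_r, x_i)]
    return new_c1, new_d1


def strength_reduced_wud(c1, d1, e_r, e_i, x_r, x_i, mu):
    """Same updates with three products per tap: eX1, eX2 and eX3."""
    e1 = e_r - e_i
    new_c1, new_d1 = [], []
    for c1k, d1k, xr, xi in zip(c1, d1, x_r, x_i):
        ex1 = 2 * e_r * xi          # doubling assumed to be a shift
        ex2 = 2 * e_i * xr
        ex3 = e1 * (xr - xi)        # shared product e1(n) X1(n)
        new_c1.append(c1k + mu * (ex1 + ex3))
        new_d1.append(d1k + mu * (ex2 + ex3))
    return new_c1, new_d1


c1, d1 = [5.0, -2.0, 2.0], [-3.0, -2.0, 4.0]
x_r, x_i = [2.0, 5.0, -3.0], [1.0, -4.0, 6.0]
print(reference_wud(c1, d1, 1.0, -2.0, x_r, x_i, 0.5))
print(strength_reduced_wud(c1, d1, 1.0, -2.0, x_r, x_i, 0.5))
```

Both calls print the same pair of weight vectors.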
Combining the architecture for the F-block in Fig. 3 and that for the WUD-block in Fig. 4, we obtain the proposed strength-reduced low-power adaptive filter architecture in Fig. 5. A complete description of the low-power adaptive filter architecture is given by (3.6)-(3.7) and (3.11)-(3.12). In Fig. 5, we show the overall block diagram of the adaptive filter, where the FR-block and WUDR-block compute (3.7(a)) and (3.12(a)), respectively. Similarly, the FI-block and WUDI-block compute (3.7(b)) and (3.12(b)), respectively. Furthermore, the FRI and WUDRI blocks compute (3.7(c)) and (3.12(c)). Note that in Fig. 5, we have separated the slicer and the error computation adders from the WUDR and WUDI blocks. This is done only to depict the error feedback loop clearly. Henceforth, we will refer to the FR, FI, and FRI-blocks as the F-blocks and the WUDR, WUDI and WUDRI-blocks as the WUD-blocks.
C. Comparison
A comparison of the hardware requirements for the traditional cross-coupled structure and the proposed architecture is presented in Table I. It can be seen that for large values of N, the proposed architecture results in a 25% reduction in the number of multipliers at the expense of three additional adders. This reduction in hardware provides the dual benefits of lower area and lower power. The power reduction derives from the fact that the switching capacitance in (2.4) is reduced.
We will now derive a more accurate estimate of the power savings achieved by the proposed low-power filter architecture. In order to do so, it is necessary to take into account the fact that in Figs. 1-5, the precision requirements on the adders in the WUD-blocks (excluding the adders which compute the error) are typically twice that of the rest of the adders. There are 4N such adders in both the traditional and the proposed structure. From Table I, we can see that the traditional architecture will have a switching capacitance which is proportional to $8NK_C + 12N$. On the other hand, the switching capacitance for the proposed architecture can be shown to be proportional to $6NK_C + 12N + 3$. Substituting these values into the definition of $PS$ in (2.5(a)), we can show that the power savings $PS$ due to the proposed filter architecture is given by

$$PS = \frac{2NK_C - 3}{4(2NK_C + 3N)}, \tag{3.13}$$

where $K_C$ is the ratio of the effective capacitance of a two-operand multiplier to that of a two-operand F-block adder. Note that the savings predicted by (3.13) are somewhat optimistic as the effect of latches has not been included. From (3.13), we can see that for large values of $K_C$ and N the power reduction approaches an asymptotic value of 25%. A plot of (3.13) would indicate that most of the power savings would be obtained for values of N approximately equal to 10. Even with typical values of $K_C = 8$ and $N = 32$ in (3.13), the resulting area and power savings equal 21%. Thus, the benefits of the proposed architecture are obtained quite easily.
Applying the strength reduction transformation to the traditional cross-coupled architecture (see Figs. 1-2) at the architectural level is also possible. We will now show that an architectural application of strength reduction is not as effective in reducing power dissipation as the algorithmic application proposed here. We note that in the traditional cross-coupled architecture, there are N complex multipliers and N - 1 complex adders for the F-block and N complex multipliers and N + 1 complex adders (N adders being double precision) for the WUD-block. This gives a total capacitance of $8NK_C + 12N$ times the capacitance of one F-block adder. Now, if we apply strength reduction to each of the complex multiplications in the F-block and WUD-block, we get a switched capacitance equal to $6NK_C + 21N$ times the capacitance of one F-block adder. Substituting the switched capacitances for the two architectures into (2.5(a)), we obtain

$$PS = \frac{2K_C - 9}{8K_C + 12}, \tag{3.14}$$

which is equal to 9.21% for $K_C = 8$. Thus, the application of the transformation at the algorithmic level is much more effective in reducing power as compared to the application at the architectural level.
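The two savings figures quoted in this subsection can be reproduced from the switched-capacitance expressions stated above. The Python sketch below simply evaluates the relative reduction in capacitance for the algorithmic-level and architectural-level applications; the function names are illustrative.

```python
def ps_algorithmic(n, k_c):
    """Savings of the proposed architecture over the cross-coupled one."""
    traditional = 8 * n * k_c + 12 * n
    proposed = 6 * n * k_c + 12 * n + 3
    return (traditional - proposed) / traditional


def ps_architectural(n, k_c):
    """Savings when each complex multiplier is strength-reduced in place."""
    traditional = 8 * n * k_c + 12 * n
    reduced = 6 * n * k_c + 21 * n
    return (traditional - reduced) / traditional


print(round(100 * ps_algorithmic(32, 8), 2))    # 20.93
print(round(100 * ps_architectural(32, 8), 2))  # 9.21
```

With K_C = 8 and N = 32, the algorithmic-level application saves roughly 21%, against about 9.21% for the architectural-level application.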
IV. RELAXED LOOK-AHEAD PIPELINED ARCHITECTURES
In the previous section, the traditional cross-coupled architecture (see Fig. 1 and Fig. 2) and the proposed low-power architecture (see Fig. 5) were described and compared. It was seen that the proposed architecture provides a power savings of approximately 25% over the traditional structure. In the adaptive case, both architectures suffer from a throughput bottleneck due to the error feedback loop. This is clearly seen (Fig. 5) in the case of the strength-reduced architecture. In this section, we propose a solution to this problem by developing a pipelined version of the strength-reduced architecture shown in Fig. 5. We shall see that this process will allow us to trade off area and power with speed, thus achieving further power and area savings.
A. Serial Equalizer Architecture (SEA)
We will refer to the architecture in Fig. 5 as the serial (or unpipelined) filter architecture (SEA), which is also described by (3.6)-(3.7) and (3.11)-(3.12). In order to pipeline the SEA architecture, we will rewrite the equations for SEA as follows:

$$y_1(n) = c_1^T(n - 1)X_r(n) \tag{4.1(a)}$$

$$c_1(n) = c_1(n - 1) + \mu[2e_r(n)X_i(n) + e_1(n)X_1(n)]$$

In the lower bound (4.2) on the SEA sample period $T_{SEA}$, $T_m$ and $T_a$ are two-operand multiply and single-precision add times, respectively. For applications that require large values of N, the lower-bound on $T_{SEA}$ in (4.2) may prevent a feasible implementation. This is particularly true for the high-bit rate communications systems mentioned in the introduction. Note that this problem is also present in the case of the traditional cross-coupled architecture in Figs. 1-2. We present a solution to this problem in the next subsection, where a pipelined filter architecture (PEA) is derived.
B. Pipelined Equalizer Architecture (PEA)
In order to derive the PEA, we will start with the SEA equations (4.1) and then apply relaxed look-ahead. The summations in (4.3(b)) and (4.3(d)) can be realized by computing the product within the summation and then passing it through an FIR filter whose coefficients are all equal to unity. This FIR filter can be realized in an equivalent transpose form. In that case, the computational delay due to the summation would be independent of LA.
However, an overhead of 2N(LA - 1) adders will result.
The input sample-period $T_{PEA}$ depends upon the manner in which the algorithmic delays $D_1$ and $D_2$ have been retimed [21]. Assuming that retiming has been done in a uniform fashion (i.e., all stages have equal computation times), the lower-bound on $T_{PEA}$ is given by , pipelining resulted in an additional two symbol period delay, which was a small fraction of the overall point-to-point delay. This would be true for most of the applications mentioned in the introduction, and hence pipelining is an effective method to enhance the throughput.
C. Convergence Analysis
Convergence analysis of the pipelined strength-reduced architecture can be done in a fashion similar to that of the pipelined LMS [36]. For the sake of mathematical tractability, we have analyzed a special case of the proposed architecture where LA = 1 and $D_1$ is a multiple of $D_2$, i.e., $D_1 = KD_2$. For the details of the convergence analysis, the reader is referred to [36]. We will present only the final results in this subsection.
The following definitions are necessary before presenting the analytical expressions. In (4.8), the linear term in b dominates the quadratic term. Hence, the misadjustment increases with K. In section VI, we will see that the misadjustment does not change substantially as K varies and therefore can be considered to be approximately constant.
D. Power-Reduction
As mentioned in section II(B), pipelining along with power-supply voltage reduction has been proposed [5] as a technique for reducing the power dissipation. In CMOS technology, scaling the power-supply by a factor $K_V$ can be shown to reduce the speed of operation by the same factor, especially for small values of $K_V$. However, pipelined architectures can easily compensate for this loss in speed by the reduction of the critical path length. Hence, some of the increase in throughput due to pipelining can be traded off with power reduction. An implicit assumption in this approach to low-power operation is the requirement that the pipelining overhead be minimal. As was seen in this section, relaxed look-ahead pipelining results in an overhead of 2N(LA - 1) adders and $5D_1 + 2D_2$ latches (without retiming). This implies that the average switching capacitance in (2.4) would increase. Employing the fact that these additional adders are double-precision, we get the power savings $PS$ with respect to the cross-coupled architecture as follows:
where $K_L$ is the ratio of the effective capacitance of a 1-b latch to that of a 1-b adder, and $K_V > 1$ is the factor by which the power-supply is scaled. Employing typical values of $K_V = 5\,\text{V}/3.3\,\text{V}$, $K_C = 8$, $K_L = 1/3$, $N = 32$, $D_1 = 48$, $D_2 = 2$ and $LA = 3$ in (4.9), we obtain a total power savings of approximately 60% over the traditional cross-coupled architecture. Clearly, 21% of the power savings are obtained from the strength reduction transformation, while the rest (39%) is due to power-supply scaling. Note that this increased power savings is achieved in spite of the additional 2N(LA - 1) adders required due to relaxed look-ahead pipelining.
Based upon the transistor threshold voltages, it has been shown in [5] that values of $K_V = 3$ are possible with present CMOS technology. With this value of $K_V$, (4.9) predicts a power savings of 90%, which is a significant reduction. Thus, a judicious application of algebraic transformations (strength reduction), algorithm transformations (pipelining) and power-supply scaling can result in substantial power reduction.
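The order of magnitude of these savings can be explored with a rough Python sketch. It does not reproduce (4.9); it assumes a simple capacitance model in which power is proportional to switched capacitance divided by the square of K_V (following (2.4) with the sample rate held fixed), each of the 2N(LA - 1) extra double-precision adders counts as two F-block adders, and each of the 5 D_1 + 2 D_2 latches contributes K_L. The function name and this model are assumptions made for illustration only.

```python
def power_savings_with_scaling(n, k_c, k_l, k_v, d1, d2, la):
    """Assumed model: savings of the pipelined, voltage-scaled architecture."""
    cross_coupled = 8 * n * k_c + 12 * n
    strength_reduced = 6 * n * k_c + 12 * n + 3
    extra_adders = 2 * (2 * n * (la - 1))     # double-precision adders
    latches = k_l * (5 * d1 + 2 * d2)         # pipelining latches
    pipelined = strength_reduced + extra_adders + latches
    return 1.0 - pipelined / (k_v ** 2 * cross_coupled)


typical = dict(n=32, k_c=8, k_l=1 / 3, d1=48, d2=2, la=3)
print(round(100 * power_savings_with_scaling(k_v=5 / 3.3, **typical)))  # 60
print(round(100 * power_savings_with_scaling(k_v=3.0, **typical)))      # 90
```

Under these assumptions the sketch gives values close to the 60% and 90% figures quoted above, which suggests that the supply-voltage term dominates the pipelining overhead.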
V. SIMULATION RESULTS
In this section, we present simulation results for verifying the performance of the proposed adaptive filter architecture. In Experiment A, we employ the proposed architecture in a system identification set-up in order to verify (4.6) and (4.8). In Experiment B, we demonstrate the use of sum relaxation to improve the convergence speed as the level of pipelining increases.
A. Experiment A
In this experiment, we employ a system identification set-up in order to verify (4.6) and (4.8). Such a set-up emulates those communications systems which employ echo cancellers or near-end cross talk (NEXT) cancellers. The system to be identified was a 50th-order complex FIR filter and the proposed adaptive filter had N = 24 complex taps. The results were averaged over 50 independent trials and are shown in Table II for values of $D_2 = 2$ and $D_2 = 4$. The theoretical and measured values of $M$ and $\mu_{max}$ (the maximum value of $\mu$ for convergence) are tabulated in Table II.
From Table II, we notice that the theoretical and measured values of $M$ and $\mu_{max}$ match closely and have the same trend as the level of pipelining increases. As predicted by (4.6) and (4.8), we can observe a slight degradation in $M$ and a decrease in $\mu_{max}$ as K is increased.
B. Experiment B
Consider the system identification set-up of Experiment A with N = 32. Further assume that $T_m = 20$ ns and $T_a = 10.26$ ns, where $T_m$ and $T_a$ were defined in section IV as the computation times of two-operand multiply and single-precision add operations, respectively. From (4.2), we obtain the clock period of the serial architecture (see Fig. 5) as $T_{SEA} = 440$ ns. If the application demands a sample period of 4 ns, then the serial architecture cannot meet this throughput rate. Hence, we need to employ the pipelined architecture in Fig. 6 with a speed-up of 110. In particular, the double precision adders in the WUD blocks need to be 5,1) ). It is clear that the pipelined architecture has a slower convergence time. However, by employing sum relaxation with LA = 2, we obtain the third MSE plot ((109,5,2)) in Fig. 7, where the convergence speed is now significantly improved. Note that the sum relaxed architecture has a higher value of $D_1 = 109$. This is due to the fact that with LA = 2, the critical path time of the pipelined architecture is increased (see (4.4)) and therefore needs to be accounted for.
VI. APPLICATION TO 51.84 Mb/s ATM-LAN
In this section, we will study the performance of the proposed low-power adaptive lter architecture in a high-speed digital communications system. In particular, we will employ the proposed architecture as an equalizer in a CAP-QAM modulation scheme for a data-rate of 51:84 Mb/s over 100 meters of unshielded twisted-pair (UTP3) wiring. As mentioned before, 16-CAP is currently the line-code of choice 18] in this application.
While the standard does specify the line code to be 16-CAP, there is a lot of exibility in deciding the transmitter and receiver structures. Hence, in this section we have assumed a CAP transmitter and a QAM receiver. Our only reason for choosing QAM at the receiver is to be able to apply the proposed architecture to the QAM equalizer, which has been traditionally implemented as a cross-coupled structure. The overall communication link is shown in Fig. 8 , where the input bit-stream is accepted by a CAP transmitter and transformed into a format suitable for transmission over the channel. In addition to attenuation and dispersion, the received signal has near-end cross talk (NEXT) superimposed upon it. The NEXT impairment occurs due to electromagnetic coupling between the local transmitted signal and the received signal. This coupling is caused by the physical proximity of the wire pairs for transmission in the two directions. The models for NEXT can be obtained from 45] .
Except for the essentials, we will skip most of the details regarding CAP [45] and QAM [14]. In section VI(A), we briefly describe the CAP transmitter, and then the QAM receiver is described in section VI(B). Finally, the simulation results are presented in section VI(C).
A. The CAP Transmitter
The block diagram of a digital CAP transmitter is shown in Fig. 9. The bit stream to be transmitted is first passed through a scrambler. The scrambled bits are then fed into an encoder, which maps blocks of $m$ bits onto one of $k = 2^m$ different complex symbols $a(n) = a_r(n) + j a_i(n)$ for a $k$-CAP line code. In this study we have employed $k = 16$. The symbols $a_r(n)$ and $a_i(n)$ are processed by digital shaping filters. This requires that the shaping filters be operated at a sampling frequency $f_s$, which is at least twice the maximum frequency component of the transmit spectrum. The outputs of the filters are subtracted and the result is passed through a digital-to-analog (D/A) converter, which is followed by an interpolating low-pass filter (LPF). It can be seen that most of the signal processing at the transmitter (including transmit shaping) is done in the digital domain, which permits a robust VLSI implementation.
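As an illustration of the encoder step, the following Python sketch maps blocks of $m = 4$ bits onto $k = 16$ complex symbols. The text does not give the bit-to-symbol mapping, so the four-level alphabet $\{-3, -1, 1, 3\}$ per dimension and the Gray labelling used here are assumptions, and the names `encode_16cap` and `gray_to_level` are hypothetical.

```python
LEVELS = (-3, -1, 1, 3)  # assumed 4-level alphabet per dimension

# Assumed Gray labelling: adjacent levels differ in exactly one bit.
GRAY_INDEX = {(0, 0): 0, (0, 1): 1, (1, 1): 2, (1, 0): 3}


def gray_to_level(b1: int, b0: int) -> int:
    """Map two bits onto one of the four assumed amplitude levels."""
    return LEVELS[GRAY_INDEX[(b1, b0)]]


def encode_16cap(bits):
    """Group bits into blocks of m = 4 and return the complex symbols
    a(n) = a_r(n) + j*a_i(n), one per block."""
    if len(bits) % 4 != 0:
        raise ValueError("bit count must be a multiple of 4")
    symbols = []
    for i in range(0, len(bits), 4):
        b3, b2, b1, b0 = bits[i:i + 4]
        # First two bits select a_r(n), last two select a_i(n).
        symbols.append(complex(gray_to_level(b3, b2), gray_to_level(b1, b0)))
    return symbols


demo_symbols = encode_16cap([0, 0, 0, 0, 1, 0, 0, 1])
```

With this labelling the 16 possible 4-bit blocks map onto 16 distinct constellation points; any other one-to-one mapping would serve the same purpose in the sketch.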
The signal at the output of the CAP transmitter (see Fig. 9) can be written as

$$s(t) = \sum_{n=-\infty}^{\infty} \left[ a_r(n)\, p(t - nT) - a_i(n)\, \tilde{p}(t - nT) \right], \qquad (6.1)$$

where $T$ is the symbol period, $a_r(n)$ and $a_i(n)$ are discrete multilevel symbols, which are sent in symbol period $nT$, and $p(t)$ and $\tilde{p}(t)$ are the impulse responses of in-phase and quadrature passband shaping filters, respectively. The passband pulses $p(t)$ and $\tilde{p}(t)$ in (6.1) are defined as

$$p(t) \triangleq g(t)\cos(2\pi f_c t), \qquad \tilde{p}(t) \triangleq g(t)\sin(2\pi f_c t), \qquad (6.2)$$

where $g(t)$ is a baseband pulse and $f_c$ is a frequency that is larger than the largest frequency component in $g(t)$. The two impulse responses in (6.2) form a Hilbert pair, i.e., their Fourier transforms have the same amplitude characteristics, while their phase characteristics differ by $90^\circ$. Typically, the baseband pulse $g(t)$ is a square-root raised cosine [14] pulse. The output spectrum is broadband with a bandwidth of 25.92 MHz (see Fig. 10(a)). While larger bandwidths are possible, the FCC Class B requirements restrict the signal energy to below 30 MHz. The bit rate of 51.84 Mb/s and the 16-CAP signal constellation (4 bits per symbol) imply a symbol rate of 12.96 Mbaud. Hence, the chosen transmit spectrum, whose bandwidth is twice the symbol rate, has 100% excess bandwidth as shown in Fig. 10(a). The received spectrum at the output of the channel (see Fig. 10(b)) indicates the extent of propagation loss.
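A discrete-time reading of the transmitter output may help here. The Python sketch below samples the passband pulses of (6.2) and forms the output by weighting shifted copies of the in-phase pulse with $a_r(n)$, weighting shifted copies of the quadrature pulse with $a_i(n)$, and subtracting the two, as described for Fig. 9. The function names, the normalized values of the symbol period, $f_c$ and $f_s$, and the triangular placeholder used for $g(t)$ are all assumptions made for illustration; the text specifies a square-root raised cosine pulse and does not give $f_c$.

```python
import math


def passband_pulses(g, fc, fs, half_len):
    """Sample p(t) = g(t)cos(2*pi*fc*t) and p~(t) = g(t)sin(2*pi*fc*t)
    at rate fs for t = k/fs, k = -half_len .. half_len."""
    p, p_tilde = [], []
    for k in range(-half_len, half_len + 1):
        t = k / fs
        p.append(g(t) * math.cos(2 * math.pi * fc * t))
        p_tilde.append(g(t) * math.sin(2 * math.pi * fc * t))
    return p, p_tilde


def cap_modulate(symbols, p, p_tilde, samples_per_symbol):
    """Superimpose a_r(n)*p - a_i(n)*p~ for each symbol, with the n-th
    contribution delayed by n symbol periods."""
    if not symbols:
        return []
    out = [0.0] * ((len(symbols) - 1) * samples_per_symbol + len(p))
    for n, a in enumerate(symbols):
        start = n * samples_per_symbol
        for k in range(len(p)):
            out[start + k] += a.real * p[k] - a.imag * p_tilde[k]
    return out


def g_demo(t):
    """Placeholder baseband pulse (triangular, T = 1); NOT the
    square-root raised cosine pulse mentioned in the text."""
    return max(0.0, 1.0 - abs(t))


# Normalized, hypothetical parameters: T = 1, fc = 1, fs = 4 samples/symbol.
p, p_tilde = passband_pulses(g_demo, fc=1.0, fs=4.0, half_len=4)
s = cap_modulate([complex(1, -3), complex(-1, 3)], p, p_tilde, 4)
```

Swapping `g_demo` for a properly truncated square-root raised cosine pulse leaves `cap_modulate` unchanged, since the shaping enters only through the sampled pulses.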
B. The QAM Receiver
The QAM receiver in Fig. 11 first demodulates the received signal (which is sampled at 51.84 Msamples/s) such that the signal at the output of the low-pass filters (LPF) has energy from DC to 12.96 MHz. This allows us to downsample the LPF output by a factor of two. The resulting complex signal can then be filtered via the traditional cross-coupled architecture (Figs. 1 and 2) or the proposed architecture (Fig. 5) operating at 25.92 Msamples/s. The equalizer outputs are sampled at the symbol rate of 12.96 MHz and then sliced to generate the detected symbols. The error across the slicer is employed to adapt the equalizer coefficients once every symbol period. The detected symbols are also decoded to generate the received bit-stream. The values of the step-size employed in the simulations were deliberately made powers of two so that the hardware implementation requires only shift-right operations. In particular, we employed gearshifting, whereby the step-size was halved after the first 120,000 symbols and again after 240,000 symbols. Furthermore, the receive equalizer was chosen to have a span of 32 symbol periods so as to obtain a noise margin of about 2 dB. For simplicity, we have assumed that a training sequence is available for equalizer training.
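The gearshifting schedule and the slicer can be sketched in a few lines of Python. Because the step-size is a power of two, it is represented here by its right-shift amount, and each halving adds one to that shift. The initial shift of 6, the constellation levels $\{-3, -1, 1, 3\}$, and the function names are assumptions; the text gives only the symbol counts at which the step-size is halved.

```python
QAM_LEVELS = (-3, -1, 1, 3)  # assumed 16-QAM levels per dimension


def step_shift(symbol_index: int, initial_shift: int) -> int:
    """Right-shift amount s such that the step-size equals 2**-s.
    The step-size is halved (shift + 1) after the first 120,000
    symbols and again after 240,000 symbols."""
    shift = initial_shift
    if symbol_index >= 120_000:
        shift += 1
    if symbol_index >= 240_000:
        shift += 1
    return shift


def slice_level(x: float) -> int:
    """Nearest constellation level in one dimension."""
    return min(QAM_LEVELS, key=lambda level: abs(x - level))


def slice_16qam(y: complex) -> complex:
    """Detected symbol for equalizer output y."""
    return complex(slice_level(y.real), slice_level(y.imag))


INITIAL_SHIFT = 6  # hypothetical starting step-size of 2**-6

y = complex(0.9, -2.4)           # example equalizer output
decision = slice_16qam(y)        # detected symbol
error = decision - y             # error across the slicer (sign is a convention)
mu = 2.0 ** -step_shift(130_000, INITIAL_SHIFT)  # step-size after one halving
```

In a fixed-point datapath the multiplication by `mu` would be replaced by an arithmetic right shift of the error term, which is the motivation given in the text for restricting the step-size to powers of two.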
Based upon the convergence analysis results of section IV(C), we can conclude that $SNR_o$ will degrade as the level of pipelining is increased. In Fig. 12, we plot $SNR_o$ with respect to the speed-up, where the speed-up is defined as the ratio of $T_{SEA}$ (see (4.2)) to $T_{PEA}$ (see (4.4)). It is clear from Fig. 12 that the SNR degrades by less than 0.8 dB for speed-ups of up to 156. Speed-ups of up to 50 or 60 may suffice for most applications. In that case, the loss in performance is less than 0.27 dB. Thus, the proposed architecture allows substantial speed-ups with negligible performance loss.
For all cases, $SNR_o$ increases to 20 dB within 4 ms (approximately 50,000 symbols). This is indicated in Fig. 13, where the convergence plots for the serial and pipelined (speed-up of 156) cases are shown. Also worth noting is the fact that the indicated values of SNR were achieved by running the simulation for 360,000 symbols, which corresponds to approximately 28 ms. For the application at hand, a convergence speed of a few hundred milliseconds is deemed acceptable.
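The convergence times quoted above follow from dividing the symbol counts by the symbol rate of 12.96 Mbaud. The Python sketch below reproduces that conversion; the function name `symbols_to_ms` is hypothetical.

```python
SYMBOL_RATE_BAUD = 12.96e6  # symbol rate quoted in the text


def symbols_to_ms(num_symbols: int, symbol_rate: float = SYMBOL_RATE_BAUD) -> float:
    """Elapsed time in milliseconds for a given number of symbols."""
    if symbol_rate <= 0:
        raise ValueError("symbol rate must be positive")
    return 1e3 * num_symbols / symbol_rate


t_converge_ms = symbols_to_ms(50_000)    # roughly 3.9 ms, i.e. within 4 ms
t_simulated_ms = symbols_to_ms(360_000)  # roughly 27.8 ms, i.e. about 28 ms
```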
Thus, we conclude that the proposed architecture is a viable alternative for QAM-based receivers, especially in an ATM-LAN environment.
VII. CONCLUSIONS
Application of the strength reduction transformation [4, 5] at the algorithmic level (as opposed to the architectural level) has resulted in a low-power complex adaptive filter architecture. Power and area savings of approximately 21% were shown to be achievable. Relaxed look-ahead [37] pipelined architectures were then developed for achieving high-speed operation. An additional 39% power savings was achieved by scaling down the power supply.
It must be mentioned that the low-power architecture presented in this paper is applicable to any communication system that employs a two-dimensional signal constellation. While we have demonstrated the application of the proposed architecture for 51.84 Mb/s ATM-LAN, numerous other applications exist. Our current research is being directed towards the design of low-power NEXT cancellers and adaptive DFEs for higher speed digital subscriber loops (such as 100-155 Mb/s). Future extensions of this work include the study of finite-precision effects of the proposed architecture and eventually an integrated circuit implementation. Development of optimal folding strategies is also a future goal in order to trade off area with speed so that power, area and speed optimal systems can be implemented.

Fig. 1 Traditional cross-coupled filter architecture.
Fig. 2 Traditional weight-update block architecture.
Fig. 3 Proposed filter architecture obtained via strength reduction.
Fig. 4 Proposed weight-update block architecture obtained via strength reduction.
Fig. 5 Block diagram of the proposed adaptive filter architecture.
Fig. 6 Block diagram of the relaxed look-ahead pipelined adaptive filter architecture.
Fig. 7 Sum relaxation.
Fig. 8 Communication system block diagram for 51.84 Mb/s ATM-LAN.
Fig. 9 The CAP transmitter.
Fig. 10 Signal spectrum: (a) transmitter output and (b) channel output.
Fig. 11 The QAM receiver.
Fig. 12 $SNR_o$ vs. speed-up.
Fig. 13 Convergence curves for the error across the slicer.
