Abstract-In this correspondence, we compare the finite-precision requirements of the traditional cross-coupled (CC) architecture and a low-power strength-reduced (SR) architecture. It is shown that the filter block (F block) coefficients in the SR architecture require 0.3 bits more than the corresponding block in the CC architecture. Similarly, the weight-update (WUD) block in the SR architecture is shown to require 0.5 bits fewer than the corresponding block in the CC architecture. This finite-precision architecture is then used as a near-end crosstalk (NEXT) canceller for 155.52 Mb/s ATM-LAN over unshielded twisted pair (UTP) category-3 cable. Simulation results are presented in support of the analysis.
I. INTRODUCTION
Strength reduction is an algebraic transformation that has been proposed [4] to trade off multipliers with adders in a complex multiplication, thereby achieving power reduction. In [7], we proposed the application of the strength reduction transformation at the algorithmic level to adaptive systems involving complex signals and filters. It was shown in [7] that the strength-reduced (SR) filter enables power savings of 21-25% over the traditional cross-coupled (CC) filter with no loss in performance. However, the application of strength reduction increases the critical path, and, hence, an inherently pipelined SR (PIPSR) architecture was also presented. Furthermore, by trading the throughput gained through pipelining with power supply scaling [4], it was demonstrated that additional power savings of 40-69% are feasible. In this correspondence, we compare the finite-precision requirements of the SR and the PIPSR architectures developed in [7] with those of the CC architecture. It is shown that the precision requirements of the SR and PIPSR architectures are similar to those of the CC architecture. This makes the SR and the PIPSR architectures attractive alternatives to the traditional CC architecture for high bit-rate communications and digital signal processing applications.
Manuscript received November 12, 1996; revised December 11, 1997. This work was supported by the NSF CAREER Award MIP-9623737. The associate editor coordinating the review of this paper and approving it for publication was Dr. Konstantin Konstantinides.
The authors are with the Coordinated Science Laboratory and Electrical and Computer Engineering Department, University of Illinois at Urbana-Champaign, Urbana, IL 61801 USA (e-mail: mgoel@uivlsi.csl.uiuc.edu; shanbhag@uivlsi.csl.uiuc.edu).
Publisher Item Identifier S 1053-587X(98)03944-0.
In this correspondence, a linear model is employed for coefficient quantization noise. The filter (F) block precision $B_F$ is chosen such that the signal-to-quantization-noise ratio (SQNR) is greater than the desired signal-to-noise ratio $\mathrm{SNR}_o$. The analyses in [1] and [2] can be employed to provide a tighter bound on the weight-update block precision $B_{WUD}$. However, the purpose of this paper is to compare the precision requirements of the CC and SR architectures, and hence, we employ the analysis in [3]. This analysis provides useful design guidelines for applications such as those in digital subscriber loops, where the final step sizes are reasonably large.
We demonstrate an application of the finite-precision SR architecture as a near-end crosstalk (NEXT) canceller for 155.52 Mb/s [6] ATM-LAN over 100 m of unshielded twisted pair category-3 (UTP-3) cable employing a 64-CAP (carrierless amplitude/phase) modulation scheme. We present the simulation results for this application in order to determine the precision requirements of various signals and to support the analytical results presented in the correspondence.
The organization of the paper is as follows. In Section II, we present the PIPSR adaptive filter architecture. In Section III, we determine the finite-precision requirements of the CC, SR, and PIPSR architectures. Finally, in Section IV, the finite-precision architectures are employed as a near-end crosstalk (NEXT) canceller for 155.52 Mb/s ATM-LAN.
II. A PIPELINED STRENGTH-REDUCED (PIPSR) ADAPTIVE FILTER
In this section, we review the strength reduction transformation and the development of the PIPSR architecture [7] from the CC architecture.
The product of two complex numbers $(a + jb)$ and $(c + jd)$ is given by
$$(a + jb)(c + jd) = (ac - bd) + j(ad + bc).$$
1053-587X/98$10.00 © 1998 IEEE
A direct-mapped architectural implementation would require a total of four real multiplications and two real additions to compute the complex product. Application of strength reduction involves reformulating the above multiplication as
where we see that strength reduction reduces the number of multipliers by one at the expense of three additional adders. Typically, multiplications are more expensive than additions, and hence, we achieve an overall savings in hardware. We now present the SR and the PIPSR architectures.
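As an illustration, the following Python sketch assumes the common three-multiplier form of the reformulation, $(a - b)d + a(c - d) = ac - bd$ and $(a - b)d + b(c + d) = ad + bc$, which agrees with the operation count stated above (three real multiplications and five real additions). The function names are hypothetical and are not taken from [4] or [7].

```python
def complex_mul_direct(a, b, c, d):
    """(a + jb)(c + jd) using four real multiplications and two additions."""
    return a * c - b * d, a * d + b * c


def complex_mul_strength_reduced(a, b, c, d):
    """Same product using three real multiplications and five additions.

    Assumed form: the term (a - b) * d is computed once and shared by the
    real and imaginary parts.
    """
    shared = (a - b) * d            # multiplication 1, addition 1
    real = shared + a * (c - d)     # multiplication 2, additions 2 and 3
    imag = shared + b * (c + d)     # multiplication 3, additions 4 and 5
    return real, imag


if __name__ == "__main__":
    print(complex_mul_direct(1, 2, 3, 4))
    print(complex_mul_strength_reduced(1, 2, 3, 4))
```

Both functions return the same (real, imaginary) pair for any real inputs; only the operator mix differs.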
A. Strength-Reduced (SR) Architecture
The SR architecture [7] is obtained by applying the strength reduction transformation at the algorithmic level instead of at the multiply-add level. Assume an $N$-tap adaptive filter implementing a complex LMS algorithm. Assume that the filter input is a complex signal $X(n)$ given by $X(n) = X_r(n) + jX_i(n)$, where $X_r(n)$ and $X_i(n)$ are the real and the imaginary parts of the input signal vector $X(n)$. Furthermore, if the filter $W(n)$ is also complex ($W(n) = c(n) + jd(n)$), then the complex LMS algorithm is given by
where $\mu$ is the step size.
In addition, $e^*(n)$ represents the complex conjugate of the signal $e(n)$, and $W^H(n)$ represents the Hermitian (complex conjugate transpose) of $W(n)$.
From (2.2), we see that there are two complex inner products involved. Traditionally, the complex LMS algorithm is implemented via the CC architecture, which is described by
where $e(n) = e_r(n) + je_i(n)$, and the F-block output is given by $y(n) = y_r(n) + jy_i(n)$. Equations (2.3a)-(2.3b) and (2.3c)-(2.3d) define the computations in the F-block and the WUD-block, respectively. A direct-mapped implementation of (2.3) would require $8N$ multipliers and $8N$ adders for power-of-two step sizes.
We see that (2.2) has two complex inner products and hence can benefit from the application of strength reduction. Doing so results in the following equations, which describe the F-block computations of the SR architecture [7]. We have
and
$$y_r(n) = y_1(n) + y_3(n), \qquad y_i(n) = y_2(n) + y_3(n) \tag{2.4b}$$
where $X_1(n) = X_r(n) - X_i(n)$, $c_1(n) = c(n) + d(n)$, and $d_1(n) = c(n) - d(n)$. Similarly, the WUD computation is described by
where $eX_1(n) = 2e_r(n)X_i(n)$, $eX_2(n) = 2e_i(n)X_r(n)$, $eX_3(n) = e_1(n)X_1(n)$, $e_1(n) = e_r(n) - e_i(n)$, and $X_1(n) = X_r(n) - X_i(n)$. It is easy to show that the SR architecture (see Fig. 1) requires only $6N$ multipliers and $8N + 3$ adders for power-of-two step sizes. This is the reason why the SR architecture results in 21-25% power savings [7] over the CC architecture.
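The equivalence between the two architectures can be checked numerically. The Python sketch below makes several assumptions that are not stated explicitly in the text: the CC F-block computes $y = W^H X$, i.e., $y_r = c^T X_r + d^T X_i$ and $y_i = c^T X_i - d^T X_r$; the SR partial outputs are $y_1 = c_1^T X_r$, $y_2 = d_1^T X_i$, and $y_3 = -d^T X_1$; and the SR correction terms $eX_1 + eX_3$ and $eX_2 + eX_3$ update $c_1$ and $d_1$, respectively. These forms are chosen because they are consistent with (2.4b) and with the definitions given above; all function names are hypothetical.

```python
def dot(u, v):
    """Real inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def cc_f_block(c, d, xr, xi):
    """Assumed CC F-block: real and imaginary parts of W^H X, W = c + jd."""
    return dot(c, xr) + dot(d, xi), dot(c, xi) - dot(d, xr)


def sr_f_block(c1, d1, d, xr, xi):
    """Assumed SR F-block: three real inner products instead of four."""
    x1 = [r - i for r, i in zip(xr, xi)]   # X1 = Xr - Xi
    y1 = dot(c1, xr)
    y2 = dot(d1, xi)
    y3 = -dot(d, x1)                        # shared term
    return y1 + y3, y2 + y3                 # (2.4b)


def cc_wud_terms(er, ei, xr, xi):
    """Assumed CC correction terms for c and d (step size omitted)."""
    p = [er * r + ei * i for r, i in zip(xr, xi)]
    q = [er * i - ei * r for r, i in zip(xr, xi)]
    return p, q


def sr_wud_terms(er, ei, xr, xi):
    """SR correction terms eX1 + eX3 and eX2 + eX3 (step size omitted)."""
    e1 = er - ei
    ex1 = [2 * er * i for i in xi]
    ex2 = [2 * ei * r for r in xr]
    ex3 = [e1 * (r - i) for r, i in zip(xr, xi)]
    return ([a + b for a, b in zip(ex1, ex3)],
            [a + b for a, b in zip(ex2, ex3)])


if __name__ == "__main__":
    c, d = [1, 2], [3, -1]
    xr, xi = [2, 5], [-1, 4]
    c1 = [a + b for a, b in zip(c, d)]      # c1 = c + d
    d1 = [a - b for a, b in zip(c, d)]      # d1 = c - d
    print(cc_f_block(c, d, xr, xi), sr_f_block(c1, d1, d, xr, xi))
    print(cc_wud_terms(2, 3, xr, xi), sr_wud_terms(2, 3, xr, xi))
```

Under these assumptions the SR correction terms equal the sum and the difference of the CC correction terms, which is what updating $c_1 = c + d$ and $d_1 = c - d$ requires.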
B. Pipelined Strength-Reduced (PIPSR) Architecture
The dotted line in Fig. 1 indicates the critical path of the SR architecture. As explained in [7], both the SR and CC architectures are bounded by a maximum possible clock rate due to the computations in this critical path. This throughput limitation is eliminated via the application of the relaxed look-ahead transformation [8] to the SR architecture [see (2.4) and (2.5)]. Application of relaxed look-ahead to the SR architecture in (2.4) and (2.5) results in the following equations that describe the F-block computations in the PIPSR architecture.
where $D_2$ is the number of delays introduced before feeding the filter coefficients into the F-block. Similarly, the computations of the WUD block of the PIPSR architecture are given by
where $eX_1(n)$, $eX_2(n)$, and $eX_3(n)$ are defined in the previous subsection, $D_1 \ge 0$ are the delays introduced into the error feedback loop, and $0 < LA \le D_2$ indicates the number of terms considered in the sum-relaxation. A block-level implementation of the PIPSR architecture is shown in Fig. 2, where the $D_1$ and $D_2$ delays will be employed to pipeline the various operators such as adders and multipliers at a fine-grain level. The high throughput of the PIPSR architecture can be traded off with supply voltage reduction, resulting in additional power savings [7] of 40-69%. Therefore, the PIPSR architecture results in 60-90% power savings as compared with the serial CC architecture.
III. FINITE-PRECISION REQUIREMENTS
In this section, we present a comparison of the precision requirements of the CC and SR architectures. We employ linear models [3] for the quantization noise. Further, the F-block coefficient precision $B_F$ is determined by treating the F-block as a constant-coefficient FIR filter and choosing $J_Q \ll J$, where $J_Q$ is the mean squared quantization error, and $J$ is the output mean squared error (MSE) of the floating-point algorithm. The condition $J_Q \ll J$ guarantees that, in the case of an equalizer, the bit error rates (BERs) of the fixed-point and floating-point receivers are close to each other.
The stopping criterion [3] is used to determine the WUD-block coefficient precision $B_{WUD}$. The stopping criterion is based on the fact that the filter will stop adapting if the correction term ($\mu e(n)x(n)$ in a real LMS adaptive filter) drops below $\mathrm{LSB}/2$, where LSB is the least magnitude representable by the chosen precision. The precision assigned should be sufficient for the adaptive filter to converge to the specified MSE $J_o$.
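The stalling behavior behind the stopping criterion can be illustrated with a minimal Python sketch. It assumes a $B$-bit fixed-point fraction including the sign bit, so that $\mathrm{LSB} = 2^{-(B-1)}$ and $\mathrm{LSB}/2 = 2^{-B}$, with round-to-nearest quantization and no saturation handling. The helper names are hypothetical, and the sketch is not the analysis of [3].

```python
def quantize(value, bits):
    """Round to the nearest multiple of LSB = 2**-(bits - 1).

    Assumes a fractional fixed-point format with the sign bit included in
    `bits`; saturation is not modeled.
    """
    lsb = 2.0 ** -(bits - 1)
    return round(value / lsb) * lsb


def lms_update_stalls(weight, mu, error, x, bits):
    """True if the quantized real-LMS update leaves the weight unchanged."""
    current = quantize(weight, bits)
    updated = quantize(current + mu * error * x, bits)
    return updated == current


if __name__ == "__main__":
    # Correction term mu * e * x = 2**-12 in both cases.
    print(lms_update_stalls(0.25, 2.0 ** -10, 0.5, 0.5, bits=8))
    print(lms_update_stalls(0.25, 2.0 ** -10, 0.5, 0.5, bits=14))
```

With 8 bits the correction term $2^{-12}$ lies below $\mathrm{LSB}/2 = 2^{-8}$ and the weight no longer moves; with 14 bits the same correction term spans two LSBs and adaptation continues.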
A. F-Block Precision
Define $B_{x,y}$ to be the coefficient precision (including the sign bit) in the $x$ block of the $y$ architecture. Let $N$ be the number of taps in the adaptive filter.
In addition, let $J$ be the infinite-precision MSE (the IEEE 754 floating-point format offers resolution up to $10^{-37}$ and can be safely treated as infinite precision). If $\sigma_d^2$ is the power of the symbol constellation (or the desired signal), the output SNR is given by $\sigma_d^2/J$. Now, we determine the quantization error due to the finite-precision implementation of the F-block. The additional error due to the finite-precision F-block implementation is given by $E[\Delta y_r^2(n) + \Delta y_i^2(n)]$, where $\Delta y_r(n)$ and $\Delta y_i(n)$ are the quantization errors in $y_r(n)$ and $y_i(n)$. For the CC architecture, it can be seen from (2.3a)-(2.3b) that these errors are given by $\Delta y_r(n) = \Delta c \ldots$
where $\Delta c(n)$ and $\Delta d(n)$ are the errors due to quantization of the coefficients $c(n)$ and $d(n)$, respectively. Now, assume that all the quantization errors $\Delta c_i(n)$ and $\Delta d_j(n)$ are mutually independent. In addition, assume a uniform noise model for the quantization error and a noise variance of $\sigma^2_{F,CC} = 2^{-2B}/12$. Then, the quantization error $J_Q$ is given by
where $R = E[X(n)X^H(n)]$ is the input correlation matrix. Now, we can make the performance of the finite-precision F-block arbitrarily close to that of the infinite-precision F-block by choosing a factor
This shows that the F-block in the SR architecture requires at most one bit more than in the CC architecture. The quantization error due to the finite-precision implementation of the F-block in the PIPSR architecture [see (2.6)] is the same as that of the SR architecture because both architectures involve the same computations in the F-block. Therefore, for a given value of this factor, the F-block precision in the PIPSR architecture is also given by
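The size of the F-block precision gap can be estimated with a short Python sketch. It rests on assumptions that go beyond the text: every quantized coefficient has the same noise variance; the SR F-block stores $c_1$, $d_1$, and $d$ as three independently quantized vectors; and $X_r(n)$ and $X_i(n)$ are uncorrelated, so that the power of $X_1(n) = X_r(n) - X_i(n)$ is the sum of their powers. The function names are hypothetical.

```python
import math


def cc_noise_gain(p_r, p_i):
    """Output noise power per unit coefficient-noise variance, CC F-block.

    Assumed errors: dy_r = dc^T Xr + dd^T Xi and dy_i = dc^T Xi - dd^T Xr.
    """
    return 2.0 * (p_r + p_i)


def sr_noise_gain(p_r, p_i):
    """Same quantity for the SR F-block under the stated assumptions.

    Assumed errors: dy_r = dc1^T Xr - dd^T X1 and dy_i = dd1^T Xi - dd^T X1.
    """
    p_1 = p_r + p_i                  # power of X1 when Xr, Xi are uncorrelated
    return (p_r + p_1) + (p_i + p_1)


def extra_f_block_bits(p_r, p_i):
    """Additional SR bits needed to match the CC quantization error."""
    return 0.5 * math.log2(sr_noise_gain(p_r, p_i) / cc_noise_gain(p_r, p_i))


if __name__ == "__main__":
    print(extra_f_block_bits(1.0, 1.0))
```

Under these assumptions the noise gain ratio is 3/2 regardless of the input powers, which corresponds to about 0.29 bits and is in line with the 0.3-bit difference quoted in the abstract.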
B. WUD-Block Precision
The finite-precision WUD-block can be analyzed by using a linear model for coefficient quantization noise. Then, $B_{WUD}$ is chosen based on the stopping criterion [3]. For the CC architecture, the correction terms are given by (2.3c)-(2.3d). Therefore, the adaptive filter will stop converging if the following two conditions are simultaneously satisfied:
$$\mu\,|e_r(n)x_r(n) + e_i(n)x_i(n)| < 2^{-B}$$
$$\mu\,|e_r(n)x_i(n) - e_i(n)x_r(n)| < 2^{-B}$$
A similar expression can be found from (2.5) for the coefficient precision of the WUD-block in the SR architecture. The stopping criterion in this case is given by
$$\mu\,|ex_1(n) + ex_3(n)| < 2^{-B} \tag{3.12a}$$
$$\mu\,|ex_2(n) + ex_3(n)| < 2^{-B} \tag{3.12b}$$
where $ex_1(n)$, $ex_2(n)$, and $ex_3(n)$ are the elements of the vectors $eX_1(n)$, $eX_2(n)$, and $eX_3(n)$ [see (2.5)], respectively. Squaring (3.12a) and (3.12b), adding and using stochastic estimates, we get
Comparing (3.11) and (3.14), we see that the precision requirements for the WUD-block in the SR architecture are 0.5 bits fewer than those of the CC architecture. This is indeed an attractive result given that the SR architecture also enables power savings of 21-25% [7].
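The origin of the 0.5-bit difference can be sketched as follows. This is a worked illustration rather than a reproduction of (3.11) and (3.14); it assumes the same threshold and step size in both architectures and uses only the definitions of $ex_1(n)$, $ex_2(n)$, and $ex_3(n)$ given in Section II. The symbols $P$ and $Q$ are introduced here for convenience.

```latex
\begin{align*}
P &= e_r(n)x_r(n) + e_i(n)x_i(n), \qquad Q = e_r(n)x_i(n) - e_i(n)x_r(n), \\
ex_1(n) + ex_3(n) &= P + Q, \qquad ex_2(n) + ex_3(n) = P - Q, \\
P^2 + Q^2 &= \bigl(e_r^2(n) + e_i^2(n)\bigr)\bigl(x_r^2(n) + x_i^2(n)\bigr), \\
(P + Q)^2 + (P - Q)^2 &= 2\,(P^2 + Q^2), \\
2^{-2B_{WUD,SR}} = 2 \cdot 2^{-2B_{WUD,CC}}
&\;\Longrightarrow\; B_{WUD,SR} = B_{WUD,CC} - \tfrac{1}{2}.
\end{align*}
```

The summed squared correction terms of the SR architecture are twice those of the CC architecture, so the SR updates remain above the stopping threshold with half a bit less precision. This agrees with the values $B_{WUD,CC} = 9.45$ and $B_{WUD,SR} = 8.95$ reported in Section IV.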
The precision requirements for the WUD block of the PIPSR architecture [see (2.7)] can be determined by replacing $\mu$ in (3.14) by $\mu\,LA$.
In most designs, we choose $LA = 1$ or $2$ to minimize the hardware overhead. Therefore, we conclude that the finite-precision requirements of the PIPSR architecture are similar to those of the SR architecture.
IV. APPLICATION TO 155.52 Mb/s ATM-LAN
In this section, we employ the finite-precision PIPSR architecture as a NEXT canceller for the 155.52 Mb/s ATM-LAN [6] over UTP-3 cable and present the simulation results. The basic transceiver block diagram is presented in Fig. 3. Two pairs are used for the dual-simplex transmission, where each direction of transmission uses a different pair, though in the same cable. The two main impairments are propagation loss and NEXT. We assume the worst-case EIA/TIA model for the simulations presented in this paper, where the channel and NEXT models are obtained from [6].
A. 155.52 Mb/s ATM-LAN Transceiver
For the details on the transmitter and the equalizer block diagrams in Fig. 3, refer to [6]. At the receiver (see Fig. 3), the received signal is distorted further due to the superimposition of the NEXT signal. This composite signal is processed by the fractionally spaced linear equalizer (FSLE), which is a pair of adaptive filters. In addition, the local transmitted symbols are passed through a complex adaptive NEXT canceller, which tries to cancel the effect of the NEXT in the received signal. Note that the NEXT canceller operates at the baud rate, which is three times lower than the sampling frequency of the FSLE.
In this section, we employ the fixed-point architectures presented in this paper as NEXT cancellers. The simulation procedure we adopt is to let the FSLE converge in the presence of the distortion introduced by 100 m of UTP-3 cable in the absence of NEXT. Next, the coefficients of the FSLE are frozen, and the local transmitter and the NEXT canceller are activated. This introduces the NEXT in the received signal, which the NEXT canceller attempts to cancel with an appropriately chosen bulk delay. In this section, we determine the precision requirements of the CC, SR, and PIPSR NEXT canceller architectures. We will assume that the PIPSR NEXT canceller has been obtained by pipelining the serial SR architecture to the pipelining level of 105 by using $D_1 = 109$, $D_2 = 5$, and $LA = 2$ (see [7] for more details regarding this choice of $D_1$, $D_2$, and $LA$).
B. Simulation Results
F-block precisions can be determined by employing (3.4) for the CC architecture and (3.6) for the SR architecture. On substituting the parameters given above, we obtain $B_{F,CC} = 8.87$ and $B_{F,SR} = 9.17$. These values are supported by the simulation results plotted in Fig. 4, which shows the variation of $\mathrm{SNR}_{\mathrm{slicer}}$ with the F-block precision in the CC and SR architectures. The desired SNR is attained at about 9 bits of precision for the CC architecture and 10 bits for the SR architecture. Fig. 4 also confirms that the coefficient precision required in the F-block for the SR architecture is at most 1 bit more than that for the CC architecture. Recall that this conclusion was also obtained from (3.7).
Similarly, the coefficient precision in the WUD block can be determined by employing (3.11) for CC and (3.14) for SR. For proper convergence, $\mu$ was chosen to be 0.0007 for the CC and SR implementations. The $B_{WUD}$ precisions are determined from (3.11) and (3.14) to be $B_{WUD,CC} = 9.45$ and $B_{WUD,SR} = 8.95$. This is confirmed by the simulation results in Fig. 5, where the desired performance is achieved with 9 bits of precision for both the CC and SR architectures.
We now consider the coefficient precisions in the PIPSR architecture. The F-block precision $B_{F,PIPSR}$ is obtained by substituting
Therefore, we conclude that the PIPSR architecture is a viable low-power solution for 155.52 Mb/s ATM-LAN and other digital subscriber loop applications.
I. INTRODUCTION
We study two novel adaptive algorithms for generalized eigendecomposition that are derived from a two-layer linear heteroassociative neural network. We discuss applications of these algorithms in an adaptive beamforming example to solve the near-far problem in code-division-multiple-access (CDMA) based cellular communications. Note that the well-studied topic of principal component analysis [1] provides adaptive algorithms for eigendecomposition of a correlation matrix $A$, which is the limit matrix of a single sequence of random matrices. We, on the other hand, provide adaptive algorithms for generalized eigendecomposition of a matrix pair $(A, B)$, which are the limit matrices of two sequences of random matrices.
Manuscript received July 24, 1997; revised October 13, 1997. This work was supported in part by NSF Grants ECS-9308814 and ECS-9523423. The associate editor coordinating the review of this paper and approving it for publication was Prof. Yu-Hen Hu.
C. Chatterjee is with GDE Systems Inc., San Diego, CA 92127 USA. V. P. Roychowdhury is with the Electrical Engineering Department, University of California, Los Angeles, CA 90095 USA.
Publisher Item Identifier S 1053-587X(98)03937-3.
A. Adaptive Beamforming for CDMA Based Cellular: A Case Study
As an example of an application that requires adaptive generalized eigendecomposition, we study the problem of on-line cochannel interference cancellation to solve the near-far problem in CDMA-based cellular communications. A number of nonadaptive methods have been proposed [3], [5]-[7] to solve this problem. A common scheme uses multiple (say, $m$) antennas to receive the signal at the base. The output of each antenna is put through a matched filter corresponding to the code of the desired user [7]-[9] (see Fig. 1). Although there are many methods to extract the desired signal at the base, we next consider a particular method that has been studied by several researchers [8], [9]. In the IS-95 standard, the bit period of the signal is on the order of 100 μs in duration. Within each bit period, there is roughly a 10 μs interval during which the desired filtered signal occurs. During this period of time, the signal plus interference correlation matrix $A$ is estimated. In the remaining 90 μs or so, we estimate the interference correlation matrix $B$.
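The estimation of $A$ and $B$ can be sketched in Python as a sample average of outer products over the antenna snapshots collected in each window. This is a minimal sketch under assumptions: the estimator form (the average of $x x^H$), the function name, and the snapshot layout (one length-$m$ list of complex matched-filter outputs per sample) are hypothetical and are not taken from [8] or [9].

```python
def sample_correlation(snapshots):
    """Average of x x^H over a list of length-m complex antenna snapshots.

    Returns an m-by-m matrix as a list of rows of complex numbers.
    """
    m = len(snapshots[0])
    acc = [[0j] * m for _ in range(m)]
    for x in snapshots:
        for p in range(m):
            for q in range(m):
                acc[p][q] += x[p] * x[q].conjugate()
    count = len(snapshots)
    return [[value / count for value in row] for row in acc]


if __name__ == "__main__":
    # Hypothetical snapshots for m = 2 antennas.
    signal_window = [[1 + 1j, 2], [1 - 1j, 0]]
    print(sample_correlation(signal_window))
```

In this setting, the snapshots taken during the roughly 10 μs signal interval would feed the estimate of $A$, and those taken during the remaining 90 μs would feed the estimate of $B$.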
Given the correlation matrices $A$ and $B$ of signal plus interference and interference, respectively, we compute the weight vector
In an attempt to simplify this problem, an alternative method
In all of the above-mentioned schemes, interference cancellation can be achieved by first computing the matrix pencil $(A, B)$ after collecting all of the samples and then applying a numerical procedure [2], i.e., by working in a batch fashion. If the principal generalized eigenvectors are computed in a batch mode, the time delay needed to make a decision would include not only the bit times needed to average the spatial correlation matrices but also the subsequent time required to compute the generalized eigenvectors. In addition, the batch mode operation will not, in general, exploit the fact that there is a gradual time variation of the weight vector $\mathbf{w}$ in an urban mobile environment and that we need to recompute $\mathbf{w}$ after every few (say 4) bits. In order to reduce this computation and obtain effective interference cancellation, an adaptive (i.e., on-
