The problem of weight extraction for systolic adaptive beamforming systems has been the subject of intense research since the first well-known work of Gentleman and Kung on recursive least squares (RLS) systolic arrays [l, 21. Their approach is based on the QR decomposition (QRD) which is numerically stable. Although the QRD updates proposed in [l] are pipelined on a triangular array, their fully pipelined weight extraction by using the RLS systolic array consists of the two separate steps of QR-updates and backward substitution and has been shown to be unrealizable [3].
A major challenge in implementing the RLS algorithm for the multiple sidelobe cancellation adaptive beamforming system using systolic array processors (SAPS) is to design a single fully pipelined structure. A critical obstruction appears because the process of the QR-updates runs from the upper-left corner to lower-right corner of the array, while the process of the backward substitution runs in exactly the opposite direction as pointed out in [3]. Much research has been done on this subject recently [U] . In [4], Hudson and Shepherd have proposed a Kalman closed-loop system for the RLS parallel weight extraction problem that consists of a systolic QRD postprocessor to compute the least squares weighting vector. However, the parallel RLS weight extraction system proposed is not efficient for very large scale integration (VLSI) hardware implementation. The major hurdles are that this system, which requires two modes to update the data and to freeze the updated data for computing the weight vector error at the same time, is inefficient in obtaining the instant weight vector recursively, and that the feedback configuration may create serious routing problems in VLSI implementation and roundoff noise accumulation
In [5] , McWhirter introduced a fixed parallelogram structure for the parallel weight extraction in the RLS problem. However, McWhirter's RLS weight extraction system does not efficiently update the weight vector since the array must be frozen at each snapshot to compute the weights.
In a recent paper [A, a numerically stable and computationally efficient algorithm for weight extraction for a constrained recursive least squares (CRLS) problem has been described by Schreiber. Although the algorithm shown in [7] has robust numerical properties, it is difficult to arrange the whole algorithm into a single fully pipelined structure as pointed out in [8, 91 . The difficulty, which is the same as that in the RLS case, arises because the CLRS algorithm consists of several steps particularly involving the backward substitution step. Many minimum variance distortionless response (MVDR) adaptive beamformers with CRLS systolic array [41.
IEEE TRANSACTIONS ON AEROSPACE AND ELECTRONIC SYSTEMS VOL. 30, NO. 2 APRIL 1994 structures proposed in [%lo] are designed to avoid the extra backward substitution processor for computing the residual. Unfortunately, for the problem of paralleVpipelined weight extraction, very little has been done in implementing the CRLS algorithm into a single fully pipelined systolic array structure without requiring the backward substitution processors. Recently, Owsley developed an adaptive MVDR beamformer with a systolic array implementation using Schreiber's CRLS algorithm for weight extraction [ l l , 121. Nevertheless, Owsley's CRLS systolic array structure which consists of several block processors including the forward and backward substitution processors has been shown to be unpipelinable. Subsequently, ring, Liu, and Tretter [6, 131 presented systolic architectures for MSC and MVDR adaptive array systems. During the preparation of this work, the authors discovered a paper by Shepherd, et al. [14] in which they independently proposed a similar concept for RLS and CRLS algorithms.
In this paper, fully paralleVpipelined systolic arrays for the multiple sidelobe canceler (MSC) and MVDR adaptive weight extraction systems without the need for forward or backward substitution are described. The proposed MSC and MVDR systolic adaptive array systems have five advantages: 1) they are simple, modular, and expandable so as to be very suitable for VLSI hardward implementation, 2) the square-root-free Givens method can be easily incorporated into the architectures, 3) they are fully pipelined since the backward substitution is avoided, 4) they are open-loop systems without any feedback arrangement, and 5) they function recursively to update the instantaneous optimal weight vector since only one mode is required in the recursive updating.
two types of adaptive beamformers are introduced. They are the MSC formulated as a RLS problem and the MVDR beamformer formulated as a CRLS problem. In Section 111, the background and the new techniques to replace the forward and backward substitutions by the parallel multiplication and accumulation operation using systolic arrays are considered. In Section IV, the QR-based RLS algorithm for the MSC adaptive beamforming system without the need for backward substitution is described and its paralleVpipelined MSC weight extraction system with SAP is also presented. In Section V, the QR-based CRLS algorithm for the MVDR adaptive beamforming without the need for forward and backward substitutions is described and the paralleVpipelined MVDR weight extraction system with SAP implementation is also considered. In Section VI, the fast, square-root-free Givens method is employed for the MSC and MVDR adaptive array systems. This paper is organized as follows. In Section 11,
II. ADAPTIVE BEAMFORMING SYSTEMS
In this section, two popular adaptive beamforming systems are described. We first describe the MSC which consists of a main antenna and several auxiliary antennas. The multiple sidelobe cancellation technique is employed to suppress the sidelobe interferences and noises. The interferences and noises are estimated by multiplying the observed input data received through the auxiliary antennas by the adaptive weights and then summing them together. The interferences and noises are suppressed from the main radar channel by subtracting the estimates from radar main channel output. Second, an MVDR beamformer is considered. The MVDR beamforming technique is used to suppress the interferences and noises by constraining the response of the beamformer to specific directions and then minimizing the output power subject to the response constraints.
A. Multiple Sidelobe Canceler (MSC)
The block diagram of a MSC is shown in Fig. 1 . It is easy to see that the output at the ith snapshot can be expressed as
=1
A set of n successive snapshots can be represented in the vector form where y(n) is an n by 1 output vector matrix z(n) is an n by 1 desired input data vector matrix If n snapshots of input data are available, the output can be written in the vector form
where y(n), X ( n ) , and w(n) are as defined in (3), (9, and (6), respectively. with D being the distance between two sensors in the array, and X being the wavelength. Taking expectation of yH(n)y(n) yields
where M is the n by n covariance matrix of X ( n ) .
In general, designing a MVDR beamformer requires solving a constrained least squares optimization problem. The problem is 
where K is the number of look directions. Using the method of Lagrange multipliers, the optimal weight vector wLpt is found to be [15] In practice, M ( n ) used in (8) and (13) is the sample covariance matrix of the observed data X ( n ) , M ( n ) = XH(n)X(n) (14) rather than covariance matrix of X ( n ) used in (12).
Ill. SYSTOLIC ARRAY LINEAR ALGEBRA PROCESSING
In the emerging field of algorithmic engineering introduced by McWhirter [5] , the hybrid disciplines of designing numerically stable parallel algorithms suitable for parallel computation and mapping them onto VLSI systolic architectures to achieve high throughput rates and VLSI hardward implementation are demanded for sophisticated, high performance real-time modern signal processing. In real-time modern signal processing applications numerically reliable and computationally efficient algorithms and architectures for techniques such as RLS estimation, CRLS, solving linear systems, and performing singular value decomposition are required. Furthermore, in these applications it is also necessary to design a highly paralleVpipelined structure for the use in parallel supercomputers and for implementation by using VLSI systolic processors.
In this section some key processors are developed as the basic tools for designing sophisticated adaptive array systems. Moreover, the parallellpipelined techniques considered here make it possible to design more advanced adaptive array systems such as the MSC and MVDR adaptive beamforming systems studied here. A brief description of the key linear-algebra-based parallel algorithms with SAPs is provided in the following subsections.
A. Systolic Array for Preventing Forward Substitution
The computation of RHF = X is usually called the forward substitution where F is n x m matrix, R is n x n upper triangular matrix, and X is n x m matrix. In [16] it is carried out in two steps to obtain F by the Comon-Robert's algorithm when the matrices R and X are given. In the first step, the matrix R-' is computed when the matrix R is fed into the systolic array. In the second step, the vector XH is given as input to compute X H R -l . As a result, the complex conjugate of X H R -l is taken to obtain F.
Similar to [8, 171, the systolic algorithm (SA) shown in n b l e I requires only one step to generate F with the matrix R prestored in the systolic array and the matrix X as input fed into the array.
to obtain F is illustrated in Fig. 3 . The systolic parallelogram array processor is operated by sending X into the upper triangular systolic processor in parallel to obtain F [6] .
The systolic parallelogram array processor designed
B. Systolic Array for Preventing Backward Substitution
Backward substitution is required in many adaptive array algorithms. Assume the vector z and the lower triangular matrix R-H are the given data. The question now is how to design a systolic structure to compute b = R-lz, by using the given vector z and lower triangular matrix KH. Fortunately, this can be easily carried out on a systolic array efficiently. The technique introduced in this subsection shows that instead of solving Rb = z, the backward substitution, the forward substitution using matrix R-H is employed.
In Table 11 , the systolic algorithm for the parallel multiplication and accumulation operation of a given vector z and matrix to prevent computing backward substitution is described. The systolic array shown in Fig. 4 is designed by concurrently sending each element of the vector z to multiply the complex conjugate of each element of the matrix R-H and then by summing them together to obtain the vector b. It is clear that the vector b can be obtained by the parallel multiplication and accumulation operation of the data stored in the lower triangular part of the systolic array processor without performing the backward substitution [6] .
IV. QRD-MSC ADAPTIVE BEAMFORMER
Pipelined data-parallel algorithms that can be implemented on SAPs are called systolic algorithms (SAS). possess. The SAPS have additional nice properties such as simplicity, modularity, and expandability which are very suitable for implementations of adaptive beamforming systems onto VLSI. Conventional SMI methods generally lead to some undesirable numerical properties. In order to alleviate this difficulty, the QRD can then be used.
The QR-based MSC adaptive beamforming system is described in this section while the QR-based MVDR adaptive beamformer is considered in the next section. In this section we introduce a single and fully pipelined systolic parallelogram array processor for optimal weight extraction in the MSC adaptive beamforming system. The adaptive beamforming system requires two modes for initialization and only one mode for recursive updating to obtain the optimal weights. By using a QRD, we avoid computing the sample covariance matrix inversion as needed in (8) and (13). 
Combining (15) and (16), that use the QRD, we
Let us call this the mode 1 operation. Finally, using systolic parallelogram array processor described in Subsection IIIA, the initial lower triangular matrix F H ( N ) can be generated as in Table I and Fig. 3 . This operation is called the mode 2 operation. For n greater than N , it is known [18] that, in recursive updating, the unitary matrix Q(n) consists of two factors given by
where Q(n) is an n by n unitary matrix obtained from a sequence of Givens rotations that updates the new data row xT(n), and G ( n -1) is an n by n unitary matrix given by
TANG & ET AL.: OPTIMAL WEIGHT EXTRACTION FOR ADAPTIVE BEAMFORMING USING SYSTOLIC ARRAYS 371
Thus, when the unitary matrix Q(n) is applied to we have QW update the current data matrix X ( n ) , we have
where u(n -1) is an N x 1 vector, v(n -1) is an (n -N -1) x 1 vector, and # denotes an arbitrary value of no interest in mathematical and physical concept. Equations (19) and (20) update R(n -1) and u(n -1) once the data row xT(t,) and new desired data z(tn) are available. It is also necessary to update the lower triangular matrix F H ( n -1) for computing the instantaneous optimal RLS weighting vector. It can be seen that the same unitary matrix Q(n) can also be used to update the lower triangular matrix R-H(n -1) as given by
Therefore, from (21), we obtain
Combining the updating for the upper triangular matrix R(n -l), the orthogonalized desired vector u(n -l), and the lower triangular matrix R-H(n -l), (8) to give Equation (24) shows how to compute the optimal weight vector using QRD. However, the problem of designing a fully pipelined processor due to the backward substitution still remains. Since u(n) and R-H(n) are available, the technique described in Subsection IIIB can be used to realize the backward substitution. Equation (24) is then computed as follows:
where wL(n) is a 1 by N vector and T denotes transpose.
B. Systolic MSC Weight Extraction System
The systolic RLS weight extraction algorithm is summarized in Bble 111. The paralleypipelined weight vector obtained by this algorithm is defined only during the recursive updating for n 2 N when the observed data matrix is of full rank. We start with initializing the algorithm by setting the upper triangular matrix, the orthogonalized desired vector, and the lower triangular matrix to zero, i.e., R(0) = 0, u(0) = 0, and K H ( 0 ) = 0. In the initialization, mode 1 and mode 2 are required while only mode 1 is needed in recursive updating. When the observed input data matrix X ( N ) is available, N is the number of sensors. Then, the initial upper triangular matrix R ( N ) and the orthogonalized initial desired vector u ( N ) are generated by using the mode 1 operation, the QRD, and the lower triangular matrix R-H(N) are obtained by employing the mode 2 operation, the parallel multiplication and accumulation operation, as described in Subsection IIIA. Finally, for n 2 N, the optimal parallel weights are obtained by using only the mode 1 operation to update parameters and to carry out the parallel multiplication and accumulation instead of backward substitution. In Fig. 5 , the MSC systolic parallelogram array processor is illustrated for the case of four sensors to receive the observed input data and the desired input data, and to instantaneously generate the optimal updated weights in parallel. The system needs two procedures which are the initialization and the recursive updating. The initialization is further divided into two parts. First, under the mode 1 operation, the 3 x 3 observed input data X, the 3 x 1 desired input data z, and the 3 x 3 zero matrix are fed into the MSC systolic array to compute the 3 x 3 upper triangular matrix R and the 3 x 1 orthogonalized the processor element 1 generates the rotation coefficients c and s when zeroing out the observed input data. The processor elements 2 and 3 perform the rotation of the received input data according to the rotation coefficients. The processor element 4 not only perform the rotation but also carries out the parallel multiplication and accumulation operation to is employed to generate the lower triangular matrix desired vector U stored in the upper left part of the systolic parallelogram array processor. Secondly, under the mode 2 operation, the 3 x 3 identity matrix, 1 x 3 matrix of Is, and zeroes are sent into the processor to generate the 3 x 3 lower triangular matrix R-H which is stored in the lower right triangular part of the array processor. The upper left triangular processors perform the parallel multiplication and accumulation operation instead of forward substitution to generate the lower triangular matrix R-H when 3 x 3 identity matrix is received, and the lower right triangular processors function as the loading operation when 1 x 3 matrix of 1s is received. Finally, during recursive updating, the optimal weight vector is obtained in parallel under the mode 1 operation. When the 1 x 3 observed input data vector, the desired input data, and 1 x 3 zero vector are fed into the processor, the 1 x 3 updated optimal weight vector is obtained instantaneously at the bottom of the array. The four processor elements of the systolic array are given in Fig. 6 . The mode 1 operation for the MSC systolic array for each processor element is described in n b l e IV, and the mode 2 operation is presented in Dble V. The mode 1 operation is used to carry out the QRD and parallel multiplication in both initialization and recursive updating. Under mode 1 operation, operation is used to generate parameters while the processor 2 and 3 are employed to carry out the parallel multiplication and accumulation operation. The processor element 4 is simply used to store the lower triangular matrix. It is illustrated in Fig. 5 that the two different modes described in Table IV and V are used in the initialization. The recursive updating for the paralleVpipelined weight extraction used for n > N only requires mode 1 operation. Since the optimal weight vector obtained in the recursive updating only one mode, the parallel MSC systolic parallelogram array processor proposed is fully pipelined.
R-H(n)
.
V. QRD-MVDR ADAPTIVE BEAMFORMER
Since weight extraction in the conventional QR-based constrained recursive least squares (QRD-CRLS) algorithms requires both recursive orthogonalization and backward substitution [7, 11, 121 , it is impossible to update the whole system in a fully pipelined manner. The new QRD-CRLS technique proposed is able to obtain the optimal weights without performing backward substitution.
A. QR-Based CRLS Algorithm for MVDR Beamformer
In this subsection, a parallel and fully pipelined algorithm is introduced for weight extraction in a systolic MVDR beamformer. When the N-snapshot observed input data matrix X ( N ) is available, by applying the QRD to the observed data X ( N ) , the initial upper triangular matrix R ( N ) is given by When n is greater than N, the number of sensors, it has been shown in Subsection IVA that a QRD can be carried out recursively to update the optimal weights. The upper triangular matrix can be updated as before, i.e.
PR(n
To update a parameter vector s'(n -l), notice that
As described in the MSC system, the same unitary matrix used to update the upper triangular matrix R(n -1) can also be employed to update the lower triangular matrix K H ( n -l), that is
Updating for the upper triangular matrix R(n -l), the parameter vector s'(n -l), and the lower triangular matrix R-H(n -1) can be described by the combined
On substituting (B), (29) , and (33) into (13), the
MVDR weight vector w(', (n) becomes (34)

B. Systolic MVDR Weight Extraction System
A single fully pipelined structure is shown in Fig. 7 for the case of three sensors with one constraint. Another single fully pipelined structure is shown in Fig. 8 for three sensors with multiple constraints. There are five processor elements as shown in Fig. 9 for our proposed fully pipelined structure. Both parallel weight extraction MVDR systems require two procedures: the initialization and recursive updating. The initialization can be further divided into two modes. Under the mode 1 operation, the 3 x 3 upper triangular matrix B is generated and stored in processor elements 1 and 2, when the 3 x 3 input observed matrix and the 4 x 3 zero matrix are received. Then, under the mode 2 operation described in Subsection IIIA functioned as the parallel multiplication and accumulation operation instead of forward substitution, the 3 x 1 initial parameter vector s and the 3 x 3 lower triangular matrix R-H are obtained and stored in processor elements 3 and To demonstrate how the paralleVpipelined weight extraction system functions, the summary of the whole system to obtain the optimal weight vector for MVDR adaptive beamforming is described in Tmble VI. The mode 1 operation and the mode 2 operation for each processor element are given in Table VI1  and Table VIII . Under the mode 1 operation, the processor element 1 generates the rotation coefficients c and s, while processor elements 2, 3, and 4 rotate the received data. The processor elements 3 and 4 also performs the multiplication-and-accumulation operation to compute the normalization coefficient and optimal weight vector before normalization. The processor element 5 performs the normalization for the optimal weights. Under mode 2 operation, the processor elements 1 and 2 perform the parallel multiplication and accumulation operation without exactly computing the forward substitution while the remaining processor elements operate as temporary storage for the parameter vector and the lower triangular matrix. The system is started by setting the whole parallelogram array processor to zero, i.e., Two different modes described are employed in the initialization. The upper triangular matrix R ( N ) is generated by using mode 1 operation while the initial parameter vector s ( N ) and the initial lower triangular matrix K H ( N ) are obtained by using mode 2 operation described in Subsection IIIA. In recursive updating, the systolic parallelogram array processor uses mode 1 operation to update and compute the optimal weights in parallel during time n > N .
VI. FAST GIVENS-BASED ADAPTIVE BEAMFORMING SYSTEMS
It is important in implementing the QR decomposition by VLSI to avoid the square root [8, 181. Since computation of the square root is complicated, it limits the through-put of the VLSI processors. To speed up the system, a square-root-free version of the Givens method is essential and is referred to as the fast Givens method by Gentleman [19] and Hammarling [20] .
A. Fast Givens-Based RLS Algorithm for MSC
The upper triangular matrix R(n) can be rewritten as 
Note: Underlined symbols here correspond to boldface symbols in text.
lower triangular matrix E H ( n ) as follows generated without explicitly computing
and --H where R'-H(n) = D-'(n)R be readily seen that the L S weight vector can be computed without a square root as (n). Then, it can
where D'12 is never explicitly computed. In the initialization the upper triangular matrix R ( N ) , and orthogonalized desired vector ri(N) can be -
Similar to the last two sections, the initial lower triangular matrix R-H(N) can also be obtained easily.
In order to update the optimal weight vector recursively, an upper triangular matrix K ( n -I), a vector ri(n -l), and a lower triangular matrix R (n -1) must be updated at each new data sample. The following equation shows how to update all of the parameters together where # denotes an arbitrary matrix or vector with no special interest. Finally, the optimal weight vector is then computed by
The square-root-free algorithm and its SAP are described in Table IX and Fig. 10 . The proposed fast MSC weight extraction system consists of four processor elements as given in Fig. 11 . In Tmble X, the mode 1 operation performs the fast Givens rotations and parallel multiplication without computing backward substitution. In addition, under the mode 1 operation, the PE 1 is used to generate parameters without computing the square root, the PE 2 and 3 are operated to transform the input data by those parameters, and the PE 4 is employed not only to transform input data but also to carry out the accumulation and addition operation for obtaining the optimal weights. In Table XI , the mode 2 operation functions as the parallel multiplication and accumulation operation without the need for forward substitution. The functions of the four processor elements in the fast MSC SAP are performed in the same way as in the MSC systolic array processor described in Section IVB.
B. Fast Givens-Based CRLS Algorithm for MVDR
Similarly, for MVDR beamforming we can factor
(n)s(n), and K H ( n ) = D'/2(n)R'-H(n), where R'-H(n) = D-1(n)3-H(n).
All the parameters must also be updated for computing the optimal weight vector. The following equation shows how all the parameters can be updated together -The fast Givens CRLS algorithm and SAP are described in Table XI1 and Figs. 12 and 13. There are five processor elements in the systolic array. Processor elements are illustrated in Fig. 14 and the functions of each processor element under the two modes are also described in Tables XI11 and XIV. VII. CONCLUSION
In this paper, we consider the QRD-based RLS and CRLS problems for MSC and MVDR weight extraction systems. Both weight extraction systems require two modes for initialization and one mode for recursive updating. Since the optimal weight vectors obtained for MSC and MVDR beamformers are defined solely in recursive updatings with only one mode, it is shown that the weight extraction systems proposed are fully pipelined to compute the optimal weights instantaneously. Compared with the conventional weight extraction system involving the forward and backward substitutions which may lead to significant obstruction in designing a fully pipelined systolic architecture to update the optimal weights instantaneously, our proposed systolic parallelogram structure is very promising for VLSI implementation. Furthermore, the fast Givens method without square root operation is employed to improve the throughput for parallel weight extraction systems and to increase the operational speed for MSC and MVDR adaptive array systems. [IO] Yang, B., and Bome, J. E (1988) Systolic implementation of a general adaptive array processing algorithm.
In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, April 1988 April , 2785 April -2788 
