Introduction
Space-time adaptive processing (STAP) refers to a class of signal processing techniques used to process returns of an antenna array radar system [4] . STAP algorithms are designed to extract desired target signals from returns comprised of Doppler shifts, ground clutter, and jamming interference. STAP simultaneously and adaptively combines the signals received on multiple elements of an antenna array -the spatial domain -and from multiple pulse repetition periods -the temporal domain.
The output of STAP is a weighted sum of multiple returns, where the weights for each return in the sum are calculated adaptively and in real-time. The most computationally intensive portion of most STAP approaches is the calculation of the adaptive weight values. Calculation of the weights involves solving a set of linear equations based on an estimate of the covariance matrix associated with the radar return data.
Existing approaches for STAP typically rely on the use of multiple digital signal processors (DSPs) or general-purpose processors (GPPs) to calculate the adaptive weights. These approaches are often based on solving multiple sets of linear equations and require the calculation of numerous vector inner products. This paper proposes the use of FPGAs as vector coprocessors capable of performing inner product calculation.
Two different "inner-product co-processor" designs are introduced for use with the host DSP or GPP. The first has a multiply-and-accumulate structure, and the second uses a reduction-style tree structure having two multipliers and an adder.
STAP Weight Calculation

Basic Formulation
The STAP algorithm assumed here is known as K th -order Doppler factored STAP, which is classified as a partially adaptive technique. Due to the space limitation, it will not be possible to fully explain this algorithm. Instead, the focus here will be on the necessary notation and core calculations required to determine the values of the adaptive weights. For more information on STAP, the reader is referred to [1, 4] .
Determining the values for the n-vector of adaptive weights, denoted by w r , involves solving a system of linear equations of the form: 
=
. Based on this the definition, Ψ is given by:
QR-Decomposition and Conjugate Gradient
The QR-decomposition approach is a direct approach for solving a system of linear equations. The QR approach always gives an exact solution and the complexity of the algorithm is fixed. It involves performing a QR-decomposition on the matrix X T , the result of which is an N N × orthogonal matrix Q and an N n× upper triangular matrix R such that X = QR. The final result is obtained by forward and backward substitution. For more details the reader is referred to [1] .
The conjugate gradient approach is an iterative method that provides a general means for solving a system of linear equations [2] . For the system of equations given in Eq. 
The function f is minimized when its gradient is zero, i.e.,
, which corresponds to the solution to the original system of linear equations. The very repetitive and regular numerical structure of the conjugate gradient update equations makes it a prime candidate for implementation on an FPGA system. Numerical studies were conducted using Matlab implementations of the QR-decomposition and CG methods on actual STAP data collected by the Multi-Channel Airborne Radar Measurement (MCARM) system of Rome Lab [3] . Further details of this study can be found in [6] .
Inner-Product FPGA Co-Processor
Each of the two methods outlined above requires calculating a number of inner products. Given enough resources, all the inner products could be done in parallel on FPGAs. Because the available system has only two FPGAs [5] , the computations was divided among the host processor and the FPGA board. The two schemes that were implemented are outlined below. For both schemes, the data vectors are assumed to be in block-floating-point format [9] . Additionally, the multiplier implementation is based on discussion in [7] and the adder unit uses 4-bit carry-look-ahead adders [8] in each stage of the adder pipe.
Multiply-and-Accumulate Implementation
In the first implementation shown in Figure 1 , the FPGA is configured to perform the multiply-and-accumulate operations on the input vectors. The implementation consists of a multiply unit and an accumulator, which is composed of a normalization unit and an adder. The normalization unit shifts the binary point of the mantissa and makes a compensating adjustment to the exponent prior to the addition. The output of the adder is fed back and accumulated with the next product term.
The single cycle multiply-and-accumulate is achieved by pipelining each unit of the implementation. This unit reads in two operands and performs two operations per cycle. Thus, the unit reduces two N-vectors to a constant number of partial sums equal to the number of stages in the accumulator pipe. The implementation allocates approximately 88% of the configurable logic blocks (CLBs) on the Xilinx 4028EX FPGA. The implementation can be clocked at 40MHz, thus giving a throughput of 80 million block-floating-point operations per second. Figure 2 illustrates the second implementation that performs an inner product, i.e., a multiply-and-add operation on the two input vectors. The design incorporates two 16-bit multiply units and an adder. By using this approach, two multiplies can be performed in parallel, and afterwards, the adder computes the sum of the two products.
Multiply-and-Add Implementation
A challenge associated with this implementation is that four 16-bit input operands, i.e., 64 bits, are required per computation cycle. Unfortunately, the data-path to the FPGA board is only 36-bits wide. The solution to this problem involves clocking the input state machine at twice the frequency of the multiply-andadd state machine, and registering the first two operands for one input state machine clock cycle.
The multiply-and-add unit reads in four operands and performs three block-floating-point operations per cycle. Thus, the two input N-vectors are reduced to an N/2-vector of partial sums. This implementation, however, involves an additional N/2 addition operations to obtain the inner product result. For this implementation, approximately 99% of the available CLBs on the Xilinx 4028EX FPGA are required. In summary, for a fixed clock rate, the second design can provide a higher throughput, but requires more computation from the host (to perform the final summation of the partial sums). 
