Architectures for adaptive weight calculation on ASIC and FPGA by Walke, R.L. et al.
Architectures for Adaptive Weight Calculation on ASIC and FPGA 
R L Walke, R W M Smith 
DERA (Malvern), St. Andrews Road, Malvern, WR14 3PS, UK. 
G Lightbody 
DSiP Laboratories, Queen's University of Belfast, Northern Ireland, UK. 
email: walke@signal.dera.gov.uk 
Abstract 
We compare two parallel array architectures for adaptive
weight calculation based on QR-decomposition by
Givens rotations. We present FPGA implementations of
both architectures and compare them with an ASIC-based
solution. The throughput of the FPGA implementations is
of the order of 5-20 GigaFLOPS, making FPGA a viable
alternative to ASIC implementation in applications where
power consumption and volume cost are not critical. 
1. Introduction 
The process of adaptive beamforming enables the beam 
pattern of an array of antennas to be shaped to counter in- 
terference arriving from directions other than that of the 
wanted signal[1]. The subsequent performance benefits are 
often sufficient to justify multiple antennas and their re- 
ceivers, and the technique is now commonly employed in 
radar, sonar and more recently communications systems. 
In radar applications the data-rate is of the order of MHz 
and the sub-process of calculating the adaptive weights for 
a particular environment can be a very computationally demanding
task. Therefore, its efficient implementation is a 
worthwhile topic of investigation. 
Recursive least squares by QR decomposition is a well 
established technique for solving the least-mean-squares 
problem at the heart of adaptive weight calculation for 
both beamforming and filtering applications[2]. Good nu- 
merical performance is achieved by performing the algo- 
rithm using Givens rotations which allows the use of 
reduced wordlength arithmetic for efficient implementa- 
tion on both ASIC and FPGA. Furthermore, a highly paral- 
lel triangular array processor architecture, known as the 
QR-array, exists that allows a very-high throughput imple- 
mentation to be achieved by employing a large number of 
processors[3]. 
In reality, the number of processors used by the QR-ar- 
ray solution is too high and the throughput far in excess of 
the system requirements. To address this either the array is 
mapped to a reduced number of time-shared processors in 
a linear array, or alternatively, the triangular architecture is 
maintained and the processors themselves are constructed 
from time multiplexed arithmetic units (e.g. digit serial 
arithmetic). In this paper we adopt the former approach 
and consider two different mappings from the triangular 
QR-array to a linear one. We adopt this approach to pro- 
vide a greater range of throughput options and to avoid the 
inefficiency that may arise from time-multiplexing at the 
arithmetic operator level within the processor. 
The QR-array requires two types of operation, often re- 
ferred to as boundary and internal cell operations. The 
mappings can be differentiated in the way cell operations 
are mapped to either dual operation or to distinct single op- 
eration processors. As such, we refer to them here as mixed 
and discrete mappings respectively. Both offer 100% or 
close to 100% utilisation over a broad range of problem 
size. 
The mixed mapping was proposed by Rader[4] and 
adopts CORDIC (Coordinate Rotation by Digital Computer)[5]
operators to realise the functionality of both boundary
and internal cell operations. 
The discrete mapping was developed by the authors to 
enable a whole range of algorithms based on standard 
arithmetic operators to be employed[6]. With these algo- 
rithms the function of the cells is quite different and the 
discrete mapping allows processors to be optimised for 
one or other cell operation. Algorithm variants have been 
developed to avoid square-roots and divisions, reduce op- 
eration count and allow fixed-point arithmetic to be em- 
ployed. We use the Squared Givens Rotation (SGR) 
variant[7] here as it offers low operation count and is 
square-root free. It requires floating-point arithmetic but 
we only require low mantissa wordlength. 
In this paper we explore the differences between the 
two approaches by considering FPGA implementations of 
both. We start by giving a brief overview of adaptive 
beamforming in radar systems in section 2.  In this section 
we also include some numerical simulation results to establish
wordlength requirements for our two algorithms, as 
wordlength is critical to the size of our implementations. A 
brief overview of the two mappings is presented in section 
3, followed by their FPGA implementations in section 4. 
We contrast the two approaches in section 5 and compare 
with an ASIC approach. We present our conclusions in 
section 6. 
0-7803-5700-0/99/$10.00 © 1999 IEEE 
2. Adaptive beamforming in radar systems 
Figure 1 provides an overview of an adaptive beam- 
former in which the outputs of an array of antennas are 
combined in a way which places nulls in  the direction of 
interference whilst maintaining a high gain in the direction 
of interest. The antenna outputs are down-converted to 
base-band frequencies and digitised. The weights, wi, to 
create a particular beam are calculated over a block of in- 
put data and then applied to this data via an array of com- 
plex multipliers to give the beamformed output. 
Figure 1: Adaptive beamforming system architecture (block labels: receiver, weight calculation, constraints, back-substitution, weight flushing, weight application, beams 1 to n)
In the proposed system QR-decomposition is used to 
obtain the weights in a three-stage process. The first step is 
to decompose a minimum of 2p input samples of the input 
data, referred to collectively as X, into an upper-triangular 
matrix R and vector U. This captures the interference envi- 
ronment. To these are applied constraints for each beam, 
which stabilise the beam pattern and set a particular look- 
direction. The resulting Ri and ui are related to the weights 
by Riwi=ui, and the weights are obtained by either back- 
substitution or weight-flushing (latter shown in figure). 
Back-substitution requires a separate processor but is com- 
putationally efficient, whereas weight flushing reuses the 
same hardware, but requires many more operations. 
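The back-substitution step can be illustrated with a minimal sketch. It assumes that R is held as a list of rows of an upper-triangular matrix with a non-zero diagonal and that u is a plain list; the function name back_substitute and the example values are illustrative only and are not taken from the system described here.

```python
def back_substitute(R, u):
    """Solve R w = u for w, where R is upper-triangular (list of rows).

    Illustrative helper: works for real or complex entries and assumes
    every diagonal element of R is non-zero.
    """
    p = len(u)
    w = [0.0] * p
    # Start from the last row, which involves a single unknown.
    for i in range(p - 1, -1, -1):
        acc = u[i]
        for j in range(i + 1, p):
            acc -= R[i][j] * w[j]
        w[i] = acc / R[i][i]
    return w


# Hypothetical 3-antenna example (values chosen only for illustration).
R = [[2.0, 1.0, -1.0],
     [0.0, 3.0, 2.0],
     [0.0, 0.0, 4.0]]
u = [1.0, 13.0, 8.0]
w = back_substitute(R, u)
print(w)
```

The inner loop touches about p(p+1)/2 multiply-accumulates in total, which is why back-substitution is cheap in operations even though it needs its own processor.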
2.1 Wordlength requirements 
The size of our FPGA or ASIC implementation is de- 
pendent upon the square of the wordlength (at least to a 
first approximation). Therefore, it is important to deter- 
mine the minimum wordlengths from system simulations. 
In the adaptive beamformer application, the wordlength 
has to be sufficient to calculate the weights to an accuracy 
that allows full cancellation of the interference. To put this 
in perspective, the input contains 3 components: signal, in- 
terference and thermal noise (the latter from receiver com- 
ponents). The interference and signal are usually measured 
relative to thermal noise, where the signal is generally 
smaller and the interference much larger than thermal 
noise. The scaling of the ADC input is usually arranged so 
that thermal noise toggles the last bit or so, and has suffi- 
cient range to digitise interference without significant 
probability of saturation. 
The task of the adaptive beamformer is to suppress in- 
terference to thermal noise levels. Therefore, the weights 
must be calculated with sufficient accuracy to do this. For 
60dB interference (i.e. a voltage 1000 times greater than 
thermal noise), approximately 10 bits of accuracy are
required to do this. 
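The arithmetic behind the 10-bit figure can be checked directly. The sketch below only converts a level in dB to a voltage ratio and then to a number of binary digits; the function names are illustrative, and it says nothing about how errors accumulate in a particular algorithm.

```python
import math


def interference_ratio(level_db):
    """Voltage ratio for a level given in dB relative to thermal noise."""
    return 10 ** (level_db / 20)


def bits_to_resolve(ratio):
    """Number of binary digits needed to span the given voltage ratio."""
    return math.log2(ratio)


ratio = interference_ratio(60)   # 60 dB interference
bits = bits_to_resolve(ratio)    # just under 10 bits
print(f"ratio = {ratio:.0f}, bits = {bits:.2f}")
```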
The accuracy of the weights depends upon the nature of 
the error and how it accumulates. Figure 2 shows the sig- 
nal to interference+noise ratio (SNIR) for a range of word- 
lengths, for both the SGR and CORDIC algorithms. The 
number of antennas is 32 (i.e. p=32).  At low wordlengths 
the SNIR is dominated by arithmetic errors, but as word- 
length is increased the SNIR improves until it becomes 
dominated by the thermal noise (here, after post-process- 
ing, the maximum SNIR is 17dB) and there is no benefit in 
increasing the wordlength further. 
a more rapid growth of errors in the accumulated R and U 
terms (it grows with the number of operations n rather than 
√n as in the floating-point SGR implementation). The 
larger the problem size, p, the greater this error. Rounding 
has been employed within the CORDIC operations to reduce
error and bias, however some still remains. 
3. QR-array mappings 
3.1 QR-array 
Figure 3 shows a 7 input QR-array (i.e. p = 6)[3] utilising
p(p+3)/2 = 27 processors. The operation of the two 
processor types is shown in the insets. The antenna data 
enters from the top and progresses down the array. On 
each row the leading term is eliminated by a 2-d rotation 
between cell input and the elements of R and U which are 
stored within each cell. 
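The row-by-row elimination can be sketched in software as follows. This is a real-valued illustration using conventional Givens rotations with a square root, not the complex CORDIC or SGR cells implemented later; the names boundary_cell, internal_cell, qr_update and cell_counts, and the input data, are assumptions made for the example.

```python
import math


def boundary_cell(r, x):
    """Update the stored diagonal term and return the rotation (c, s)."""
    rho = math.hypot(r, x)
    if rho == 0.0:
        return 0.0, 1.0, 0.0          # nothing to rotate
    return rho, r / rho, x / rho


def internal_cell(r, x, c, s):
    """Rotate the stored term against the input; pass the new x downwards."""
    return c * r + s * x, -s * r + c * x


def qr_update(R, x):
    """Fold one input sample (p antenna values plus one extra column) into R."""
    x = list(x)
    for i in range(len(R)):
        R[i][i], c, s = boundary_cell(R[i][i], x[i])
        for j in range(i + 1, len(x)):
            R[i][j], x[j] = internal_cell(R[i][j], x[j], c, s)
    return R


def cell_counts(p):
    """Boundary and internal cell counts for a (p+1)-input triangular array."""
    return p, p * (p + 1) // 2


p = 3
R = [[0.0] * (p + 1) for _ in range(p)]
X = [[1.0, 2.0, 0.0, 1.0],
     [0.0, 1.0, 1.0, 2.0],
     [2.0, 0.0, 1.0, 0.0],
     [1.0, 1.0, 1.0, 1.0]]
for sample in X:
    qr_update(R, sample)

print(cell_counts(6))   # boundary and internal cells for the p = 6 array
```

For p = 6 the counts are 6 boundary and 21 internal cells, which sum to the 27 processors quoted above.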
Figure 3: Systolic QR-Array Processor 
The number of operations rapidly grows with problem 
size. However, techniques for mapping large arrays of op- 
erations onto a reduced number of processors are well es- 
tablished [8][9]. We consider two mappings which have 
been derived for the QR-array. 
3.2 Mixed mapping 
Figure 4 shows how the operations of the triangular ar- 
ray may be mapped onto dual function processors that are 
fully utilised. This is done by first moving the bottom por- 
tion of the array to give the same number of operations on 
each row in the resulting array. The operations are then al- 
located onto a linear array of processors. In this case two 
rows of operations are mapped onto each processor giving 
a sparse solution employing only 2 processors. Each proc- 
essor performs the same number of operations (in this case 
18) and so is fully utilised. 
The order of execution of the operations on each proc- 
essor is along each row from left to right, and rows top to 
bottom. Note that relocated operations will compute the 
tail-end of earlier QR-updates (to provide time for x-values 
to propagate down and then back up the processor array), 
with the effect that QR-updates are interleaved. 
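The load balancing achieved by the fold can be checked with a small counting sketch. It assumes a triangular array whose rows hold 8, 7, ..., 1 operations (36 in total), chosen because it reproduces the 18 operations per processor quoted above. It models only how many operations land on each processor, not the exact cell placement or schedule of Figure 4, and the function names are illustrative.

```python
def fold_triangle(n):
    """Fold rows of n, n-1, ..., 1 operations into n/2 rows of n+1 operations.

    Row i (counted from the top) is extended with the row that is i places
    from the bottom, so every folded row has the same length. n must be even.
    """
    if n % 2:
        raise ValueError("this sketch assumes an even number of rows")
    rows = [[(i, j) for j in range(i, n + 1)] for i in range(1, n + 1)]
    return [rows[i] + rows[n - 1 - i] for i in range(n // 2)]


def allocate(folded, rows_per_processor):
    """Assign consecutive folded rows to each processor of the linear array."""
    return [sum(folded[k:k + rows_per_processor], [])
            for k in range(0, len(folded), rows_per_processor)]


folded = fold_triangle(8)
processors = allocate(folded, 2)
print([len(row) for row in folded])        # operations per folded row
print([len(ops) for ops in processors])    # operations per processor
```

With one folded row per processor the same sketch gives a 4-processor array with 9 operations each, which illustrates how the mapping trades processors against throughput while keeping utilisation at 100%.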
Figure 4: Mixed mapping (panels: folded QR-array; sparse linear array with two rows mapped onto each processor)
3.3 Discrete mapping 
The need for dual-function processors is avoided by the 
discrete mapping shown in Figure 5.  The array is obtained 
by first moving the lower portion of the array to the top, 
and then folding it to interleave operations. This places the 
same number of operations in each diagonal and ensures 
that local processor inter-connections are obtained when 
the operations are projected down the diagonal onto a linear
array of processors, as shown in Figure 5.
Figure 5: Discrete mapping (panels: (a) bottom portion moved to top of array; (b) array folded about z-axis to interleave cells; operations projected down z onto linear array of processors. Note: a sparse linear array solution may be obtained by mapping multiple diagonals of internal cells onto processors. The boundary cell will be under-utilised.)
As with the mixed mapping, the relocated operations 
will finish the tail-end of an earlier QR-update. 
4. FPGA implementations 
We generate FPGA implementations of the two archi- 
tectures from a hierarchy of structural VHDL descriptions. 
These are parametrised for wordlength, and include at- 
tributes to provide placement information to achieve very 
dense layout with predictable timing. 
4.1 CORDIC - mixed mapping 
CORDIC implements a 2-d rotation directly using a se- 
quence of sub-rotations which can themselves be imple- 
mented by a sequence of shift and add/subtract operations. 
This maps very well onto Xilinx Virtex FPGAs as they im- 
plement fast carry-propagate adder/subtractors very effi- 
ciently. The carry propagation delay is so small relative to 
other delays that there is little speed advantage in using 
more complex redundant arithmetic adders such as signed- 
binary or carry-save. 
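The shift and add/subtract structure of CORDIC can be sketched as below. This is a real-valued floating-point model in which multiplication by 2^-i stands in for the hardware shift. The function names, the 16-iteration default and the test vector are assumptions for illustration, and the sketch does not model the wordlengths or pipelining of the FPGA design.

```python
import math


def cordic_vector(x, y, iterations=16):
    """Vectoring mode: drive y towards zero, recording each rotation direction.

    Assumes x >= 0 on entry (a pre-rotation would normally guarantee this).
    Returns the magnitude scaled by the CORDIC gain, the residual y and the
    list of direction bits.
    """
    directions = []
    for i in range(iterations):
        d = -1 if y > 0 else 1
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        directions.append(d)
    return x, y, directions


def cordic_rotate(x, y, directions):
    """Rotation mode: replay stored direction bits on another vector."""
    for i, d in enumerate(directions):
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
    return x, y


def cordic_gain(iterations):
    """Constant gain K accumulated by the sub-rotations."""
    return math.prod(math.sqrt(1 + 4.0 ** -i) for i in range(iterations))


K = cordic_gain(16)
mag, resid, bits = cordic_vector(3.0, 4.0)
rx, ry = cordic_rotate(1.0, 0.0, bits)
print(mag / K, resid, rx / K, ry / K)
```

Storing the direction bits from the vectoring pass and replaying them is the software analogue of generating rotation controls in one cell and reusing them in others. Every output carries the gain K (about 1.647), which is why some form of scaling correction is needed.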
Rader[4] showed how complex Givens rotations can be 
implemented using 3 CORDIC rotations as shown in Figure 6.
(See [5] and [4] for CORDIC operator details). 
Figure 6: CORDIC processor (panels: b) rotations in boundary cell mode: i) pre-rotation to make real part of x positive, ii) rotation to set imaginary part to zero, iii) rotation to update r; c) linear array)
The processor is switched between boundary and inter- 
nal cell operation as dictated by the schedule. In boundary 
cell mode the first CORDIC rotates the vector formed 
from the real and imaginary parts of the input, x ,  to set its 
imaginary part to zero. The real part is now processed as 
per Figure 3 i.e. it is rotated to zero against the stored r-term,
and hence there is no x output in this mode. This requires
only one θ-CORDIC, but the rotation is repeated in 
the imaginary datapath to generate rotation controls which 
can be stored locally for speed. The pre-rotation, φ-rotation
and θ-rotation are then repeated in subsequent internal 
cell operations, in which both θ-rotation CORDICs are required
as the imaginary part of r may now be non-zero. 
Each CORDIC output must be scaled by a constant 1/K 
to give a true circular rotation. This is avoided in our implementation
by scaling the array input, x_in, up by an additional
factor K on each iteration (i.e. K^n on the n-th 
iteration) to match the CORDIC scaling on the R and U 
terms. No correction is required on the x terms as the scal- 
ing is the same on each row, and so makes no difference to 
the final weights. To avoid overflow of both stored and x 
terms an occasional 1-bit shift-right operation is applied to 
the CORDIC outputs. The input scaling and shift control 
signals are pre-computed for a particular block length and 
stored in RAM. 
Figure 7 shows the layout of a 4 processor array. Clear- 
ly visible are the 3 CORDIC blocks employed by each 
processor. The input scaling multiplier has not been in- 
cluded here, and would be combined with other pre- 
processing. The maximum clock rate is 108 MHz. 
Figure 7: CORDIC QR processor on XCV1000BG560-6 
4.2 Squared Givens Rotations - discrete mapping 
Figure 8 shows a signal flow graph for the boundary 
and internal cell operations with the SGR algorithm. 
Figure 8: Squared Givens Rotation Algorithm 
The loop to update the R and U quantities consists of a 
simple adder. This has two advantages. Firstly, the word- 
length of the adder can be increased to improve the accura- 
cy to which R and U are accumulated over long runs of 
data. Secondly, with appropriate input scaling the adder 
can be made fixed-point and R and U terms updated on 
every clock cycle (assuming a processor for each cell in 
the array i.e. the QR-array is used). Therefore, very high 
sample-rate operation in excess of 100MHz is possible. 
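A square-root-free update of this kind can be sketched in software. The version below is a real-valued illustration derived from the conventional Givens rotation by storing each row of R multiplied by its diagonal element (u = r_ii * r_i) and carrying a scalar weight w with the input (x = sqrt(w) * v). It omits the forget factor, complex arithmetic and reduced-wordlength floating point, and is not a transcription of the signal flow graph in Figure 8; all names are illustrative.

```python
import math


def sgr_row_update(u, v, w):
    """Fold the weighted input (v, w) into one stored row u.

    u[0] plays the role of the boundary cell and the rest are internal cells.
    Returns the updated row and the reduced input passed to the next row.
    """
    u0_new = u[0] + w * v[0] * v[0]        # boundary cell: accumulate square
    if u0_new == 0.0:                      # nothing to eliminate on this row
        return u, v[1:], w
    b = w * v[0]
    ratio = v[0] / u[0] if u[0] != 0.0 else 0.0
    # The stored terms are updated by a multiply followed by a plain add.
    u_new = [u0_new] + [uj + b * vj for uj, vj in zip(u[1:], v[1:])]
    v_out = [vj - ratio * uj for uj, vj in zip(u[1:], v[1:])]
    w_out = w * u[0] / u0_new
    return u_new, v_out, w_out


def sgr_update(U, x):
    """Pass one input sample down the triangular array of stored rows."""
    v, w = list(x), 1.0
    for i in range(len(U)):
        U[i], v, w = sgr_row_update(U[i], v, w)
    return U


def to_R(U):
    """Recover conventional R rows (one square root per row, done off-line)."""
    return [[uj / math.sqrt(u[0]) for uj in u] for u in U]


p, n = 3, 4
U = [[0.0] * (n - i) for i in range(p)]
X = [[1.0, 2.0, 0.0, 1.0],
     [0.0, 1.0, 1.0, 2.0],
     [2.0, 0.0, 1.0, 0.0],
     [1.0, 1.0, 1.0, 1.0]]
for sample in X:
    sgr_update(U, sample)
R = to_R(U)
print(R)
```

The only feedback around each stored term is the addition in u_new, which is the property that allows the update loop to be closed in a single clock cycle when the adder is made fixed-point.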
Figure 9 shows the implementation of 1 boundary and 2 
internal cell processors on a Virtex FPGA. All operators 
are fully parallel and pipelined, accepting new operands on 
each clock cycle. The maximum clock rate is 120MHz. 
Figure 9: SGR implementation on XCV1000BG560-6 
Table 1: Throughput per chip
            | XCV1000-6 (0.22µm) FPGA               | Future: XCV3200E
Algorithm   | Clock (MHz) | Processors | MFLOPS     | Clock (MHz) | Processors | MFLOPS
CORDIC      | 108         | 4          | 6,372 (a)  | 135         | 12         | 25,245
SGR         | 120         | 3          | 5,160 (b)  | 150         | 9          | 20,850
            | ASIC, 0.35µm                          | ASIC, 0.18µm
SGR         | 100         | 21         | 32,900     | 190         | 74         | 225,000
a. CORDIC is fixed-point. This is the equivalent number of FLOPS. 
b. Results based on 100% boundary cell utilisation. For small numbers of 
processors and large problem sizes the boundary cell processor will be
under-utilised. For example, if p=16 then actual throughput would be 4,170 
MFLOPS. 
b) Weight update period: A more useful interpretation 
of throughput to the system designer is the weight update 
rate. This is summarised in Table 2 for a range of problem 
size p, and is based on 2p data samples, followed by p+1 
constraint inputs for each beam. 
Table 2: Weight-update period estimates 
            | Weight-update period (µs)
FPGA        | XCV1000BG560-6                 | Future: XCV3200E
            | Number of antennas (p)         | Number of antennas (p)
Algorithm   | 16    | 32   | 64    | 128    | 16       | 32   | 64  | 128
CORDIC      | 19.1  | 122  | 994   | 7,415  | 19.1 (a) | 48.9 | 283 | 2,224
SGR, FPGA   | 27.8  | 213  | 1,673 | 13,244 | 5.55     | 42.7 | 335 | 2,649
a. Rate limited by r-loop delay (discussed later), null operations inserted. 
c)  Latency: The time between applying the input and 
completing an update of R & U depends upon processor la- 
tency and the mapping. In Table 3 we summarise the laten- 
cy in the result for a range of problem size. 
Table 3: Latency in obtaining the R and U matrix 
Latency (µs)
Algorithm 
SGR 100 
a. Figures given for 4 processors. 
b. Latency could be decreased to 39 by designing a 3 input adder. 
In the mixed mapping, the data-flow is both down and 
up the array. The latter is against the order of execution of 
the processors and can introduce large latencies. Further 
latency is introduced when the outputs from one row of 
cell operations are not produced in time for the scheduled 
operations. In this case the processor outputs must be de- 
layed until the next sequence of operations and the latency 
is increased accordingly. 
The latency of the discrete mapping does not depend 
upon the number of processors. It is slightly larger for 
small problem sizes, p, but grows more slowly with p. 
d) I/O requirements: For multi-chip implementations 
the number of interconnections is of interest. Table 4 summarises
the number of bits required per chip. The discrete 
mapping has a relatively high I/O requirement and for the 
linear array it would be necessary to transmit 8*2*20=320 
bits on every clock cycle (i.e. 4.8 GByte/s). This is well 
within the bounds of the latest FPGAs. Bandwidth is reduced 
when sparse linear arrays are employed.
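The bandwidth figure can be reproduced with a short calculation. The sketch assumes that 8*2*20 denotes 8 values of 2 parts (real and imaginary) at a 20-bit wordlength, and that the clock is the 120 MHz quoted for the SGR implementation in section 4.2; both readings are assumptions made for this check, and the function name is illustrative.

```python
def io_bandwidth(values, parts, wordlength_bits, clock_hz):
    """Return (bits per clock cycle, bytes per second) for a chip boundary."""
    bits_per_cycle = values * parts * wordlength_bits
    bytes_per_second = bits_per_cycle * clock_hz / 8
    return bits_per_cycle, bytes_per_second


# Assumed reading of 8*2*20 at the 120 MHz SGR clock rate.
bits_per_cycle, rate = io_bandwidth(8, 2, 20, 120_000_000)
print(bits_per_cycle, rate / 1e9)
```

Under these assumptions the result is 320 bits per cycle and 4.8 GByte/s, in agreement with the figures above.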




when combined with technology improvement is likely to 
yield capabilities far in excess of current architectures. 
Table 4: Number of bits required per chip
Bus size | Core I/O        | Input x       | R,u readout | Total bits
26       | 2*2*26 (x)      | uses core I/O | 2*26        | 156
20       | 6*2*20 (e,~)    | 2*2*20        | 2*20        | 360
e) Maximum throughput: This is achieved using the 
maximum number of processors. The full QR-array offers 
this, but requires that R and U matrices be updated on every
clock cycle and therefore only a single clock cycle delay 
is allowed around the R/u update path. If the loop delay is 
greater than one cycle then the number of processors that can be 
employed is reduced. Loop delay is dependent upon the latency
of the operations used to update R & U. The maximum
mum number of processors is summarised in Table 5 for a 
range of problem size. 
Table 5: Maximum number of processors 
            |           |       | Maximum number of processors
            |           |       | for problem size p
Algorithm   | Adder     | Delay | 4   | 8    | 16   | 32
CORDIC      | N/A       | 22    | 1   | 2    | 7    | 24
            | Float     | 11    | 1+1 | 1+3  | 2+11 | 3+46
SGR, ASIC   | Float     | 3     | 3+2 | 3+10 | 5+60 | 11+165
a. Wordlength of adder increased to maintain numerical performance and if 
necessary redundant adder can be used to maintain single-cycle delay. 
The large loop delay of CORDIC can seriously limit the 
maximum number of processors that can be usefully em- 
ployed on a problem. Using a fixed-point representation of 
R and U in either the FPGA or ASIC implementations of 
the SGR algorithm gives a single-cycle delay and com- 
pletely avoids this limitation. 
f) Power consumption: Power consumption is estimated
to be of the order of 10-20W per chip for both ASIC 
and FPGA. Therefore, an estimate of throughput per Watt 
can be derived from throughput per chip (i.e. Table 1). The 
ASIC solution offers between 5 and 10 times better 
throughput/Watt than FPGA. 
6. Conclusions 
The latest FPGAs offer a means to meet the perform- 
ance requirements of adaptive beamforming systems with- 
out adopting ASICs, providing low volume cost and low 
power consumption are not required. With Virtex 
XCV3200E, due out next year, it has been estimated that a 
performance of 20 GigaFLOPS should be possible. This is 
over an order of magnitude greater than programmable 
DSP. 
The rate of improvement in FPGA technology is staggering.
Like DRAMs they take full advantage of the massive
number of transistors offered by current fabrication 
technology, and these can be exploited by suitable parallel 
algorithm implementations. For DSP implementation we 
mixed mapping and floating-point SGR algorithm. It has 
better numerical performance than our CORDIC imple- 
mentation and uses well established floating-point opera- 
tions. At present, it consumes more chip area, but looking 
to the future, has more scope for optimisation both at the 
algorithm and implementation levels. Also, floating-point 
cores are likely to become standard, freely available and 
updated as FPGA technology progresses. Therefore, our 
array designs should be more portable. There is also more 
acceptance of designs by system designers based upon 
conventional floating-point operations. 
The SGR implementation has high inter-processor I/O. 
This is only an issue for multiple FPGA solutions and is 
supportable by current FPGAs. Furthermore, with future 
FPGAs, multi-chip solutions are only likely to be required 
in the most demanding of applications. 
7. References 
[1] A. Farina, Antenna-Based Signal Processing Techniques for 
Radar Systems, Artech House, 1991. 
[2] S. Haykin, Adaptive Filter Theory, 2nd Edition, Prentice Hall, 
ISBN 0-13-013236-5, 1991. 
[3] W. M. Gentleman and H. T. Kung, “Matrix triangularization 
by systolic arrays”, Proc. SPIE 298, Real-Time Signal 
Processing IV, pp. 19-26, 1981. 
[4] C. M. Rader, “VLSI Systolic Arrays for Adaptive Nulling”, 
IEEE Sig. Proc. Mag, Vol. 13, No. 4, pp. 29-49, 1996. 
[5] J. Volder, “The CORDIC Trigonometric Computing Technique”,
IRE Trans. Electron. Comput., Vol. EC-8, pp. 330-334, 
1959. 
[6] G. Lightbody, R. L. Walke, R. Woods, J. McCanny, “Novel 
Mapping of a Linear QR Architecture”, Proc. ICASSP, vol. IV, pp. 
1933-6, 1999. 
[7] R. Dohler, “Squared Givens Rotations”, IMA J. of Numerical
Analysis, Vol. 11, pp. 1-5, 1991. 
[8] S. Y. Kung, VLSI Array Processors, Prentice Hall, ISBN 0-13-
942749-X, 1988. 
[9] G. M. Megson, An Introduction to Systolic Algorithm Design, 
Clarendon Press, ISBN 0-19-853813-8, 1992. 
8. Acknowledgements 
This work was carried out as part of Technical Group 10 of 
the MOD Corporate Research Programme. I would like to ac- 
knowledge the work by Alex Jackson on the implementation of 
the CORDIC QR-array processor and Chris Booth on the float- 
ing-point divider. 
© British Crown Copyright 1999. 
Published with the permission of the Defence Evaluation and 
Research Agency on behalf of the Controller HMSO. 
