FPGA-based Low Latency Inverse QRD Architecture for Adaptive Beamforming in Phased Array Radars by Irfan, R. et al.
RADIOENGINEERING, VOL. 26, NO. 3, SEPTEMBER 2017 851 
DOI: 10.13164/re.2017.0851 SYSTEMS 
FPGA-based Low Latency Inverse QRD Architecture  
for Adaptive Beamforming in Phased Array Radars 
Raafia IRFAN1, Haroon ur RASHEED1, Waqas Ahmed TOOR2 
1 Dept. of Electrical Engineering, Pakistan Inst. of Eng. and Applied Sciences (PIEAS), Nilore, Islamabad-45650, Pakistan 
2 Dept. of Electrical Engineering, Capital University of Science and Technology (CUST), Express Way, Zone 5, 
Islamabad-45650, Pakistan 
fac2013.pieas@gmail.com 
Submitted November 17, 2016 / Accepted July 27, 2017 
 
Abstract. The main objective of this paper is to facilitate adaptive beamforming, which is one of the most challenging tasks in phased array radar receivers. Recursive least squares (RLS) is considered the most suitable adaptive algorithm for applications where beamforming is mandatory, because of its good numerical properties, high convergence rate and low misadjustment. In this paper, some RLS variants are discussed and the most numerically suitable one, inverse QR decomposition (IQRD) RLS, is selected for efficient adaptive beamforming. A novel architecture for IQRD RLS is also presented, which offers low latency and low area occupation for Field Programmable Gate Array (FPGA) implementation. This approach reduces the computations by utilizing the standard pipelining methodology. In addition, efficient adders and multipliers, together with a Look Up Table (LUT) based solution for square root and division, considerably enhance the performance of the algorithm. The proposed IQRD RLS architecture has been coded in Verilog and its performance analyzed in terms of throughput, hardware resources and efficiency.
Keywords 
Adaptive filters, beamforming, phased-array radars, 
Recursive Least Square (RLS), Inverse QR Decom-
position (IQRD), Field Programmable Gate Array 
(FPGA) 
1. Introduction 
Radar signal processing is an active area of research due to its many applications in radar systems, such as target tracking and navigation. Most radar systems are composed of multiple transmitting and receiving phased array antennas and are known as phased array radars. These phased array antennas are designed to transmit and receive spatially propagating signals, to focus on the arrival direction of the desired signals and to suppress the interfering signals [1], [2]. Beamforming is an important signal processing technique in phased array radars for directional signal transmission or reception [2]. Figure 1 shows a generalized block diagram of a four-channel adaptive beamformer receiver having a uniform linear array (ULA) structure and an adaptive weight calculation (AWC) unit. In ULA phased array radars, all the antenna elements are arranged along a line with uniform spacing s, in order to attenuate interferences arriving from different directions along with the desired signal [3]. In many cases the AWC unit is based on an overdetermined system of equations, and various algorithms have been developed to solve these equations. Such a system is also known as an adaptive antenna system. The AWC is usually based on adaptive algorithms such as Least Mean Square (LMS), Normalized Least Mean Square (NLMS), Recursive Least Square (RLS) and the Kalman filter, assuming that all the input signals to the ULA system are uncorrelated with each other [3], [4]. RLS is the most commonly used of these algorithms because of its fast convergence and good numerical properties, but it involves complex mathematical operations such as matrix inversion, which limits its application in real-time systems [5]. In numerical methods, the well-known QR decomposition (QRD) is used to decompose any matrix X into an orthogonal matrix Q (Q Q^T = I) and an upper triangular matrix R (X = QR). To overcome the computational complexity, various QR decomposition based RLS variants (QRD, Inverse QRD, Fast QRD) have been developed. They are widely used in a variety of applications and lead to more efficient architectures, better stability and higher accuracy [5], [6]. In this paper, a few RLS variants are compared and IQRD RLS is selected as the least computationally expensive algorithm suitable for hardware implementation.
The primary goal of this paper is to propose and develop a high-throughput architecture for the selected RLS variant, minimizing its latency and device area occupation for the adaptive beamforming application. The speed of the beamformer is limited by the recursive operation of the AWC unit; therefore, minimizing latency is a critical issue [4], [7]. In this paper, a novel architecture is proposed for IQRD RLS that reduces the computational complexity by using the standard pipelining methodology. It is mapped onto a Virtex-5 FPGA because this device provides powerful computational architecture features for floating-point arithmetic.

Fig. 1.  Adaptive beamformer receiver with ULA structure. [Block diagram: antenna inputs x1 to x4 with element spacing s, weights w1 to w4, summed output z, desired signal d, error e, and the AWC core.]
The paper is organized as follows. A brief introduction to adaptive beamforming is presented in Sec. 2. In Sec. 3, a thorough study is performed to select IQRD as the most suitable RLS variant. Section 4 describes the conventional systolic array architecture and the proposed new architecture based on low latency modules. Finally, in Sec. 5, the implementation of the AWC unit on a Virtex-5 FPGA is discussed to confirm the advantages of the new structure.
2. Adaptive Beamforming 
The core task of an adaptive beamformer receiver is to detect and estimate the target signal at the input of the adaptive antenna array. Such systems usually consist of an antenna array along with an adaptive processor (AWC) that adjusts the weights in real time, such that the main lobe of the radar is focused on the arrival direction of the target signal and undesired signals are suppressed [3], [6].
The output of the beamformer in Fig. 1 is given as:
  $\mathbf{e}(k) = \mathbf{d}(k) - \mathbf{z}(k)$   (1)
where z is the vector of the estimated signal, given as
  $\mathbf{z} = \mathbf{X}\mathbf{w}$   (2)
where w (m × 1) is the weight vector, X (N × m, N > m) is the input data matrix, d (N × 1) is the desired training signal and e (N × 1) is the additive noise error vector. The primary goal of the adaptive beamformer is to minimize the mean square error, such that z(k) ≈ d(k), and to find the optimized values of the unknown weight vector w:
  $\xi(k) = \sum_{i=0}^{k} \lambda^{k-i} \left| d(i) - \mathbf{w}^{T}(k)\,\mathbf{x}(i) \right|^{2} = \|\mathbf{e}(k)\|^{2}$   (3)
where λ is the forgetting factor in conventional RLS [4].
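As an illustration of (1) to (3), the following minimal Python sketch evaluates the exponentially weighted squared error for a candidate weight vector. It assumes NumPy, stores one snapshot x(i) per row of X, and uses a hypothetical function name (`weighted_ls_cost`) and synthetic data that are not part of the paper.

```python
import numpy as np

def weighted_ls_cost(w, X, d, lam=1.0):
    """Exponentially weighted squared error, cf. (3).

    X: (N, m) data matrix, one snapshot x(i) per row (assumed layout).
    d: (N,) desired signal, w: (m,) weights, lam: forgetting factor.
    """
    e = d - X @ w                                  # error vector, cf. (1) and (2)
    k = len(d) - 1
    forgetting = lam ** (k - np.arange(len(d)))    # lambda^(k-i) for i = 0..k
    return float(np.sum(forgetting * np.abs(e) ** 2))

# Synthetic example (hypothetical data): 50 snapshots, 4 weights
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
w_true = np.array([0.5, -1.0, 2.0, 0.25])
d = X @ w_true + 0.01 * rng.standard_normal(50)

# Batch least-squares weights minimize the cost for lam = 1
w_ls, *_ = np.linalg.lstsq(X, d, rcond=None)
cost_ls = weighted_ls_cost(w_ls, X, d)
cost_true = weighted_ls_cost(w_true, X, d)
```

With λ = 1 the cost reduces to the ordinary sum of squared errors; a value slightly below 1 progressively discounts older snapshots.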
3. RLS Variants for Adaptive Beamforming
In this section, a detailed study of RLS variants is performed to select the most efficient one for the AWC unit in the adaptive beamformer. First, a brief introduction to QR decomposition based on the Givens rotation (GR) method is given. A GR matrix
  $\mathbf{G} = \begin{bmatrix} c & s \\ -s & c \end{bmatrix}$
is used to premultiply a two-row matrix and transform it into an upper triangular matrix. Let the two-row matrix be given as:
  $\mathbf{A} = \begin{bmatrix} r_1 & r_2 & \cdots & r_m \\ x_1 & x_2 & \cdots & x_m \end{bmatrix}$.   (4)
Premultiplying matrix A by G we get
  $\begin{bmatrix} c & s \\ -s & c \end{bmatrix} \begin{bmatrix} r_1 & r_2 & \cdots & r_m \\ x_1 & x_2 & \cdots & x_m \end{bmatrix} = \begin{bmatrix} r'_1 & r'_2 & \cdots & r'_m \\ 0 & x'_2 & \cdots & x'_m \end{bmatrix}$   (5)
where
  $r'_1 = \sqrt{r_1^2 + x_1^2}$,   (6)
  $c = \dfrac{r_1}{\sqrt{r_1^2 + x_1^2}}$,   (7)
  $s = \dfrac{x_1}{\sqrt{r_1^2 + x_1^2}}$.   (8)
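A small numerical check of (5) to (8) can make the rotation concrete. The sketch below, assuming NumPy and real-valued data, builds G from the first column of a two-row matrix and confirms that the lower-left entry is annihilated; the helper name `givens` and the example values are illustrative only.

```python
import numpy as np

def givens(r1, x1):
    """Return (c, s) per (7) and (8); identity rotation if both inputs are 0."""
    norm = np.hypot(r1, x1)
    if norm == 0.0:
        return 1.0, 0.0
    return r1 / norm, x1 / norm

# Two-row matrix A as in (4), with arbitrary example values
A = np.array([[3.0, 1.0, 2.0],
              [4.0, 5.0, 6.0]])

c, s = givens(A[0, 0], A[1, 0])
G = np.array([[c, s],
              [-s, c]])

A_rot = G @ A   # first column becomes [sqrt(r1^2 + x1^2), 0], cf. (5) and (6)
```

Because G is orthogonal, the rotation preserves the norm of every column of A, which is the property that keeps Givens-based triangularization numerically well behaved.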
In [7], a numerically stable RLS variant, QRD RLS, is described which avoids the complex matrix inversion operations. Moreover, it can be implemented in a systolic array (pipelined) architecture. Hence, many researchers consider QRD RLS the most suitable adaptive algorithm for hardware implementation [8–10]. The QRD decomposes the input data matrix X through an orthogonal triangularization approach such as Givens rotation (GR), Householder transformation or Gram-Schmidt orthogonalization [11]:
  $\mathbf{X} = \mathbf{Q}\mathbf{R}$.   (9)
In most cases GR is considered the most numerically well-conditioned method. Q is the orthogonal matrix (Q Q^T = I) generated as a sequence of Givens rotation matrices [11], and R is the upper triangular matrix, which is also the Cholesky factor of the autocorrelation matrix X^T X, i.e., R^T R = X^T X [11]. So, equation (9) can be written as:
  $\mathbf{Q}^{T}\mathbf{X} = \mathbf{R}$.   (10)
Premultiplying (1) by Q^T and using (2) we get
  $\mathbf{Q}^{T}\mathbf{e} = \mathbf{Q}^{T}\mathbf{d} - \mathbf{Q}^{T}\mathbf{X}\mathbf{w}$.   (11)
Let e' = Q^T e and d' = Q^T d, so that equation (11) can be written as
  $\mathbf{e}' = \mathbf{d}' - \mathbf{R}\mathbf{w}$.   (12)
From (12) the weight update equation can be written as
  $\mathbf{w} = \mathbf{R}^{-1}\mathbf{d}', \quad \mathbf{e}' = \mathbf{0}$.   (13)
The above equation is the solution of the least squares problem, which is obtained by the back substitution process. Hence, the QRD RLS algorithm is composed of two steps. In the first step, the QR decomposition of the data matrix X (9) is performed, which has a systolic realization. In the second step, the back substitution process is carried out for the weight calculation (13), as shown in Fig. 2.

Fig. 2.  QRD RLS systolic array. [Block diagram: triangular array of boundary cells (BC) and internal cells (IC) holding r00, r01, r02, r11, r12, ..., exchanging c and s with start/rdy handshakes, followed by a back substitution block that outputs w.]
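The two steps of QRD RLS described above can be sketched in a few lines of Python. This is only a batch illustration under stated assumptions: it uses NumPy's Householder-based `numpy.linalg.qr` instead of a Givens systolic array, real-valued synthetic data, and a hypothetical helper `back_substitution`.

```python
import numpy as np

def back_substitution(R, b):
    """Solve R w = b for an upper triangular, non-singular R."""
    m = R.shape[0]
    w = np.zeros(m)
    for i in range(m - 1, -1, -1):
        # subtract the already-solved unknowns, then divide by the diagonal
        w[i] = (b[i] - R[i, i + 1:] @ w[i + 1:]) / R[i, i]
    return w

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 3))    # 20 snapshots, 3 weights (overdetermined)
d = rng.standard_normal(20)

# Step 1: QR decomposition, cf. (9); reduced form gives Q (20x3) and R (3x3)
Q, R = np.linalg.qr(X)
d_rot = Q.T @ d                     # rotated desired signal d'

# Step 2: back substitution for the weights, cf. (13)
w_qrd = back_substitution(R, d_rot)

# Reference least-squares solution for comparison
w_ref, *_ = np.linalg.lstsq(X, d, rcond=None)
```

The back substitution loop is inherently sequential, which is the part of QRD RLS that the inverse QRD variant is designed to avoid.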
IQRD RLS is another important variant of RLS, proposed by Alexander and Ghirnikar [12] to avoid the numerically tedious back substitution process for weight calculation. The core idea is to update the inverse of the matrix R in (13) in each iteration of the systolic array, so that the filter weights are computed within the systolic array. In [11], a new weight update equation is derived, which is given as:
  $\mathbf{w}(k) = \mathbf{w}(k-1) + e(k)\,\mathbf{R}^{-1}(k)\,\mathbf{R}^{-T}(k)\,\mathbf{x}(k)$   (14)
where the term R^{-1}(k) R^{-T}(k) x(k) is known as the Kalman gain. The systolic realization of IQRD RLS is discussed in Sec. 4. Since no back substitution is required, IQRD RLS is less computationally expensive than QRD RLS.
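The weight update (14) can be checked numerically with a short Python sketch. Several assumptions are made here that are not taken from the paper: λ = 1, real-valued data, e(k) interpreted as the a priori error d(k) − x^T(k) w(k−1), and R(k) recomputed from a batch QR at every step for clarity, whereas an IQRD array would update R^{-1}(k) recursively. The function name `inverse_qrd_step` is hypothetical.

```python
import numpy as np

def inverse_qrd_step(w_prev, X_k, x_k, d_k):
    """One weight update per (14) with lambda = 1.

    X_k holds all snapshots up to and including x_k (one per row).
    """
    R = np.linalg.qr(X_k, mode="r")       # upper triangular factor R(k)
    R_inv = np.linalg.inv(R)
    e = d_k - x_k @ w_prev                # a priori error (assumed meaning of e(k))
    gain = R_inv @ R_inv.T @ x_k          # R^{-1}(k) R^{-T}(k) x(k)
    return w_prev + e * gain

rng = np.random.default_rng(2)
X = rng.standard_normal((30, 3))
d = rng.standard_normal(30)

# Initialize with a batch solution on the first few snapshots
n0 = 5
w, *_ = np.linalg.lstsq(X[:n0], d[:n0], rcond=None)

# Then process the remaining snapshots one at a time
for k in range(n0, 30):
    w = inverse_qrd_step(w, X[:k + 1], X[k], d[k])

w_batch, *_ = np.linalg.lstsq(X, d, rcond=None)
```

Under these assumptions the recursively updated weights coincide with the batch least-squares solution, without any back substitution step.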
In Fast QRD (FQRD) RLS, the matrix update equation is replaced by vector update equations. Detailed derivations of these equations are given in [11]; they further reduce the numerical complexity compared with the QRD RLS and IQRD RLS algorithms. Although FQRD RLS is numerically less complex, it has no weight update procedure, so it is limited to applications that only estimate the output error vector [13].
In this section, three important RLS variants and their properties are discussed and summarized in Tab. 1. It is worth mentioning that QRD and IQRD are computationally more complex than FQRD, but FQRD has no weight update procedure, so it cannot be considered for the adaptive beamforming application. On the other hand, QRD involves more computations (a two-step procedure), so it is also not preferable. For these reasons, IQRD RLS is the best suited algorithm for hardware implementation.
 
Algorithm  | Computational Complexity | Weight Update Procedure
QRD RLS    | O(N^2)                   | Back substitution
IQRD RLS   | O(N^2)                   | Within systolic array
FQRD RLS   | O(N)                     | No procedure
Tab. 1. Properties of RLS variants.
3.1 Performance Analysis of Adaptive Beamforming in Terms of Radiation Pattern
The performance of adaptive beamforming using the RLS adaptive algorithm is analyzed by varying different beamformer parameters. Researchers have analyzed adaptive beamformers for smart antennas based on the adaptive algorithms Least Mean Squares (LMS), Sample Matrix Inversion (SMI), Recursive Least Squares (RLS) and Conjugate Gradient Method (CGM). In these studies the inter-element spacing and the number of antenna elements are varied for each algorithm, and the radiation patterns, amplitude response, mean square error and absolute weights are examined [14], [15]. We assume a ULA beamformer with an operating frequency of 1 GHz. It receives an echo signal from a single target at an azimuth of 30° and an interference signal at –45°. Simulations show that the RLS adaptive algorithm rejects the interference at –45° and enhances the target signal at 30°. The following procedures are adopted for further analysis of the RLS based adaptive beamformer:
• Change the number of antenna elements.
• Change the spacing between the antenna elements.
Figure 3 depicts the performance of the beamformer with different numbers of antenna elements. The selected numbers of antennas are 10, 15 and 20. It is clear from the radiation pattern that, as the number of antennas increases, the main lobe of the beamformer becomes narrower and points more accurately towards the target angle, and the side lobes are suppressed further. The performance is also judged in terms of the spacing between the antenna elements. The selected spacings are 0.5, 1.5 and 2.5 mm. Figure 4 shows the resulting radiation pattern, in which the performance of the adaptive beamformer improves as the distance between the antenna elements is increased. The side lobes are also suppressed more effectively.

Fig. 3.  Radiation pattern with different number of antennas N.
Fig. 4.  Radiation pattern with different inter antenna element space D.
Hence, the performance and accuracy of the RLS based adaptive beamformer improve as the number of antenna elements and the distance between the antenna elements are increased.
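The narrowing of the main lobe with the number of elements can be illustrated with a simple array-factor computation. Note that the sketch below is not the RLS-adapted pattern of Fig. 3: it uses conventional fixed steering weights towards 30°, assumes a half-wavelength element spacing, and the function names (`array_factor_db`, `beamwidth_deg`) and parameters are hypothetical.

```python
import numpy as np

def array_factor_db(n_elem, spacing_wl, steer_deg, theta_deg):
    """Normalized ULA array factor in dB for conventional steering weights.

    spacing_wl is the element spacing expressed in wavelengths.
    """
    n = np.arange(n_elem)
    # difference of direction sines between arrival angle and steering angle
    u = np.sin(np.deg2rad(theta_deg)) - np.sin(np.deg2rad(steer_deg))
    phase = 2j * np.pi * spacing_wl * np.outer(u, n)
    af = np.abs(np.exp(phase).sum(axis=1)) / n_elem
    return 20 * np.log10(np.maximum(af, 1e-12))

def beamwidth_deg(n_elem, spacing_wl=0.5, steer_deg=30.0):
    """Angular width over which the pattern stays above -3 dB."""
    theta = np.linspace(-90.0, 90.0, 18001)       # 0.01 degree grid
    pattern = array_factor_db(n_elem, spacing_wl, steer_deg, theta)
    step = theta[1] - theta[0]
    return float(np.count_nonzero(pattern >= -3.0) * step)

# Half-power beamwidth for the element counts used in Fig. 3
widths = {n: beamwidth_deg(n) for n in (10, 15, 20)}
for n, bw in widths.items():
    print(f"N = {n:2d}: -3 dB beamwidth = {bw:.2f} deg")
```

With half-wavelength spacing no grating lobes appear in the visible region, so the samples above −3 dB all belong to the main lobe; larger spacings would need a more careful main-lobe search.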
4. IQRD Systolic Array Architecture 
Systolic arrays are a well-known concept for matrix triangularization and inversion [16], [17], and the same idea is adopted for the IQRD-RLS algorithm. Figure 5 shows the conventional systolic realization of the IQRD-RLS algorithm. The IQRD systolic triangular array consists of four types of processing cells: Boundary Cell (BC), Internal Cell (IC), Inverse Internal Cell (IIC) and Final Cell (FC).
A Boundary Cell determines the rotation parameters cos α and sin α used to rotate the complex input data rows x of the data matrix X, and updates the corresponding row values of the upper triangular matrix R. It also sends the rotation parameters to the neighboring Internal Cells on its right [16], [18]. The corresponding equations (6), (7) and (8) are modified as:
  $c = \cos\alpha = \dfrac{r}{\sqrt{r^{2} + |x|^{2}}}$,   (15)
  $s = \sin\alpha = \dfrac{x}{\sqrt{r^{2} + |x|^{2}}}$.   (16)
The BC is more complex than the IC, IIC and FC because it performs square-root and inverse operations, whereas all the other cells perform only multiplication and addition operations. Researchers have devoted considerable effort to finding a square-root and division free BC unit for systolic arrays. In [19], the following data transformations are applied to obtain square-root and division free expressions for the transformed data:
  $r_j = l_a^{-1/2}\, a_j, \quad x_j = l_b^{-1/2}\, b_j, \quad r'_j = {l'_a}^{-1/2}\, a'_j, \quad j = 1, 2, \ldots, m$,   (17)
  $x'_j = {l'_b}^{-1/2}\, b'_j, \quad j = 2, 3, \ldots, m$.   (18)
Although these transformations increase the number of multiplications, they remove the square-root and division operations and are therefore a good approach to improving the system latency. In [8], a single-cycle look-up table (LUT) based approach is used instead to replace the square-root and inverse operations and to optimize the QRD RLS latency. This approach considerably improves the overall performance of the QRD RLS algorithm.
An Internal Cell (IC) receives the rotation parameters cos α and sin α from the Boundary Cell (BC) in its row, performs the rotation on the input values of the data matrix X and updates the x and r values. The corresponding update equations are given as follows [16], [17]:
  $r_n = c \cdot r + s \cdot x$,   (19)
  $x_n = c \cdot x - s \cdot r$.   (20)
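The interplay of boundary and internal cells can be modeled functionally in Python. The sketch below is a sequential, real-valued model of the triangular array (no pipelining, no inverse internal or final cells), assuming NumPy; the function names are hypothetical. Each input row is passed down the array, where the boundary cell on the diagonal computes c and s as in (15) and (16) and the internal cells apply (19) and (20).

```python
import numpy as np

def boundary_cell(r, x):
    """Return (c, s, r_new) for the diagonal element, cf. (15) and (16)."""
    norm = np.hypot(r, x)
    if norm == 0.0:
        return 1.0, 0.0, 0.0
    return r / norm, x / norm, norm

def internal_cell(c, s, r, x):
    """Rotate one (r, x) pair, cf. (19) and (20)."""
    return c * r + s * x, c * x - s * r

def triangular_array_update(R, x_row):
    """Absorb one data row into the upper triangular matrix R (in place)."""
    x = np.array(x_row, dtype=float)
    m = len(x)
    for i in range(m):
        c, s, R[i, i] = boundary_cell(R[i, i], x[i])
        for j in range(i + 1, m):
            R[i, j], x[j] = internal_cell(c, s, R[i, j], x[j])
    return R

rng = np.random.default_rng(4)
X = rng.standard_normal((12, 3))   # hypothetical data: 12 rows, 3 columns

R = np.zeros((3, 3))
for row in X:
    triangular_array_update(R, row)
```

After all rows have been absorbed, R is the Cholesky factor of X^T X, i.e., the same triangular factor (up to row signs) that a batch QR decomposition would produce.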
Inverse Internal Cells (IIC) receive the sine and cosine values of the rotation angle from the internal cell in the same row. Part of the IIC operation is similar to that of the internal cell, i.e., it rotates the input data by multiplying it with the input rotation parameters. In addition, the inverse internal nodes of IQRD-RLS use the factor λ^(–1/2) in (12) and (13). In this paper, we consider the case λ = 1, so that the IIC is equivalent to the IC units.
The Final Cell (FC) performs two functions: it receives the rotation parameters cos α and sin α to rotate the input data, and it generates the updated filter weights using the following relation [10]
 
22( 1)w r      x  (21) 
where γ is computed using the following relation
  $\gamma(k) = \prod_{i=0}^{N} \cos\alpha_i(k)$.   (22)
Fig. 5.  Conventional IQRD systolic array.
Fig. 6.  The proposed IQRD architecture.

A design methodology is proposed to optimize the throughput and latency of the IQRD-RLS architecture. Figure 6 presents a block diagram of the proposed architecture, which utilizes the standard pipelining methodology and reduces the number of BC, IC and IIC units by at least 50% (Tab. 2). We have further optimized the IQRD RLS architecture by replacing the square-root and division operations in the BC with a single-cycle look-up table (LUT), as shown in Fig. 7. Generally, a LUT is a table that determines the output for any given input; in an FPGA, truth tables of this kind are used to implement any digital logic. In [8], the same methodology is adopted for the QRD RLS algorithm. This scheme avoids a complex hardware implementation, considerably decreases the latency and improves the speed of the overall system. In some previous works, the conventional implementation of the BC is based on the CORDIC algorithm, for non-recursive applications [20], [21]. Digital beamforming, however, is a recursive procedure, so the conventional approach would not pay off because it leads to considerable latency. In the proposed architecture, pipeline registers (R_reg, x_reg, Rinv_reg) are introduced, which not only hold the updated values but also feed them back to the BC, IC and IIC cells. The controller CON handles the control signals as well as the feedback timing. Hence, equations (15) and (16) can be modified as:
 
  $A_s = r^{2} + |x|^{2} = r^{2} + x_i^{2} + x_r^{2}$,   (23)
  $A_{ss} = \sqrt{r^{2} + |x|^{2}}, \quad A_i = \dfrac{1}{A_{ss}}$,   (24)
  $c = r \cdot A_i$,   (25)
  $s_i = x_i \cdot A_i, \quad s_r = x_r \cdot A_i$.   (26)
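The data flow of (23) to (26) can be written out as a short Python sketch. It is a floating-point model only, assuming a real r and a complex input x = x_r + j·x_i with A_s > 0; the function name `boundary_cell_complex` is hypothetical, and the square root and division are computed directly at the point where the proposed BC would consult its LUT.

```python
import math

def boundary_cell_complex(r, x):
    """Rotation parameters for real r and complex x, cf. (23)-(26)."""
    a_s = r * r + x.real ** 2 + x.imag ** 2   # (23): squared magnitude
    a_i = 1.0 / math.sqrt(a_s)                # (24): A_i = 1 / A_ss (LUT output in hardware)
    c = r * a_i                               # (25)
    s_i = x.imag * a_i                        # (26), imaginary part
    s_r = x.real * a_i                        # (26), real part
    return c, s_r, s_i

# Example values (arbitrary): r = 2, x = 1 - 2j, so A_s = 9 and A_i = 1/3
c, s_r, s_i = boundary_cell_complex(2.0, complex(1.0, -2.0))
```

Once A_i is available, c, s_r and s_i need only one multiplication each, which is why a single-cycle table for A_i removes the slow operations from the boundary cell.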
The IC unit receives the c, si and sr values from the neighboring BC to compute (19) and (20) and to update the values of xni, xnr and rn. These computations require only adder and multiplier units, as shown in Fig. 8. The updated values xni and xnr are stored in the pipeline register x_reg and are also fed back to the IC module, while the updated rn is stored in the internal memory of the IC. The c, si and sr values are also passed on to the neighboring IC (Fig. 8).
One of the major differences between the LUT implemented in [8] and the one in this paper is the number representation format. In [8], the latency optimization of the LUT is discussed for a defined N-bit fixed-point number format, comprising 1 sign bit, i integer bits (i > 1) and f fraction bits (f > 1). However, the floating-point multiplier and adder are not discussed for the newly defined number format.

Fig. 7.  Boundary Cell (BC) architecture.
Fig. 8.  Internal cell (IC) architecture.

Data matrix X [size] | No. of BC              | No. of IC              | No. of IIC
                     | Conventional | Proposed | Conventional | Proposed | Conventional | Proposed
3×3                  | 4            | 1        | 6            | 3        | 6            | 3
4×4                  | 5            | 1        | 10           | 4        | 10           | 4
5×5                  | 6            | 1        | 15           | 5        | 15           | 5
Tab. 2. Comparison between the proposed and conventional IQRD architecture.
In this paper, we have utilized the same LUT approach for the IQRD RLS architecture, but with the single-precision IEEE 754 format because of its wider range compared with fixed point. Single-precision numbers are stored in 32 bits: 1 for the sign, 8 for the exponent and 23 for the fraction [22]. In the proposed architecture, the computation of Ai and Ass is handled in a single LUT, with As as the 32-bit input address field to the LUT, as shown in Fig. 7. Simulation results show that 97% of the As values are greater than 1, as shown in Fig. 9a, when all input variables in (23) are uniformly distributed within the range [–4, 4], and their corresponding output values (Ai) are very small (Fig. 9b). Hence, the probability of a value less than 1 occurring is very low, which reduces the range of input values As for the LUT. The output of the LUT, Ai, is 32 bits wide and is used to compute the c, si and sr values for the IC units.

Fig. 9.  (a) Histogram of As. (b) Histogram of Ai.
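To make the idea of addressing a table with the bits of a floating-point number concrete, the following Python sketch builds a small inverse square-root table indexed by the upper bits of the IEEE 754 single-precision pattern of A_s. It is not the authors' LUT: the 16-bit address (sign, 8 exponent bits, 7 fraction bits), the covered range [0.5, 64] and all function names are assumptions chosen for illustration, using only the standard `struct` and `math` modules.

```python
import math
import struct

def float32_bits(v):
    """Bit pattern of v rounded to IEEE 754 single precision."""
    return struct.unpack(">I", struct.pack(">f", v))[0]

def bits_float32(b):
    """Single-precision value encoded by the 32-bit pattern b."""
    return struct.unpack(">f", struct.pack(">I", b))[0]

ADDR_SHIFT = 16   # hypothetical: address = sign + 8 exponent + 7 fraction bits

def build_inv_sqrt_lut(lo=0.5, hi=64.0):
    """Table of 1/sqrt(A_s) for every address between lo and hi."""
    lut = {}
    first = float32_bits(lo) >> ADDR_SHIFT
    last = float32_bits(hi) >> ADDR_SHIFT
    for addr in range(first, last + 1):
        # evaluate at the midpoint of the input interval mapped to this address
        mid = bits_float32((addr << ADDR_SHIFT) | (1 << (ADDR_SHIFT - 1)))
        lut[addr] = 1.0 / math.sqrt(mid)
    return lut

LUT = build_inv_sqrt_lut()

def inv_sqrt_lut(a_s):
    """Single table lookup replacing the square root and the division."""
    return LUT[float32_bits(a_s) >> ADDR_SHIFT]

# Relative error of the lookup for a few sample inputs
samples = [1.0, 2.5, 9.0, 47.9]
rel_errors = [abs(inv_sqrt_lut(a) * math.sqrt(a) - 1.0) for a in samples]
```

For positive floats the bit pattern increases monotonically with the value, so truncating it yields a valid piecewise-constant table; the number of retained fraction bits trades table size against accuracy.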
5. Implementation and Results 
In this paper, we have optimized the BC module of the systolic array in terms of its complex mathematical operations. Many researchers have also worked on single-precision floating-point multipliers, which are likewise complex to implement in hardware [23–25], but we focus on the divider and square-root operations and use the multiplier core of the FPGA. The Xilinx Virtex-5 FPGA is the selected platform for hardware implementation.
As already discussed, the adaptive algorithm is a basic part of the beamforming system, and we select the IQRD-RLS algorithm as the most suitable one. Three types of architectural modules are designed to judge the performance of the algorithm in terms of latency, throughput (1 sample / (clock period × clock cycles)), error and efficiency (throughput / number of slices). The details are:
1. MD0 (module 0), a conventional systolic array architecture with single-precision floating-point multipliers, dividers and a square root module (generated with the Coregen software).
2. MD1 (module 1), the proposed systolic array architecture with single-precision floating-point multipliers, dividers and a square root module (generated with the Coregen software).
3. MD2 (module 2), the proposed systolic array architecture with a LUT based BC for the divider and square root operations.
Here MD stands for module. A radar target detection scenario is simulated in MATLAB to generate the data points for testing MD0, MD1 and MD2. First, a received radar signal is modelled with one target at an azimuth angle of 30° and two interferences at azimuths of 15° and 45°, respectively. The target, the interferers and the radar are assumed to lie in the same plane, i.e., the elevation angle is equal to zero. Figure 10 shows the radiation pattern. The received signal impinges on a 3-element ULA beamformer based phased array radar to generate almost 10^4 data points.
In the adaptive beamforming application, latency is an important performance metric which affects the overall system throughput. To judge the effect of the LUT on the algorithm performance, we compare the latencies and throughputs of the architectural modules. Table 3 details the latency of each unit in the system individually and the throughput of the adaptive filter (3-tap) for the IQRD-RLS algorithm with the three different architectural modules, i.e., MD0, MD1 and MD2. These modules are modelled in Verilog, and their behavioral descriptions are placed and routed on a Virtex-5 device with the ISE 14.1i tool. It is clear from Tab. 3 that the proposed MD2 module offers an approximately 9 times reduction in latency compared with the MD0 and MD1 modules because of its optimized single-cycle LUT based architecture. Moreover, the MD2 module of IQRD-RLS has a higher throughput than the QRD-RLS M2 module. It is again important to note that in [8] QRD-RLS is implemented using a new floating-point format, whereas we implement IQRD-RLS with the standard IEEE 754 single-precision floating-point representation. This again indicates that IQRD-RLS is the most suitable algorithm for the adaptive beamforming application.
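The throughput definition used here (one sample per clock period × clock cycles) can be reproduced with a few lines of Python. The total latencies are taken from Tab. 3, but the 200 MHz clock is purely an assumption for illustration, since the clock frequency is not stated in the text; the function name is hypothetical.

```python
def throughput_msps(clock_hz, latency_cycles):
    """Throughput in MSamples/s: one sample per (clock period x latency cycles)."""
    clock_period_s = 1.0 / clock_hz
    return 1.0 / (clock_period_s * latency_cycles) / 1e6

ASSUMED_CLOCK_HZ = 200e6   # assumption for illustration; not given in the paper

# Total latencies in clock cycles, as listed in Tab. 3
latency = {"MD0": 455, "MD1": 455, "MD2": 52}

for name, cycles in latency.items():
    print(name, round(throughput_msps(ASSUMED_CLOCK_HZ, cycles), 2), "MSamples/s")

# Latency ratio between the conventional and the LUT based module
speedup = latency["MD0"] / latency["MD2"]
```

The latency ratio of 455 to 52 cycles is 8.75, which corresponds to the roughly 9 times reduction quoted above; the absolute throughput figures scale linearly with whatever clock frequency the placed-and-routed design actually achieves.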
The accuracy of the proposed architecture in the presence of the LUT is determined by measuring the difference between the weight coefficients of the ideal method and those of the proposed method in MD2 over 10^4 simulated samples. This difference is also known as the weight error. Results show that the weight error is less than 0.00216 for 96% of the samples. Figure 11 illustrates the filter weights of the ideal and proposed methods for the 3-tap beamformer (for clarity of the image, only 200 iterations are plotted).
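A weight-error statistic of this kind could be computed as in the following Python sketch. The exact error definition used for the reported figures is not given in the text, so the sketch assumes the largest absolute per-tap difference at each iteration; the function name and the synthetic weight trajectories are hypothetical.

```python
import numpy as np

def weight_error_fraction(w_ideal, w_impl, threshold):
    """Fraction of iterations whose largest per-tap weight error is below threshold.

    w_ideal, w_impl: arrays of shape (iterations, taps).
    """
    err = np.max(np.abs(np.asarray(w_ideal) - np.asarray(w_impl)), axis=1)
    return float(np.mean(err < threshold))

# Toy data (hypothetical): reference weights plus a small bounded perturbation
rng = np.random.default_rng(3)
w_ideal = rng.standard_normal((1000, 3))
w_impl = w_ideal + 1e-3 * rng.uniform(-1.0, 1.0, size=(1000, 3))

frac = weight_error_fraction(w_ideal, w_impl, threshold=0.00216)
```

In practice `w_ideal` would come from a double-precision reference model and `w_impl` from the hardware simulation of the LUT based module.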
The device utilization details (hardware summary) are given in Tab. 4. It is noteworthy that MD2 outperforms the other modules in terms of area efficiency and throughput. The conventional architecture MD0 occupies the largest area and has the lowest area efficiency, while MD1 occupies much less area than MD0. The hardware efficiency and throughput are determined to allow a reasonable comparison between the architectures. To the best of our knowledge, this is the first implementation of the IQRD RLS algorithm on an FPGA, so the performance of MD2 is also measured after synthesis for different target FPGAs, as shown in Tab. 5. The throughput of the proposed IQRD architecture on all FPGAs is better than that of the QRD architecture in [8]. Hence, the simulation results indicate that IQRD RLS is the most suitable algorithm for adaptive beamforming due to its better throughput, hardware efficiency and latency.
 
Fig. 10. Radiation pattern. 
 
 
Fig. 11.  Weight error. 
 
 
Algorithm                  | IQRD RLS            | QRD RLS [8]
Module                     | MD0  | MD1  | MD2   | M2
BC                         | 46   | 46   | 3     | 5
IC                         | 5    | 5    | 5     | 3
FC                         | 5    | 5    | 5     | ---
Total                      | 455  | 455  | 52    | 76
Throughput [MSamples/sec]  | 0.44 | 0.44 | 3.92  | 2.45
Tab. 3. Comparison between IQRD and QRD [8] latency and throughput.
 
Hardware Resources | MD0      | MD1     | MD2
Slices             | 61%      | 28%     | 3%
4-input LUT        | 18%      | 14.4%   | 5%
DSP48              | 55%      | 41%     | 5%
ROM                | 0%       | 0%      | 52%
Avg. Area          | 38.2%    | 21.6%   | 13.92%
Area Efficiency    | 0.011518 | 0.02037 | 0.22923
Tab. 4. Hardware resources.
 
No. | System                | Slices | Throughput [MSamples/sec]
1.  | Spartan-3 (exc3s500)  | 5891   | 3.01
2.  | Virtex-4 (vsx55)      | 5603   | 3.24
3.  | Virtex-5 (vtx150)     | 5001   | 3.92
Tab. 5. Comparison between FPGAs.
6. Conclusion  
A low latency architecture is presented that implements the IQRD algorithm for the beamforming application using system level hardware tools. The proposed architecture improves the divider and square root operations via a LUT approach. Moreover, it is mapped onto a Xilinx Virtex-5 FPGA and achieves minimal error between the real and ideal weight values. Hence, the proposed architecture performs well in terms of area efficiency, throughput and latency, and IQRD RLS is the most suitable algorithm for adaptive beamforming. In future work, the multiplier and adder units can also be implemented via LUTs for further improvement in latency and throughput.
References 
[1] FENN, A. J. Adaptive antenna and phased arrays. Lectures MIT 
Lincoln Laboratory, Available at: 
http://www.ll.mit.edu/workshops/education/videocourses/antennas
/lecture8/lecture.pdf 
[2] MAILLOUX, R. J. Phased Array Radar Handbook. 2nd ed., rev. 
London (UK): Artech House, 2005. ISBN: 1580536891 
[3] FLORENS, C., RAIDA, Z. Adaptive beamforming using genetic 
algorithms. Radioengineering, 1998, vol. 7, no. 3, p. 1–6.  
[4] HAYKIN, S. O. Adaptive Filter Theory. 4th ed., Prentice Hall, 
2003. (p. 436–465, 506–534) ISBN: 0130901261 
[5] DINIZ, P. S. R. Adaptive Filtering: Algorithms and Practical 
Implementation. 2nd ed. USA: Kluwer Academic Press, 1997. 
(p. 195–389). DOI: 10.1007/978-0-387-68606-6 
[6] JIAN LI, STOICA, P. Robust Adaptive Beamforming. Wiley Series 
in Telecommunication and Signal Processing, 2006, (p. 49–79). 
ISBN: 9780471678502 
[7] MCWHIRTER, J. G. Recursive Least Square Minimization using a 
Systolic Array. Proc SPIE 0431, Real-time Signal Processing VI, 
1983, vol. 431 (p. 105–112). DOI: 10.1117/12.936448 
[8] ALIZADEH, M. S., BAGHERZADEH, J., SHARIFKHANI, M. 
A low latency QRD RLS architecture for high throughput 
applications. IEEE Transactions on Circuits and Systems II: 
Express Briefs, 2016, vol. 63, no. 7, p. 708–712. 
DOI: 10.1109/TCSII.2016.2530169 
[9] DESHPANDE, A. P., MURTHY. N. S., GOVIND RAO, D., et al. 
Efficient filter implementation using QRD-RLS algorithm for 
phased array radars application. In IEEE International Conference 
for Technological Advances in Electrical, Electronics and 
Computer Engineering (TAEECE). Konya (Turkey), 2013. DOI: 
10.1109/TAEECE.2013.6557275 
[10] MARTINEZ, M. E. I. Implementation of QRD RLS algorithm on 
FPGA: Application to Noise Canceller System. IEEE Latin 
America Transactions, 2011, vol. 9, no. 4, p. 458–462. DOI: 
10.1109/TLA.2011.5993728 (in Spanish) 
[11] APOLINARIO Jr., J. A. (Ed.) QRD RLS Adaptive Filtering. New 
York (USA): Springer Media, 2009. (p. 51–113). DOI: 
10.1007/978-0-387-09734-3 
[12] ALEXANDER, S. T., GHIRNIKAR, A. L. A method for recursive 
least squares filtering based upon an inverse QR decomposition. 
IEEE Transactions on Signal Processing, 1993, vol. 41, no. 1, 
p. 20–30. DOI: 10.1109/TSP.1993.193124 
[13] SHOAIB, M., WERNER, S., APOLINARIO JR., J. A., et al., 
Equivalent output-filtering using Fast QRD-RLS algorithm for 
burst-type training applications. In IEEE International Symposium 
on Circuits and Systems (ISCAS). Island of Kos (Greece), 2006, 
DOI: 10.1109/ISCAS.2006.1692540 
[14] SAXENA, P., KOTHARI. A. G. Performance analysis of adaptive 
beamforming algorithms for smart antennas. IERI Procedia 
(special issue International Conference on Future Information 
Engineering), 2014, vol. 10, p. 131–137. DOI: 
10.1016/j.ieri.2014.09.101 
[15] AZIZ, A., QURESHI, M. A, JUNAID IQBAL, J., et al. 
Performance and quality analysis of adaptive beamforming 
algorithms (LMS, CMA, RLS & CGM) for smart antennas. In 
IEEE 3rd International Conference on Computer and Electrical 
Engineering (ICCEE). Chengdu (China), 2010, p. V6-302–V6-
306. ISBN: 978-1-4244-7224-6 
[16] GENTLEMAN, W. M., KUNG, H. T, Matrix triangularization by 
systolic arrays. Proc. SPIE Real-time Signal Processing, 1981, 
vol. 298, p. 19–26. DOI: 10.1117/12.932507 
[17] MILOVANOVIC, E. I., STOJCEV, M. K., MILOVANOVIC, I. 
Z., et al. Design of linear systolic arrays for matrix multiplication. 
Advances in Electrical and Computer Engineering (AECE), 2014, 
vol. 14, no. 1, p. 37–42. DOI: 10.4316/AECE.2014.01006 
[18] TANG, C. E. T., LIU, K. J. R., TRETTER, S. A. Optimal weight 
extraction for adaptive beamforming using systolic array. IEEE 
Transactions on Aerospace and Electronics Systems, 1994, 
vol. 30, no. 2, p. 367–385. DOI: 10.1109/7.272261 
[19] FRANTZESKAKIS, E. N., LIU, K. J. R. A class of square root 
and division free algorithms and architectures for QRD based 
adaptive signal processing. IEEE Transactions on Signal 
Processing, 1994, vol. 42, no. 9, p. 2455–2469. DOI: 
10.1109/78.317867 
[20] KARKOOTI, M., CAVALLARO, J. R., DICK, C. FPGA 
implementation of matrix inversion using QRD-RLS algorithm. In 
Conference Record of IEEE 39th Asilomar Conference on Signals, 
Systems and Computers. Pacific Grove (CA, USA), 2005, 
p. 1625–1629. DOI: 10.1109/ACSSC.2005.1600043 
[21] YHONG, T. M., MADHUKUMAR, A. S., CHIN, F. QRD-RLS 
adaptive equalizer and its CORDIC-based implementation for 
CDMA systems. International Journal on Wireless and Optical 
Communications, 2003, vol. 1, no. 1, p. 25–39. DOI: 
10.1142/S0219799503000033 
[22] HOLLASCH, S. IEEE Standard 754 Floating Point Numbers. 
Available at: 
http://steve.hollasch.net/cgindex/coding/ieeefloat.html 
[23] RAMESH, A. P, TILAK, A. V. N., PARSAD, A. M. An FPGA 
based high speed IEEE-754 double precision floating point 
multiplier using Verilog. In 2013 International Conference on 
Emerging Trends in VLSI Embedded System, Nano Electronics 
and Telecommunication System (ICEVENT). Tiruvannamalai 
(India), 2013. DOI: 10.1109/ICEVENT.2013.6496575 
[24] KODALI, R. K, GUNDABATHULA, S. K., BOPPANA, L. FPGA 
implementation of IEEE 754 floating point Karatsuba multiplier. 
In International Conference on Control, Instrumentation, Commu-
nication and Computational Technologies (ICCICCT). Kanya-
kumari (India), 2014. DOI: 10.1109/ICCICCT.2014.6992974 
[25] PALDURAI, K., HARIHARAN, K. FPGA implementation of 
delay optimized single precision floating point multiplier. In IEEE 
International Conference on Advanced Computing and 
Communication System (ICACCS). Coimbatore (India), 2015. 
DOI: 10.1109/ICACCS.2015.7324094 
About the Authors ... 
Raafia IRFAN received her MSc degree in Electronics from Quaid-e-Azam University (QAU) in 2004 and her MS degree in System Engineering from the Pakistan Institute of Engineering and Applied Sciences (PIEAS) in 2006. Currently, she is working towards the Ph.D. degree in Electrical Engineering at PIEAS. For the last 10 years, she has been working as a senior scientist in a research organization. Her expertise and research interests are FPGA based communication system design, radar signal processing and high speed VLSI design.
Haroon-ur-RASHEED is a senior faculty member at the Department of Electrical Engineering, Pakistan Institute of Engineering and Applied Sciences (PIEAS). He received his Bachelor's degree in Electrical Engineering from the University of Engineering and Technology (UET) Lahore in 1990 and completed his Master's degree at Iowa State University (ISU), USA. He completed his Ph.D. at the School of Computer Science & Technology, Beijing Institute of Technology, in 2009. At PIEAS he has taught courses in “Digital Design in Verilog”, “Digital Communication”, “Microprocessor based Design” and “Digital Image Processing”. He has supervised many final year projects at NUST as well as PIEAS during his career as a teacher. He and his students published a number of research papers in international conferences before he started his Ph.D. studies at BIT. He is also a professional member of IEEE.
Waqas Ahmed TOOR received his Bachelor's degree in Electrical Engineering from NUST in 2003 and his MS degree in System Engineering from PIEAS in 2006. He is now pursuing an MS leading to Ph.D. in Electrical Engineering at the Capital University of Science and Technology (CUST). For the last ten years, he has been working in an R&D organization as a Senior Engineer. His research interests are high power microwave devices and their applications, power electronics and radar signal processing.
 
