Interference Mitigation for WCDMA using QR Decomposition and a CORDIC-based Reconfigurable Systolic Array by Scheibler, Robin et al.
SYSTEM IP CORE LABORATORY,
NEC CORPORATION
INTERNSHIP REPORT.
Interference Mitigation for WCDMA Using QR Decomposition and a
CORDIC-based Reconfigurable Systolic Array
Robin SCHEIBLER†, James OKELLO††, Katsutoshi SEKI††, Tomoyoshi KOBORI††,
and Masao IKEKAWA††
† Swiss Federal Institute of Technology, Lausanne, Switzerland
†† System IP Core Laboratory, NEC Corporation
Abstract This paper presents the implementation and performance of QR Decomposition based Recursive
Least-Squares (QRD-RLS) for interference mitigation in Wideband CDMA (WCDMA). The implementation is
carried out on CORSAEngine, a new Software-Defined Radio (SDR) processor developed by NEC Corporation
and highly optimized for MIMO-OFDM systems. It is shown how QRD-RLS can be mapped onto its rectangular
CORDIC-based reconfigurable systolic array, hence demonstrating its capability to process WCDMA. In
addition, the performance of CORSAEngine is compared to that of other architectures and it is found to
achieve at least 91% of the performance of dedicated hardware in terms of computational density.
Key words Software-Defined Radio, QR Decomposition, Wideband CDMA, Interference Mitigation
1. Introduction
In 1991, Mitola [15] introduced the concept of Software-Defined Radio (SDR), which allows different
modes of communication systems to operate on a single piece of hardware, dramatically decreasing
equipment costs and the development time of new technologies. While programmability is attractive to
mobile communication equipment manufacturers and operators, it also brings one of the biggest
challenges of SDR: the need to maintain high performance while retaining enough flexibility to process
as many different standards as possible. This constraint becomes even more difficult to fulfill as
modern communication standards require more complex signal processing technology.
In the field of cellular communications, such modern standards are usually referred to as Beyond 3G
(B3G) technologies. It has been recognized that B3G systems, already exemplified by WiMAX and 3GPP LTE
among others, will rely heavily on Orthogonal Frequency Division Multiplexing (OFDM) and Multiple-Input
Multiple-Output (MIMO) technologies [18]. At the same time, it is important for an SDR to support
non-OFDM-based standards like IS-95, CDMA2000 and WCDMA. Firstly, those systems enjoy a very deep
market penetration and are likely to remain in use for many years. Secondly, WCDMA has the potential to
be used in conjunction with an OFDM scheme, as in Multi-Carrier Code Division Multiple Access
(MC-CDMA).
Recently, NEC Corporation developed CORSAEngine, a new SDR processor highly optimized for MIMO-OFDM
systems [17]. Its rectangular COordinate Rotation DIgital Computer (CORDIC) based reconfigurable
systolic array makes it highly suitable for the computationally intensive baseband algorithms required
by those systems, among others QR Decomposition (QRD), Singular Value Decomposition (SVD),
least-squares fit and the fast Fourier transform. However, the performance of efficient interference
mitigation algorithms for WCDMA had not been investigated on this processor.
This paper presents the implementation and the performance of QR Decomposition based Recursive
Least-Squares (QRD-RLS) for interference mitigation of WCDMA on CORSAEngine. QRD-RLS has been shown to
effectively mitigate both Intersymbol Interference (ISI) and Multiple Access Interference (MAI),
outperforming the conventional Rake while maintaining reasonable complexity when implemented as a
systolic array [14]. In this paper, it is shown how this arbitrarily large systolic array can be split
into parts that fit on the reduced-size array of CORSAEngine. Through reconfigurability, it is
furthermore possible to run those different parts successively on the same hardware structure.
The remainder of this paper is organized as follows. Section 2 gives a brief review of conventional
and QRD-RLS based interference mitigation for WCDMA along with simulation results and a computational
load comparison of those two methods. In Section 3, the architecture of CORSAEngine is described. The
mapping of QRD-RLS onto the systolic array is described in Section 4. Finally, in Section 5, the
performance of the implementation of WCDMA on CORSAEngine is assessed and benchmarked against other
devices. Section 6 concludes this paper.

Fig. 1 Block diagram of the conventional Rake receiver. (Blocks shown: CPICH matched filter, path
search, channel estimation, and user correlator fingers 1 to K; input $r(n)$, output $\hat{d}_u(n)$.)

Fig. 2 Block diagram of WCDMA receiver based on QRD-RLS. (Blocks shown: CPICH matched filter, user
matched filter, QRD-RLS weight calculation and combiner; input $r(n)$, despread signals $\hat{r}_p(n)$
and $\hat{r}_u(n)$, weight vector $w$, output $\hat{d}_u(n)$.)
2. Interference Mitigation in WCDMA
2.1 Conventional Interference Mitigation
Conventional interference mitigation for WCDMA is characterized by the Rake receiver shown in Fig. 1.
It uses short-time averaging (typically two slots) of the received pilot symbols to estimate the
channel characteristics. Then, long-time averaging (about one frame) is used to obtain a good power
delay profile of the channel. The Path Search uses a threshold-based algorithm to select the paths
with a sufficiently large Signal-to-Noise Ratio (SNR). Those paths are despread using a bank of
correlators and combined according to the Maximum Ratio Combining (MRC) principle with respect to the
channel coefficients. For different channel estimation and path search algorithms, refer to [8], [9].
For more details about the principles of the Rake receiver, refer to [16].
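As a minimal illustration of the MRC step, the sketch below weights each finger output by the complex conjugate of its channel coefficient and sums the results. The function name `mrc_combine`, the two-path channel and the test symbol are hypothetical values chosen for illustration only; channel estimation and path search are assumed to have been carried out beforehand.

```python
def mrc_combine(finger_outputs, channel_coeffs):
    """Maximum Ratio Combining: sum of conj(h_k) * y_k over the fingers."""
    return sum(h.conjugate() * y for h, y in zip(channel_coeffs, finger_outputs))


# Hypothetical two-path channel and a noise-free QPSK-like symbol.
h = [1 + 1j, 0.5j]
d = 1 - 1j
y = [hk * d for hk in h]                 # ideal despread finger outputs

combined = mrc_combine(y, h)
gain = sum(abs(hk) ** 2 for hk in h)     # total channel energy
d_hat = combined / gain                  # normalized symbol estimate
```

In this noise-free case the normalized estimate `d_hat` equals the transmitted symbol; with noise, the conjugate weighting favors the fingers with the strongest paths.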
2.2 QRD-RLS Interference Mitigation
This section describes QRD-RLS based interference mitigation applied to WCDMA. A block diagram of the
receiver considered is shown in Fig. 2. First, the Common Pilot CHannel (CPICH) and the signal of the
user of interest are despread using Matched Filters (MF) corresponding to their respective spreading
codes. The despread pilot signal $\hat{r}_p(n)$ is then sent to the QRD-RLS weight calculation unit,
which produces the optimal weight vector $w$. This vector is then sent to the combiner and used to
combine the despread user signal $\hat{r}_u(n)$.
In the WCDMA system, the despread signal can be written as in [6]:
$$\hat{r}(n) = \sigma_l d(i) + I(n) + \xi(n), \tag{1}$$
where the time index $n = iF + l$ with $i \in \mathbb{N}$ and $l \in \{0, \ldots, F-1\}$, $F$ is the
spreading factor, $\sigma_l$ is a multiplicative coefficient introduced by the channel impulse response
and the spreading code autocorrelation function and $d(i)$ is the $i$th symbol sent. $I(n)$ is an
interference term created by the ISI and the MAI. $\xi(n)$ is the filtered noise. Let us define
$u(i) = [\hat{r}(iF), \ldots, \hat{r}(iF+M-1)]^T$, a vector containing the first $M$ chips of the
despread signal corresponding to the $i$th symbol sent. The goal is then to find the optimal weight
vector $w(m) = [w_0(m), \ldots, w_{M-1}(m)]^T$ to combine the elements of
$u_u(i) = [\hat{r}_u(iF_u), \ldots, \hat{r}_u(iF_u+M-1)]^T$ in order to enhance the symbol $d_u(i)$ and
reduce the interference signal $I(n)$, $m$ being the number of symbols received so far. Subscripts $p$
and $u$ are used to distinguish between pilot and user signals.
QRD-RLS is a technique borrowed from adaptive filtering theory [11]. To adaptively calculate $w(m)$,
it attempts to minimize the following error function:
$$E(m) = \| \Lambda(m) (A(m) w(m) - d(m)) \|, \tag{2}$$
where $A(m) = [u_p(0), \ldots, u_p(m)]^H$ contains the received pilot signal,
$\Lambda(m) = \mathrm{diag}(\lambda^{m/2}, \ldots, \lambda^{1/2}, 1)$ the exponentially decreasing
forgetting factor and $d(m) = [d_p(0), \ldots, d_p(m)]^H$ the original pilot symbols.
Minimizing Eq. (2) can be done by multiplying $\Lambda(m) [A(m) \;\; d(m)]$ by a unitary matrix $Q(m)$:
$$Q(m) \Lambda(m) [A(m) \;\; d(m)] = \begin{bmatrix} R(m) & p(m) \\ 0 & v(m) \end{bmatrix}, \tag{3}$$
where $R(m)$ is an $M \times M$ upper triangular matrix, $p(m)$ is a vector of length $M$, $0$ is an
$(m-M) \times M$ null matrix and $v(m)$ is a vector of length $m-M$. The least-squares estimate of
$w(m)$ is then given by:
$$w(m) = R^{-1}(m) p(m). \tag{4}$$
Once $w(m)$ has been calculated, it is used to combine the signal of the user:
$$\hat{d}_u(i) = w^H(m) u_u(i). \tag{5}$$
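The chain from Eq. (2) to Eq. (5) can be sketched in a few lines of plain Python. This is an illustrative sketch, not the implementation used in the paper: it triangularizes the weighted matrix $\Lambda(m)[A(m)\;d(m)]$ with complex Givens rotations, back-substitutes to obtain $w$, and combines a signal vector with $w^H$. The helper names (`qr_least_squares`, `combine`, `cgauss`), the random noise-free pilot data and the value of the forgetting factor are assumptions made for the example.

```python
import math
import random


def qr_least_squares(A, d, lam=1.0):
    """Minimize ||Lambda (A w - d)|| via Givens triangularization (Eqs. 2-4)."""
    m, n = len(A), len(A[0])
    # Weighted augmented matrix Lambda [A d]; the newest (last) row has weight 1.
    T = []
    for i in range(m):
        g = lam ** ((m - 1 - i) / 2)
        T.append([g * complex(x) for x in A[i]] + [g * complex(d[i])])
    # Unitary triangularization: zero every element below the diagonal (Eq. 3).
    for col in range(n):
        for row in range(col + 1, m):
            a, b = T[col][col], T[row][col]
            norm = math.hypot(abs(a), abs(b))
            if norm == 0.0:
                continue
            c, s = a / norm, b / norm
            for k in range(col, n + 1):
                x, y = T[col][k], T[row][k]
                T[col][k] = c.conjugate() * x + s.conjugate() * y
                T[row][k] = -s * x + c * y
    # Back-substitution on the triangular system R w = p (Eq. 4).
    w = [0j] * n
    for i in reversed(range(n)):
        acc = T[i][n] - sum(T[i][j] * w[j] for j in range(i + 1, n))
        w[i] = acc / T[i][i]
    return w


def combine(w, u):
    """Eq. (5): d_hat = w^H u."""
    return sum(wj.conjugate() * uj for wj, uj in zip(w, u))


def cgauss():
    """Hypothetical test data: one complex Gaussian sample."""
    return complex(random.gauss(0, 1), random.gauss(0, 1))


random.seed(1)
M, n_pilots = 4, 12
w_true = [cgauss() for _ in range(M)]
U = [[cgauss() for _ in range(M)] for _ in range(n_pilots)]   # vectors u_p(i)
A = [[x.conjugate() for x in u] for u in U]                    # rows u_p(i)^H
d = [combine(w_true, u).conjugate() for u in U]                # entries d_p(i)^*
w = qr_least_squares(A, d, lam=0.95)
```

Because the pilot data here are noise-free and generated from `w_true`, the recovered `w` should match it up to rounding. The back-substitution step at the end of `qr_least_squares` is exactly what the extended algorithm of Eq. (6) is designed to avoid.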
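Using the Extended QRD-RLS algorithm described in [14], the recursion can be done by applying QRD to
the following extended $(M+2) \times (2M+2)$ matrix:
$$\begin{bmatrix} \tilde{R}(m+1) & \tilde{R}^{-H}(m+1) \\ 0 & v'(m+1) \end{bmatrix} = Q'(m+1) \begin{bmatrix} \Lambda' \tilde{R}(m) & (\Lambda')^{-1} \tilde{R}^{-H}(m) \\ \tilde{u}^H(m+1) & 0 \end{bmatrix}, \tag{6}$$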
Fig. 3 CORSAEngine architecture. (Blocks shown: memory bank made of sub memory banks, address
generator, control unit, context pointer, memory interface, external bus, and the array of CPE and DPE
elements.)
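where $\Lambda' = \mathrm{diag}(\lambda, \ldots, \lambda, 1)$,
$\tilde{u}^H(m+1) = [u_p^H(m+1) \;\; d_p^*(m+1)]$, $v'(m+1)$ is an auxiliary vector and
$$\tilde{R}(m) = \begin{bmatrix} R(m) & p(m) \\ 0 & \alpha(m) \end{bmatrix}, \tag{7}$$
where $\alpha(m)$ is a scalar. After the matrix $Q'$ has zeroed $\tilde{u}^H(m+1)$, a scaled version of
the new weight vector appears in $\tilde{R}^{-H}(m+1)$:
$$\tilde{R}^{-H}(m+1) = \begin{bmatrix} R^{-H}(m+1) & 0 \\ -\dfrac{w^H(m+1)}{\alpha(m+1)} & \dfrac{1}{\alpha(m+1)} \end{bmatrix}. \tag{8}$$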
This method has the advantage of avoiding back-
substitution which can be very time-consuming if it has to
be performed frequently.
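The recursion of Eq. (6) and the weight extraction of Eq. (8) can be checked numerically with the following sketch. It is not the CORSAEngine implementation: the initialization $\tilde{R}(0) = \delta I$, the values of `delta` and `lam`, the helper names and the noise-free random pilots are all assumptions made for illustration. The sketch updates $[\tilde{R} \;\; \tilde{R}^{-H}]$ one input row at a time with Givens rotations and compares the weights read from the last row of $\tilde{R}^{-H}$ with those obtained by back-substitution on $Rw = p$.

```python
import math
import random


def extended_update(Rext, u_tilde_h, lam):
    """One step of Eq. (6) on Rext = [R~  R~^{-H}], modified in place."""
    n = len(Rext)                                   # n = M + 1
    # Lambda' scales all rows but the last; the inverse block gets 1/lam.
    for i in range(n - 1):
        for k in range(2 * n):
            Rext[i][k] *= lam if k < n else 1.0 / lam
    x = [complex(v) for v in u_tilde_h] + [0j] * n  # input row [u~^H  0]
    for i in range(n):                              # one Givens rotation per row
        a, b = Rext[i][i], x[i]
        norm = math.hypot(abs(a), abs(b))
        if norm == 0.0:
            continue
        c, s = a / norm, b / norm
        for k in range(2 * n):
            r, y = Rext[i][k], x[k]
            Rext[i][k] = c.conjugate() * r + s.conjugate() * y
            x[k] = -s * r + c * y
    return Rext


def weights_from_inverse(Rext):
    """Eq. (8): the last row of R~^{-H} is [-w^H/alpha, 1/alpha]."""
    n = len(Rext)
    last = Rext[n - 1][n:]
    return [(-v / last[n - 1]).conjugate() for v in last[: n - 1]]


def weights_by_back_substitution(Rext):
    """Eq. (4): solve R w = p using the left block only."""
    M = len(Rext) - 1
    w = [0j] * M
    for i in reversed(range(M)):
        acc = Rext[i][M] - sum(Rext[i][j] * w[j] for j in range(i + 1, M))
        w[i] = acc / Rext[i][i]
    return w


def cgauss():
    """Hypothetical test data: one complex Gaussian sample."""
    return complex(random.gauss(0, 1), random.gauss(0, 1))


random.seed(2)
M, delta, lam = 3, 1e-3, 0.95                       # assumed parameters
n = M + 1
Rext = [[0j] * (2 * n) for _ in range(n)]
for i in range(n):
    Rext[i][i] = complex(delta)                     # assumed start R~(0) = delta I
    Rext[i][n + i] = complex(1.0 / delta)           # matching R~^{-H}(0)
w_true = [cgauss() for _ in range(M)]
for _ in range(60):
    u = [cgauss() for _ in range(M)]
    d_p = sum(wj.conjugate() * uj for wj, uj in zip(w_true, u))   # d_p = w^H u_p
    extended_update(Rext, [x.conjugate() for x in u] + [d_p.conjugate()], lam)
w_inv = weights_from_inverse(Rext)
w_bs = weights_by_back_substitution(Rext)
```

Both extraction routes should agree up to rounding, and with noise-free pilots both should approach `w_true`. Reading the weights from the last row costs only one division per tap, which is the reason the extended form is attractive when weights are needed after every input row.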
3. CORSAEngine Architecture
CORSAEngine's architecture is composed of a 2-
dimensional array of processing nodes (PN), a control unit,
a memory bank and an address generator which controls the
algorithms running on the array. The work presented here
was realized on a scaled-down architecture represented in
Fig. 3.
This scaled-down version of CORSAEngine has a 2-by-5 array of PNs. Each PN is composed of two CORDIC
Processing Elements (CPE) and two Delay Processing Elements (DPE). The CPEs implement the unfolded
CORDIC algorithm, which allows pipelining. The pipeline is used to implement interleaved threads.
Different data sets or even completely unrelated algorithms can be executed in the different threads.
The data types supported by the processor are real and complex numbers and rotation angles, which are
a subset of real numbers. A complex number is the concatenation of two real numbers. A 20-bit
floating-point format, consisting of a 16-bit mantissa and a 4-bit exponent, is used for real numbers.
The control of the operations on the array is done by a context pointer, which is attached to the data
by the memory interface when it is sent from the memory to the array. Every CPE and DPE possesses an
instruction table linking a context pointer to the operation to be done with the incoming data and the
destination of the result. The result can be sent to any neighboring PN. PNs have horizontal and
diagonal connections. A horizontal connection can hold one complex or two real numbers, while a
diagonal connection is limited to one real number. As a result, complex data flows can be created in
the array, giving an efficient and flexible way to implement systolic algorithms.
4. Implementation of QRD-RLS
In this section, the implementation of the Extended QRD-RLS algorithm on the array of CORSAEngine is
described. An example of the Extended QRD-RLS systolic array for a 3-tap weight vector is given in
Fig. 4. Each non-zero complex-valued coefficient of the matrix
$\tilde{R}_{ext} = [\tilde{R}(m) \;\; \tilde{R}^{-H}(m)]$ is represented by one cell. This cell holds
the coefficient value in its register. Note that the coefficient in the bottom-right corner of the
matrix is not needed and hence does not require a cell.
4.1 Cell Operations
Two main types of cell can be seen. Border cells are placed on the left diagonal and produce the
Givens rotation required to nullify the input. Inner cells apply this rotation to their own input and
register value. One more distinction can be made between cells holding the coefficients of $R$, $p$ or
$R^{-H}$ and the last row containing the scaling factor $\alpha$ and the scaled weights $-w/\alpha$.
The former must multiply the coefficient they hold by the forgetting factor between every two inputs,
while the latter do not.
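A behavioral sketch of the two cell types, using ordinary floating-point arithmetic rather than CORDIC hardware, may help fix ideas. It splits the complex Givens rotation into two real stages: one angle removes the phase of the input and the other rotates the register against the remaining magnitude. Calling these angles `theta` and `phi` is an assumption about how the paper's two stages map onto these steps, and the function names and test values are hypothetical. The forgetting factor is left out.

```python
import cmath
import math


def border_cell(r, u_in):
    """Vectoring: r is the real register value, u_in the complex input."""
    theta = cmath.phase(u_in)            # stage 1: phase of the input
    mag = abs(u_in)
    phi = math.atan2(mag, r)             # stage 2: angle that nulls |u_in| against r
    r_new = math.hypot(r, mag)           # updated register value
    return r_new, theta, phi


def inner_cell(r, u_in, theta, phi):
    """Rotation: apply the angles produced by the border cell of the same row."""
    u_rot = u_in * cmath.exp(-1j * theta)        # stage 1: remove the phase
    c, s = math.cos(phi), math.sin(phi)
    r_new = c * r + s * u_rot                    # stage 2: real Givens rotation
    u_out = -s * r + c * u_rot                   # passed down to the next row
    return r_new, u_out


# Hypothetical values: register 3, input 4j.
r_new, theta, phi = border_cell(3.0, 4j)
r_chk, u_chk = inner_cell(3.0, 4j, theta, phi)   # same pair through an inner cell
```

Feeding the border cell's own register and input through `inner_cell` reproduces the new register value and an output of zero, which is the nullification the border cell is meant to achieve. For any other pair the rotation preserves the combined energy of register and input.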
Fig. 5 describes how the operations of the cells composing the array can be implemented using CORDIC
units in vectoring (VEC), rotation (ROT) and multiplication mode. The two stages of the complex Givens
rotation are referred to as $\theta$-VEC/ROT and $\phi$-VEC/ROT. In the cells of normal rows, the
forgetting factor $\lambda$ must be applied to the register value after every input. However, as the
input of the $\phi$-VEC/ROT depends on the output of the CORDIC unit applying the forgetting factor,
those two operations cannot be pipelined. As a result, the $\phi$-VEC/ROT can only operate every two
cycles. If the same CORDIC is used for both the $\phi$-VEC/ROT and the multiplication by $\lambda$, it
is fully utilized. On the other hand, the CORDIC used for the $\theta$-VEC/ROT will only be used every
two cycles, thus wasting half of this resource. As a solution, the same CORDIC can be time-shared by
two adjacent cells for their $\theta$-VEC/ROT, as illustrated in Fig. 6. The left cell first receives
its input and the angle $\theta$, applies the latter to the former and sends the result down to its
$\phi$-VEC/ROT unit. However, the angle $\theta$ is not sent further but stored in a register of the
CORDIC unit. In the next cycle, the same CORDIC unit receives only the input of the right cell. It
then reuses the stored angle to rotate the input before sending the result down to the $\phi$-ROT unit
of the right cell. This time, $\theta$ is not stored but sent to the next cell on the right.

Fig. 4 A systolic array for the production of a 3-tap weight vector using Extended QRD-RLS. Border
cells doing complex vectoring and inner cells doing complex rotation are represented respectively as
round and square cells. A distinction is made between cells that must apply the forgetting factor, in
white, and the ones that do not, in gray.
As the cells of the last row do not apply the forgetting factor, the two CORDIC operations can be
fully pipelined. Therefore, successive cells can be connected to each other in a straightforward
manner and no time-sharing of CORDIC units is required. Moreover, as the registers of the cells
contain a scaled version of the desired weights and the scaling factor $\alpha$, it is possible, by
adding one multiplication to each cell, to scale the weights before they are output. The structure of
those cells is also illustrated in Fig. 5.
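For readers unfamiliar with the two CORDIC modes mentioned in this section, the following sketch shows the textbook shift-and-add iterations in floating point. It is a generic illustration and not a model of the CPEs: the iteration count `N_ITER`, the gain compensation by division and the half-plane pre-rotation are assumptions, and the fixed-point details of the actual hardware are not reproduced.

```python
import math

N_ITER = 16   # assumed iteration count; the source does not state it
ANGLES = [math.atan(2.0 ** -i) for i in range(N_ITER)]
GAIN = 1.0
for i in range(N_ITER):
    GAIN *= math.sqrt(1.0 + 2.0 ** (-2 * i))   # accumulated CORDIC gain


def cordic_vectoring(x, y):
    """Vectoring mode: rotate (x, y) onto the x-axis; return (magnitude, angle)."""
    z = 0.0
    if x < 0.0:                         # pre-rotate by pi into the right half-plane
        z = math.pi if y >= 0.0 else -math.pi
        x, y = -x, -y
    for i in range(N_ITER):
        d = -1.0 if y > 0.0 else 1.0    # always rotate towards y = 0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * ANGLES[i]
    return x / GAIN, z


def cordic_rotation(x, y, angle):
    """Rotation mode: rotate (x, y) counterclockwise by angle (|angle| < ~1.74 rad)."""
    z = angle
    for i in range(N_ITER):
        d = 1.0 if z >= 0.0 else -1.0   # drive the residual angle to zero
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * ANGLES[i]
    return x / GAIN, y / GAIN


mag, ang = cordic_vectoring(3.0, 4.0)              # approximately (5, atan2(4, 3))
xr, yr = cordic_rotation(1.0, 0.0, math.pi / 3)    # approximately (0.5, 0.866)
```

Vectoring mode is what a border cell needs (it yields a magnitude and an angle), while rotation mode is what an inner cell needs (it applies a given angle). Each iteration uses only shifts and additions in hardware, which is why the same unit can serve both roles.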
Fig. 5 The CORDIC implementation of the different cells composing a systolic array for Extended
QRD-RLS. The multiplications present are also implemented with CORDIC units using a multiplication
opcode. The values $r$ and $\lambda$ are contained in the registers of the CORDIC units.

4.2 Partitioning
Now that the cell operations have been mapped to CORDIC units, it is possible to use them to construct
a full-size array for the production of an $M$-tap weight vector. Such an array has $M^2 + 3M + 1$
cells, each using from 3 to 5 CORDIC units depending on its type. Since this exceeds the size of the
PN array of CORSAEngine, the array has to be divided into smaller partitions that are run successively
on the PN array. Because of the strong vertical dependency in the Extended QRD-RLS array, it is first
divided into rows, each row having $M + 2$ cells except the last one, which has $M + 1$ cells. To make
it fit on the PN array, these rows still have to be subdivided into segments of a few cells, as shown
in Fig. 7. Each of these segments contains 7 cells for a normal row and 3 cells for the last row.
Considering a single row, there are two types of segments: one with a border cell at the beginning and
one containing only inner cells, respectively referred to as border and inner segments. In conclusion,
there are 4 partition types, T, X, Y and Z, with T and X referring respectively to the border and
inner segments of a normal row and Y and Z to the border and inner segments of the last row. Each
partition type is implemented on the array as a specific context pointer.
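The cell count and the row-by-row segmentation described above can be tabulated with a short sketch. It assumes that every segment is filled to capacity except possibly the last one of a row; how partially filled segments are actually handled on the array is not stated in the text. The function names are hypothetical.

```python
import math


def cell_count(M):
    """Total number of cells: M rows of M + 2 cells plus a last row of M + 1."""
    return M * M + 3 * M + 1


def partition_schedule(M, normal_seg=7, last_seg=3):
    """List, row by row, the partition types run on the PN array."""
    schedule = []
    n_normal = math.ceil((M + 2) / normal_seg)
    for _ in range(M):                              # normal rows: border then inner
        schedule.append(["T"] + ["X"] * (n_normal - 1))
    n_last = math.ceil((M + 1) / last_seg)
    schedule.append(["Y"] + ["Z"] * (n_last - 1))   # last row
    return schedule


sched = partition_schedule(19)
total_runs = sum(len(row) for row in sched)
```

For $M = 19$, each normal row of 21 cells fills exactly three 7-cell segments (T, X, X), and the last row of 20 cells needs seven 3-cell segments (Y followed by six Z), for 419 cells and 64 partition runs in total under these assumptions.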
To run the complete algorithm, it is first assumed that the matrix $\tilde{R}_{ext}$, as well as the
$N$ new pilots received along with their local copies in the form of the matrix
$$U = \begin{bmatrix} \tilde{u}^H(m+1) \\ \vdots \\ \tilde{u}^H(m+N) \end{bmatrix}, \tag{9}$$
are stored in the memory bank. A flowchart of the algorithm is represented in Fig. 8. The first 7
coefficients of the first row of $\tilde{R}_{ext}$ are loaded into the registers of the appropriate
CORDIC units. Then the first 7 columns of $U$ are processed through the array configured as partition
T. The processed columns, the modified coefficients of $\tilde{R}_{ext}$ and the angles produced are
stored back into memory. The next 7 coefficients of $\tilde{R}_{ext}$ are now loaded into the
registers of the appropriate CORDIC units and the next 7 columns of $U$ are processed, this time using
a partition type X configuration and the angles produced by partition T. Processed columns and
register values are sent back to memory at the end of the execution.
This step is repeated until all columns of $U$ have been processed. After this, a new matrix $U'$ has
replaced $U$ in the memory. For the second row of the array, the whole process is repeated using $U'$
and the second row of $\tilde{R}_{ext}$. Eventually, all the rows of the array are processed in the
same way, except that for the last row, types Y and Z replace types T and X and only 3 columns are
processed at a time. The outputs of the last row are the $N$ weight vectors corresponding to the $N$
rows of the input matrix $U$.

Fig. 7 The four partition types created: types T and X are the border and inner partitions of a normal
row, types Y and Z those of the last row. With those four types, it is possible to process different
array sizes on the same rectangular systolic array.

Method     MF and Comb.   PS/QRD-RLS    Total
RAKE       200 MFLOPS     0.2 MFLOPS    200 MFLOPS
QRD-RLS    250 MFLOPS     290 MFLOPS    540 MFLOPS
Table 1 Comparison of the computational load of QRD-RLS based interference mitigation with the
conventional Rake receiver.
5. Performance Evaluation
5.1 Simulation Results
Simulations of a WCDMA downlink were carried out to determine the necessary length $M$ of the weight
vector $w$. The simulated transceiver is compliant with current 3GPP standards [1], [2]. Perfect pulse
shaping, perfect synchronization and no power control were assumed. The channel model used is the
Typical Urban channel from [3]. Simulations were run with coherence times of 1 frame and then 5 slots,
corresponding respectively to speeds of 13 km/h and 40 km/h. As shown in Fig. 9, a length $M = 16$ is
found to be sufficient to outperform the Rake by as much as an order of magnitude at high
Signal-to-Noise Ratio.
5.2 Computational Load
To highlight the cost of the performance gain brought by QRD-RLS based interference mitigation, its
computational load is compared to that of the conventional Rake receiver. QRD-RLS uses the 10 and 8
pilots per slot, present respectively in CPICH and the user data channel [1], to compute a 16-tap
weight vector. A 10-finger Rake receiver is considered for comparison. It uses the method described in
[8] to obtain a channel estimate with a resolution of 16 paths. It is assumed that synchronization has
already been performed at this stage. The complexity of the QRD of an $m \times n$ matrix where
$m > n$ is given by $3n^2(m - n/3)$ [10]. As shown in Table 1, the complexity of the QRD-RLS based
method is more than 2.5 times that of the Rake. The main difference comes from the QRD-RLS algorithm,
which is computationally intensive compared to the insignificant amount of computation required by the
path search in the Rake. However, it is shown in the following sections that, using the implementation
introduced in Section 4, this complexity can easily be handled by CORSAEngine.

MF Pilot   MF Data   QRD-RLS   Combining   Total
4%         6.25%     1.15%     0.35%       11.75%
Table 2 The detail of the resource consumption of the different steps of QRD-RLS based interference
mitigation on CORSAEngine.
5.3 Resource Usage
In this section, the resource usage of QRD-RLS is calculated. As shown in Section 5.1, a 16-tap weight
vector is sufficient to efficiently mitigate interference. Using the implementation described in
Section 4.2, it is possible to construct an array for the calculation of a 19-tap weight vector, whose
rows of $M + 2 = 21$ cells fill exactly three 7-cell segments and which is therefore sufficient to
efficiently mitigate the interference. Taking into account the matched filtering of the pilot and data
channels as well as the combining, QRD-RLS interference mitigation for WCDMA consumes 11.75% of the
resources of the scaled-down version of CORSAEngine. The resource consumption of the different blocks
of the interference mitigation is detailed in Table 2.
5.4 Benchmark
The performance of the implementation of QRD-RLS on CORSAEngine is now compared to other
implementations on different architectures. The architectures considered for comparison are two
dedicated hardware designs for QRD-RLS, based respectively on Altera Stratix [5] and Xilinx Virtex-4
[7] FPGAs, and an Application Specific Instruction set Processor (ASIP) for matrix computations (QRD,
SVD) using an array of modified CORDIC units [13].
The performance metric used to compare those architectures is the computational density, defined as
$$\rho_{m \times n} = \frac{1}{t_{m \times n} \times \sum_i u_i \times A_i}, \tag{10}$$
where $t_{m \times n}$ is the processing time for a complex matrix of size $m \times n$ in seconds
[s], and $A_i$ and $u_i$ are respectively the chip area in [Kgates] and a resource utilization factor.
The index $i$ accounts for architectures with totally independent parts. To make a fair comparison,
the CORSAEngine implementation is adapted to the matrix sizes that were used for evaluating the
performance of the referenced architectures [5], [7], [13]. The performance of CORSAEngine is
furthermore used to normalize the results.
Fig. 6 An example of a border (vectoring) and an inner (rotation) cell on a normal row sharing a
CORDIC respectively for the vectoring and rotation of their inputs. The CORDIC is used in vectoring
mode during the odd cycles and in rotation mode during the even cycles. Dashed lines represent values
that are kept in a register.

Fig. 8 A run of the algorithm for a 19-tap weight vector. The gray rectangle represents the array of
CORSAEngine, the dashed line is for angles and scaling factor that return to memory. $U_{i-j}$ is the
matrix composed of the $i$th to the $j$th columns of $U$. The matrix $W$ output by the last row
contains all the weight vectors produced by the processing of the matrix $U$ through the Extended
QRD-RLS systolic array.

Fig. 9 Performance of QRD-RLS in quarter system load (4 users) with a spreading factor of 16. Panels:
(a) 13 km/h, (b) 40 km/h. Each panel plots uncoded BER against SNR (dB) for the Rake with 10 fingers
and for QRD-RLS with M = 16.
Table 3 shows the results of the benchmark. The area estimation of the dedicated hardware designs was
based on the number of lookup tables used in the FPGA design. The corresponding number of gates was
estimated according to the available literature [12], [19]. The Altera Stratix design uses two CORDIC
blocks for the QRD and the embedded Nios soft processor for the back-substitution. The performance
figures of the latest version of the Nios (Nios II) were used [4]. As the ASIP only handles
real-valued QRD-RLS, the fact that a 128 × 20 real-valued matrix can represent a 64 × 10
complex-valued matrix is used. For CORSAEngine, a utilization factor is introduced, as an input matrix
with 10 columns such as the ones used in the benchmark only uses 80.2% of the available resources.
The results of the benchmark show that dedicated hardware II (based on the Xilinx design) and the ASIP
reach only about 49% and 20%, respectively, of the computational density of CORSAEngine. Dedicated
hardware I (based on the Altera design), on the other hand, achieves 9% more computational density
than CORSAEngine. However, it should be noted that dedicated hardware I (as well as dedicated hardware
II and the ASIP) implements the weight extraction as back-substitution. It was assumed in this
benchmark that the weights are only extracted once, after the QRD has been done. The CORSAEngine
implementation, as it uses the Extended QRD-RLS described in Section 2.2, outputs one weight vector
after every input row in any case. It will therefore achieve better performance in terms of
interference mitigation when the coherence time of the channel is very short.

                         Ded. Hardware I [5]   Ded. Hardware II [7]   ASIP [13]   CORSAEngine   CORSAEngine
Clock frequency [MHz]    170                   250                    300         300           300
Matrix size              64 × 10               10 × 10                64 × 10     64 × 10       10 × 10
t_{m×n} [μs]             268.67                56.76                  7.04        10.63         2.89
A [gates]                33480                 95310                  7M          1150K         1150K
Utilization factor       100%                  100%                   100%        80.2%         80.2%
ρ [update/s/Kgates]      111.22                184.85                 20.29       102           375.17
Normalized to CORSA      109.04%               49.27%                 19.8%       100%          100%
Table 3 Performance of the different architectures in terms of the computational density ρ. The final
result is normalized to the performance of CORSAEngine to give a fair comparison when the matrix sizes
used are different.
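The computational density of Eq. (10) can be recomputed from the figures of Table 3 with the short sketch below. It treats every architecture as a single part (one term in the sum over $i$), which is an assumption, and the function name and dictionary layout are hypothetical. Areas are converted from gates to Kgates before use.

```python
def computational_density(t_us, area_kgates, utilization=1.0):
    """Eq. (10) for a single-part architecture: 1 / (t * u * A)."""
    return 1.0 / (t_us * 1e-6 * utilization * area_kgates)


# (processing time [us], area [Kgates], utilization factor), taken from Table 3.
table3 = {
    "Ded. Hardware I (64x10)": (268.67, 33.48, 1.0),
    "Ded. Hardware II (10x10)": (56.76, 95.31, 1.0),
    "ASIP (64x10)": (7.04, 7000.0, 1.0),
    "CORSAEngine (64x10)": (10.63, 1150.0, 0.802),
    "CORSAEngine (10x10)": (2.89, 1150.0, 0.802),
}
rho = {name: computational_density(*row) for name, row in table3.items()}

# Each design is normalized to the CORSAEngine run with the same matrix size.
ratio_hw1 = rho["Ded. Hardware I (64x10)"] / rho["CORSAEngine (64x10)"]
ratio_hw2 = rho["Ded. Hardware II (10x10)"] / rho["CORSAEngine (10x10)"]
ratio_asip = rho["ASIP (64x10)"] / rho["CORSAEngine (64x10)"]
```

With these inputs the recomputed densities agree with the table to within rounding, and the ratio for dedicated hardware I comes out close to 1.09, which is where the figure of at least 91% (roughly 1/1.09) for CORSAEngine originates.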
6. Conclusion
In this paper, a new implementation of QRD-RLS interference mitigation for WCDMA on CORSAEngine has
been presented. First, the necessary complex Givens operations were mapped to the available CORDIC
units in a way that maximizes the utilization of resources. Then the Extended QRD-RLS systolic array
was split into manageable sizes that fit on the PN array of CORSAEngine. Simulations were furthermore
used to determine the necessary size of the weight vector to be about 16 taps, which the 19-tap array
of the implementation covers. Finally, the performance of this implementation was compared to other
available architectures for QRD-RLS and it was shown to achieve at least 91% of the dedicated hardware
performance in terms of computational density. In conclusion, CORSAEngine was shown to be able to
handle computationally intensive but efficient interference mitigation algorithms for WCDMA using only
11.75% of its resources.
Acknowledgments This work was realized between March 2007 and January 2008 while the first author was
an internship student at the System IP Core Laboratory, NEC Corporation.
References
[1] 3GPP TS 25.211 V7.1.0, "Physical channels and mapping of transport channels onto physical
channels (FDD)," 2007.
[2] 3GPP TS 25.213 V7.1.0, "Spreading and modulation (FDD)," 2007.
[3] 3GPP TS 25.943 V6.0.0, "Deployment aspects," 2004.
[4] Altera, "Nios II performance benchmark," Altera Data Sheet, 2007.
[5] D. Boppana, K. Dhanoa, and J. Kempa, "FPGA based embedded processing architecture for the QRD-RLS
algorithm," Field-Programmable Custom Computing Machines, 2004. FCCM 2004. 12th Annual IEEE Symposium
on, pp.330-331, 20-23 April 2004.
[6] G. Bottomley, T. Ottosson, and Y.P. Wang, "A generalized rake receiver for interference
suppression," Selected Areas in Communications, IEEE Journal on, vol.18, no.8, pp.1536-1545, Aug 2000.
[7] C. Dick, F. Harris, M. Pajic, and D. Vuletic, "Real-Time QRD-Based Beamforming on an FPGA
Platform," Signals, Systems and Computers, 2006. ACSSC '06. Fortieth Asilomar Conference on,
pp.1200-1204, Oct.-Nov. 2006.
[8] S. Fukumoto, K. Okawa, K. Higuchi, M. Sawahashi, and F. Adachi, "Path search performance and its
parameter optimization of pilot symbol-assisted coherent Rake receiver for W-CDMA mobile radio," IEICE
Trans. Fundamentals, vol.E83-A, no.11, pp.2110-2119, November 2000.
[9] S. Fukumoto, M. Sawahashi, and F. Adachi, "Matched filter-based Rake combiner for wideband DS-CDMA
mobile radio," IEICE Trans. Commun., vol.E81-B, no.7, pp.1384-1391, July 1998.
[10] G.H. Golub and C.F. Van Loan, Matrix Computations, 3 ed., Johns Hopkins, 1996.
[11] S. Haykin, Adaptive Filter Theory, 4 ed., Prentice Hall, 2002.
[12] H. Krupnova and G. Saucier, "FPGA technology snapshot: current devices and design tools," Rapid
System Prototyping, 2000. RSP 2000. Proceedings. 11th International Workshop on, pp.200-205, 2000.
[13] Z. Liu, K. Dickson, and J. McCanny, "Application-specific instruction set processor for SoC
implementation of modern signal processing algorithms," Circuits and Systems I: Regular Papers, IEEE
Transactions on, vol.52, no.4, pp.755-765, April 2005.
[14] T.Z. Mingqian, A.S. Madhukumar, and F. Chin, "QRD-RLS Adaptive Equalizer and its CORDIC-based
Implementation for CDMA Systems," International Journal on Wireless & Optical Communications, vol.1,
no.1, pp.25-39, 2003.
[15] J. Mitola III, "Software radios-survey, critical evaluation and future directions," Telesystems
Conference, 1992. NTC-92., National, pp.13/15-13/23, 19-20 May 1992.
[16] A.F. Molisch, Wireless Communications, IEEE Press, 2005.
[17] K. Seki, T. Kobori, J. Okello, and M. Ikekawa, "A CORDIC-Based Reconfigurable Systolic Array
Processor for MIMO-OFDM Wireless Communications," Signal Processing Systems, 2007 IEEE Workshop on,
pp.639-644, 17-19 Oct. 2007.
[18] M. Steer, "Beyond 3G," Microwave Magazine, IEEE, vol.8, no.1, pp.76-82, Feb. 2007.
[19] Xilinx, "An alternate capacity metric for LUT-based FPGAs," Xilinx Application Brief, 1997.
