Systolic Arrays for Lattice-Reduction-Aided MIMO Detection by Wang, Ni-Chun et al.
arXiv:1101.3698v1 [cs.AR] 17 Jan 2011
TO BE APPEARED IN JOURNAL OF COMMUNICATIONS AND NETWORKS 1
Systolic Arrays for Lattice-Reduction-Aided
MIMO Detection
Ni-Chun Wang, Ezio Biglieri, and Kung Yao
Abstract—Multiple-input, multiple-output (MIMO) technology
provides high data rate and enhanced QoS for wireless com-
munications. Since the benefits from MIMO result in a heavy
computational load in detectors, the design of low-complexity
sub-optimum receivers is currently an active area of research.
Lattice-reduction-aided detection (LRAD) has been shown to be
an effective low-complexity method with near-ML performance.
In this paper we advocate the use of systolic array architectures
for MIMO receivers, and in particular we exhibit one of them
based on LRAD. The “LLL lattice reduction algorithm” and
the ensuing linear detections or successive spatial-interference
cancellations can be located in the same array, which is con-
siderably hardware-efficient. Since the conventional form of the
LLL algorithm is not immediately suitable for parallel processing,
two modified LLL algorithms are considered here for the systolic
array. LLL algorithm with full-size reduction (FSR-LLL) is one
of the versions more suitable for parallel processing. Another
variant is the all-swap lattice-reduction (ASLR) algorithm for
complex-valued lattices, which processes all lattice basis vectors
simultaneously within one iteration. Our novel systolic array can
operate both algorithms with different external logic controls.
In order to simplify the systolic array design, we replace the
Lovász condition in the definition of LLL-reduced lattice with
the looser Siegel condition. Simulation results show that for LR-
aided linear detections, the bit-error-rate performance is still
maintained with this relaxation. Comparisons between the two
algorithms in terms of bit-error-rate performance and average
FPGA processing time in the systolic array show that ASLR
is a better choice for a systolic architecture, especially for
systems with a large number of antennas.
Index Terms—Lattice reduction, MIMO receivers, systolic
arrays, wireless communications.
I. INTRODUCTION
MULTIPLE-INPUT, multiple-output (MIMO) technology, using several transmit and receive antennas in a
rich-scattering wireless channel, has been shown to provide
considerable improvement in spectral efficiency and channel
capacity [1]. MIMO systems yield spatial diversity gain, spa-
tial multiplexing gain, array gain, and interference reduction
over single-input single-output (SISO) systems [2]. However,
these benefits come at the price of a computational complexity
N.C. Wang, E. Biglieri, and K. Yao are with the Electrical Engineering
Department, University of California-Los Angeles, Los Angeles, CA 90095,
USA (Address: 56-125B Engineering IV Building, 420 Westwood Plaza, Los
Angeles, CA 90095, USA; e-mail: nichun@ee.ucla.edu; e.biglieri@ieee.org;
yao@ee.ucla.edu). The work of N.C. Wang was partially supported by
National Science Council, Taiwan (R.O.C.). (TMS-094-2-A-002). The work
of K. Yao was partially supported by NSF CENS program CCR-012, NSF
grant EF-0410438, and NSF grant DBI-0754247.
E. Biglieri is also with the Departament de Tecnologies de la Informació i
les Comunicacions, Universitat Pompeu Fabra, Barcelona, Spain. The work of
E.B. was supported by the Spanish Ministry of Education and Science under
Project CONSOLIDER-INGENIO 2010 CSD2008-00010 "COMONSENS".
of the detector that may be intolerably large. In fact, optimal
maximum-likelihood (ML) detection in large MIMO systems
may not be feasible in real-time applications as its complexity
increases exponentially with the number of antennas. Low-
complexity receivers, employing linear detection or successive
spatial-interference cancellation (SIC), are computationally
less heavy, and amenable to simple hardware implementa-
tion [3]–[5]. However, diversity and error-rate performance of
these low-complexity detectors are not comparable to those
achieved with ML.
Lattice-reduction-aided detection (LRAD), which combines
lattice reduction techniques with linear detections or SIC, has
been shown to yield some improvement on error-rate perfor-
mance [6]–[8]. Lenstra-Lenstra-Lovász (LLL) algorithm [9]
is the most widely used lattice reduction algorithm, and can
be applied to complex-valued lattices [10]. The performance
of complex LLL-aided linear detection in MIMO systems
was analyzed in [11]. LLL-based LRAD was also shown to
achieve full receiver diversity [12]. It was also shown that the
LR-aided minimum mean-square-error decoding achieves the
optimal diversity-multiplexing tradeoff [16]. When applied to
MIMO detection, the average complexity of LLL algorithm is
polynomial in the dimension of the channel matrix (the worst-
case complexity could be unbounded [13]). A fixed-complexity
LLL algorithm, which modifies the original version to allow
more robust early termination, has recently been proposed
in [17]. In LRAD, LLL algorithm need be performed only
when the channel state changes. If the channel change rate
is high, or if a large number of channel matrices need to be
processed, as in a MIMO-OFDM system, a high-throughput
algorithm and a corresponding implementation structure are
needed for real-time applications. To obtain this, we first
discuss two variants of LLL algorithm, suitably modified for
parallel processing. Second, we propose a novel systolic array
structure implementing the two modified LLL algorithms and
the ensuing detection methods.
A systolic array [18], [19] is a network of processing
elements (PE) which transfer data locally and regularly with
nearby elements and work rhythmically. In Fig. 1(a), a simple
two-dimensional systolic array is shown as an example. In
this case, the matrix operation D = A · B+C is calculated
by the systolic array, where A, B, C and D are 2 × 2
matrices. The operation of each PE is shown in Fig. 1(b).
The inputs of the systolic array, the entries of matrices A
and C, are pipelined in a slanted manner for proper timing.
Since all PEs can work simultaneously, the latency is shorter
than with a single processor system, and the results of D
are outputted in parallel. Systolic algorithms and the corre-
sponding systolic arrays have been designed for a number of
linear algebra algorithms, such as matrix triangularization [20],
matrix inversion [21] , adaptive nulling [22], recursive least-
square [23], [24], etc. An overview of systolic designs for
several computationally demanding linear algebra algorithms
for signal processing and communications applications was
recently published in [25]. While systolic arrays allow simple
parallel processing and achieve higher data rates without
the demand on faster hardware capabilities, the existence of
multiple PEs implies a higher cost of circuit area. Thus, time
efficiency is traded off with circuit area in hardware design.
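The data flow of the example in Fig. 1 can be illustrated with a small cycle-by-cycle simulation. The following is only a sketch under stated assumptions: the function name systolic_matmul_add, the choice of storing b_kj in the PE at row k and column j, the use of zeros as pipeline bubbles, and the exact skew schedule are illustrative and are not taken from the hardware described in this paper. Each PE applies the rule of Fig. 1(b), a_out = a_in and c_out = c_in + a_in · b.

```python
def systolic_matmul_add(A, B, C):
    """Simulate an n x n systolic array computing D = A*B + C.

    Assumed layout (illustrative): PE(k, j) stores B[k][j]; entries of A
    move left to right along row k, partial sums of C move top to bottom
    along column j. Inputs are skewed so that a[i][k] and the partial sum
    for d[i][j] meet in PE(k, j) at clock tick i + k + j.
    """
    n = len(A)
    a_reg = [[0] * n for _ in range(n)]   # a_out register of every PE
    c_reg = [[0] * n for _ in range(n)]   # c_out register of every PE
    D = [[None] * n for _ in range(n)]
    for t in range(3 * n - 2):            # last result leaves at tick 3n - 3
        new_a = [[0] * n for _ in range(n)]
        new_c = [[0] * n for _ in range(n)]
        for k in range(n):
            for j in range(n):
                if j == 0:                # left edge: skewed external input of A
                    i = t - k
                    a_in = A[i][k] if 0 <= i < n else 0
                else:                     # otherwise take the left neighbour's output
                    a_in = a_reg[k][j - 1]
                if k == 0:                # top edge: skewed external input of C
                    i = t - j
                    c_in = C[i][j] if 0 <= i < n else 0
                else:                     # otherwise take the upper neighbour's output
                    c_in = c_reg[k - 1][j]
                new_a[k][j] = a_in                      # a_out = a_in
                new_c[k][j] = c_in + a_in * B[k][j]     # c_out = c_in + a_in * b
        a_reg, c_reg = new_a, new_c
        for j in range(n):                # collect results at the bottom row
            i = t - (n - 1) - j
            if 0 <= i < n:
                D[i][j] = c_reg[n - 1][j]
    return D
```

In this sketch all PEs are updated from the previous tick's registers, which mimics simultaneous operation; a sequential processor would need n^3 multiply-accumulate steps, whereas the array finishes in 3n - 2 ticks at the cost of n^2 PEs.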
For the application we are advocating in this paper (MIMO
detectors), systolic arrays offer an attractive solution, as we
must cope with a high computational load while requiring
high throughput and real-time operation. Systolic arrays have
been previously suggested for MIMO applications. In [26], the
authors proposed a universal systolic array for adaptive and
conventional linear MIMO detectors. In [27], a reconfigurable
systolic array processor based on coordinate rotation digital
computer (CORDIC) [28] is proposed to provide efficient
MIMO-OFDM baseband processing. Also, matrix factoriza-
tion and inversion are widely used in MIMO detection, with
systolic arrays used to increase the throughput [5], [29].
In this paper, our objective is to provide a novel systolic
array design for LLL-based LRAD. The ideas are described
from a system-level perspective instead of detailed discussion
on the hardware-oriented issues. The system model and how
LRAD works are briefly described in Section II. Since the
original LLL algorithm [8]–[15] is not designed for parallel
processing, and hence is not suitable for systolic design, two
modified LLL algorithms are considered here (Section III).
Note that we do not claim that the two algorithms work
better than the original LLL in terms of LRAD bit-error-rate
(BER) performance. First, we improve on the format
of conventional LLL algorithm by altering the flow of size-
reduction process (we call it “LLL with full size-reduction,”
or FSR-LLL). FSR-LLL is more time-efficient in parallel
processing than the conventional format, and hence suitable
for systolic design. We also consider a variant of the LLL
algorithm called “all-swap lattice reduction (ASLR),” which
was first proposed in [30] for real lattices, and derive its
complex-number version. A crucial difference between ASLR
and LLL algorithm is that with ASLR all lattice basis vectors
are simultaneously processed during a single iteration. In both
algorithms, in order to simplify the systolic array operations
we replace the original Lovász condition [9] of LLL algorithm
with the slightly weaker Siegel condition [31]. Surprisingly, for
LR-aided linear detections the BER performance with Siegel
condition under the proper parameter setting is just as good as
the one using Lovász condition. However, for LR-aided SIC,
the performance with Lovász condition is still slightly better
due to less error propagation. The mapping from algorithm
to systolic array is introduced in Section IV. In our design,
ASLR and FSR-LLL can be operated on the same systolic-
array structure, but the external logic controller is also required
to control the algorithm flow. Additionally, since ASLR was
originally designed for parallel processing, a systolic array
running ASLR is on the average more efficient than one
Fig. 1. (a) Two-dimensional systolic array performing the matrix calculation D = A · B + C, where aij, bij, cij, dij are the (i, j) entries of the matrices A, B, C, and D. (b) The operation of each processing element: a_out = a_in, c_out = c_in + a_in · b.
Fig. 3. The empirical cumulative probability functions of the orthogonality defect for the 4 × 4 channel matrices under three different reduction algorithms (curves: LLL, ASLR, FSR-LLL, and no reduction; δ = .51, .75, .99).
Fig. 4. (a) The systolic array for the linear LRAD of a 4 × 4 MIMO system. (b) The operations of diagonal cells Dii and off-diagonal cells Oij in the systolic array, in data mode and in size-reduction mode. (“*” is an indicator bit used to control the flow of the algorithm, as explained in Section IV-A.)
running FSR-LLL. Simulation results also show that ASLR-
based LRAD has a BER performance very similar to that of
LLL algorithm. Comparison between our proposed design and
the conventional LLL in FPGA implementation shows that
the systolic arrays do provide faster processing speed with a
moderate increase of hardware resources. After the channel
state matrix has been lattice-reduced, linear detectors or SIC
can also be implemented by the same systolic array without
any extra hardware cost, which is discussed in Section V.
The following notations are used throughout the remain-
ing sections. Capital bold letters denote matrices, and lower
case bold letters denote column vectors. For example, X =
[x1,x2, · · · ,xm] is a matrix with m columns x1 to xm. The
entry of a matrix X at position (i, j) is denoted by xi,j , and
the kth element of a vector x is denoted by xk. The submatrix
(subvector) formed from the ath to bth rows and mth to
nth columns of X is denoted by Xa:b,m:n. The notations
(·)+, (·)T , (·)H and (·)† are used for conjugate, transpose,
Hermitian transpose, and Moore-Penrose pseudo-inverse of a
matrix, respectively. ‖x‖ is the Euclidean norm of the vector
x. ℜ(·) and ℑ(·) are the real and imaginary parts of a complex
number, respectively. ⌈x⌋ indicates the closest integer to x. If
x is a complex number, then ⌈x⌋ = ⌈ℜ(x)⌋ + i ⌈ℑ(x)⌋. Im
and 0m are m×m identity and null matrices, respectively.
II. LATTICE-REDUCTION-AIDED DETECTION
A. System Model
We consider a MIMO system with m transmit and n re-
ceive antennas in a rich-scattering flat-fading channel. Spatial
multiplexing is employed, so that data are transmitted as m
substreams of equal rate. These substreams are mapped onto
M-ary QAM symbols. Let x denote the complex-valued m×1
transmitted signal vector, and y the complex-valued n × 1
received signal vector. The baseband model for this MIMO
system is
y = Hx+ n, (1)
where H is the n×m channel matrix: its entries are uncorre-
lated, zero-mean, unit-variance complex circularly symmetric
WANG et al.: SYSTOLIC ARRAYS FOR LATTICE-REDUCTION-AIDED MIMO DETECTION 3
Gaussian fading gains hij , and n is the n × 1 additive
white complex Gaussian noise vector with zero mean and
E[nnH ] = σ2I. The average power of each transmitted signal
xi is assumed to be normalized to 1, i.e., E[xxH ] = I.
Additionally, we assume that the channel matrix entries are
fixed during each frame interval, and the receiver has perfect
knowledge of the realization of H.
B. Linear Detection
In linear detection, the estimated signal xˆ is computed by
premultiplying the received signal y by an m × n “weight
matrix” W. The two most common design criteria for W are
zero-forcing (ZF) and minimum mean-square error (MMSE).
In zero-forcing detection, the weight matrix WZF is set to be
the Moore-Penrose pseudo-inverse H† of the channel matrix
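H, i.e.,
\[ \hat{\mathbf{x}}_{\mathrm{ZF}} = \mathbf{W}_{\mathrm{ZF}}\,\mathbf{y} = \mathbf{H}^{\dagger}\mathbf{y} = \mathbf{x} + \mathbf{H}^{\dagger}\mathbf{n}. \tag{2} \]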
It is known that zero-forcing detection suffers from the noise
enhancement problem, as the channel matrix may be ill-
conditioned. Under the MMSE criterion, the weight matrix
W is chosen in such a way that the mean-squared-error
between the transmitted signal x and the estimated signal xˆ
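is minimized. The mean-squared-error (MSE) is defined as
\[ \mathrm{MSE} \triangleq E\left[\|\mathbf{x}-\hat{\mathbf{x}}\|^{2}\right] = E\left[(\mathbf{x}-\mathbf{W}\mathbf{y})^{H}(\mathbf{x}-\mathbf{W}\mathbf{y})\right]. \]
The weight matrix W that minimizes the MSE is
\[ \mathbf{W}_{\mathrm{MMSE}} = (\mathbf{H}^{H}\mathbf{H} + \sigma^{2}\mathbf{I})^{-1}\mathbf{H}^{H}. \tag{3} \]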
It is well known that, as σ2 → 0, the weight matrix WMMSE
approaches WZF . Since WMMSE takes noise power into
consideration, MMSE detection suffers less from noise en-
hancement than ZF detection. In [8], [32], it is shown that
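MMSE is equivalent to ZF in an extended system model, i.e.,
\[ \hat{\mathbf{x}}_{\mathrm{MMSE}} = \mathbf{W}_{\mathrm{MMSE}}\,\mathbf{y} = \underline{\mathbf{H}}^{\dagger}\underline{\mathbf{y}} = (\underline{\mathbf{H}}^{H}\underline{\mathbf{H}})^{-1}\underline{\mathbf{H}}^{H}\underline{\mathbf{y}}, \tag{4} \]
where the extended channel matrix and the extended received vector are
\[ \underline{\mathbf{H}} = \begin{bmatrix} \mathbf{H} \\ \sigma\mathbf{I}_{m} \end{bmatrix} \quad\text{and}\quad \underline{\mathbf{y}} = \begin{bmatrix} \mathbf{y} \\ \mathbf{0}_{m\times 1} \end{bmatrix}. \tag{5} \]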
Comparing (2) with (4), it follows that the two detection
methods can share the same structure in systolic-array im-
plementation, which we shall elaborate upon in Section IV.
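The equivalence between MMSE detection and ZF detection on the extended model can be checked numerically. The sketch below is only an illustration: the function names mmse_direct and mmse_via_extended_zf are hypothetical, and NumPy's pinv and solve are used in place of any hardware-oriented computation.

```python
import numpy as np

def mmse_direct(H, y, sigma):
    """MMSE estimate using the weight matrix of (3)."""
    m = H.shape[1]
    HH = H.conj().T
    W = np.linalg.solve(HH @ H + sigma**2 * np.eye(m), HH)
    return W @ y

def mmse_via_extended_zf(H, y, sigma):
    """Same estimate via ZF on the extended model of (4)-(5)."""
    m = H.shape[1]
    H_ext = np.vstack([H, sigma * np.eye(m)])      # stack H on top of sigma * I_m
    y_ext = np.concatenate([y, np.zeros(m)])       # stack y on top of m zeros
    return np.linalg.pinv(H_ext) @ y_ext
```

Because the stacked matrix satisfies H_ext^H H_ext = H^H H + σ^2 I and H_ext^H y_ext = H^H y, both functions return the same vector up to floating-point error.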
C. Lattice-Reduction-Aided Linear Detection
The idea underlying lattice reduction is the selection of
a basis for the lattice under some goodness criterion [33].
We first observe that, under the assumption of
QAM transmission, the transmitted vector x is an integer point
of a square lattice (after proper scaling and shifting of the
original QAM constellation). By interpreting the columns of
the channel matrix H as a set of lattice basis vectors, Hx is
also a lattice point. If two basis sets H and H˜ are related by
H˜ = H · T, T a unimodular matrix, they generate the same
set of lattice points. In MIMO detection, the objective of the
lattice reduction algorithm is to derive a better-conditioned
channel matrix H˜. In this paper, we focus on the complex-
valued LLL algorithm [10], [11]. More details about the LLL
algorithm will be provided in Section III.
[Fig. 2 block diagram: x → H → (+ n) → y → H˜† → xˆ → Rounding → xˆq → T → Q(·) → xˆLR; the Lattice Reduction block computes H˜ = H · T.]
Fig. 2. Block diagram of linear lattice-reduction-aided detection
After lattice-reduction of the channel matrix, we can per-
form the linear detection, as described in Section II-B, based
on H˜. Consider ZF first. The estimated signal xˆ can be written
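as
\[ \hat{\mathbf{x}} = \tilde{\mathbf{H}}^{\dagger}\mathbf{y} = \tilde{\mathbf{H}}^{\dagger}\left((\mathbf{H}\mathbf{T})(\mathbf{T}^{-1}\mathbf{x}) + \mathbf{n}\right) = \mathbf{T}^{-1}\mathbf{x} + \tilde{\mathbf{H}}^{\dagger}\mathbf{n}. \tag{6} \]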
Since xˆ is no longer an integer vector, the simplest but subopti-
mal way of estimating T−1x is to round xˆ element-wise to the
nearest integer. Let xˆq be an estimate of T−1x after rounding.
The final step is to transform xˆq back into an estimate of x,
which is done by multiplying xˆq by the unimodular matrix
T. Since the vector entries after the transformation could lie
outside the QAM constellation boundary, we finally quantize
those points outside the boundary to the closest constellation
point, i.e., xˆLR = Q(Txˆq). Fig. 2 shows the block diagram
of LR-aided ZF detection for MIMO. It is easy to see that
the same structure can also be used for MMSE detection, by
simply replacing H and y with the extended channel matrix and the
extended received vector defined in (5), respectively. The remaining operations
are the same as in ZF.
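The LR-aided ZF steps described above (pseudo-inverse of the reduced basis, element-wise rounding, multiplication by T, and quantization to the constellation boundary) can be sketched as follows. This is a minimal illustration under assumptions: the names lr_zf_detect, cround, lo, and hi are hypothetical, the unimodular matrix T is taken as given, and the QAM constellation is assumed to have been scaled and shifted so that its real and imaginary parts are the consecutive integers lo, ..., hi, in which case Q(·) reduces to clipping.

```python
import numpy as np

def cround(z):
    """Round real and imaginary parts separately to the nearest integer."""
    z = np.asarray(z)
    return np.round(z.real) + 1j * np.round(z.imag)

def lr_zf_detect(H, T, y, lo, hi):
    """LR-aided ZF detection for an integer-grid constellation (assumed)."""
    H_red = H @ T                              # reduced basis H~ = H T
    x_hat = np.linalg.pinv(H_red) @ y          # estimate of T^{-1} x, as in (6)
    x_q = cround(x_hat)                        # element-wise rounding
    x_t = T @ x_q                              # back to the original domain
    # Q(.): clip points that fall outside the constellation boundary
    return np.clip(x_t.real, lo, hi) + 1j * np.clip(x_t.imag, lo, hi)
```

For MMSE detection the same function could be called with the extended channel matrix and extended received vector of (5) in place of H and y.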
D. LR-Aided Successive Spatial-Interference Cancellation
Besides being suitable for linear detection, a systolic design
can exploit the regularity of successive spatial-interference
cancellation (SIC). In [8], it is shown that LR-aided SIC
outperforms linear detection methods, while exhibiting a
complexity comparable to that of linear detection. The
LR-aided SIC can be conveniently described in terms of the
QR decomposition of the reduced channel matrix. Here we
summarize briefly the procedure of LR-aided ZF-SIC only, as
the LR-aided MMSE-SIC can be derived in a similar way.
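Let the QR decomposition of the reduced channel matrix be
H˜ = Q˜R˜. Premultiplying y in (1) by the Hermitian transpose of Q˜, we obtain
\[ \mathbf{v} \triangleq \tilde{\mathbf{Q}}^{H}\mathbf{y} = \tilde{\mathbf{R}}\mathbf{z} + \tilde{\mathbf{Q}}^{H}\mathbf{n}, \quad \text{where } \mathbf{z} = \mathbf{T}^{-1}\mathbf{x}. \tag{7} \]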
Then we can solve for z layer by layer starting from the bottom
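to the top, i.e.
\[ \hat{z}_{i} = \left\lceil \frac{v_{i}}{\tilde{r}_{ii}} \right\rfloor, \qquad \mathbf{v} := \mathbf{v} - (\tilde{\mathbf{R}}_{1:i,i})\,\hat{z}_{i}, \tag{8} \]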
where i starts from m to 1 and zˆi is the estimate of each entry
of z.
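The layer-by-layer procedure of (7) and (8) amounts to a back-substitution with rounding. The sketch below is illustrative only: the name lr_zf_sic is hypothetical, NumPy's reduced QR decomposition stands in for the array-based computation, and the update in (8) is applied to the first i entries of v, which is how the subvector notation is interpreted here.

```python
import numpy as np

def lr_zf_sic(H_red, y):
    """LR-aided ZF-SIC estimate of z = T^{-1} x from the reduced basis H_red."""
    Q, R = np.linalg.qr(H_red)             # reduced QR: H_red = Q R
    v = Q.conj().T @ y                      # v = Q^H y, as in (7)
    m = R.shape[1]
    z = np.zeros(m, dtype=complex)
    for i in range(m - 1, -1, -1):          # bottom layer first
        q = v[i] / R[i, i]
        z[i] = np.round(q.real) + 1j * np.round(q.imag)   # nearest Gaussian integer
        v[: i + 1] = v[: i + 1] - R[: i + 1, i] * z[i]    # cancel detected layer, as in (8)
    return z
```

The final estimate of x would then be obtained as in the linear case, by multiplying the result by T and quantizing to the constellation.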
III. TWO VARIANTS OF LLL ALGORITHM
In this section, we introduce two variants of LLL algorithm
which are more time-efficient than the classical LLL algorithm
when using parallel processing. Since systolic arrays yield a
simple form of parallel processing, our systolic array design
for LRAD is based on these two algorithms.
We begin the discussion with the definition of LLL-reduced
lattice. Let H (an n×m matrix) be a set of lattice basis vectors,
with QR decomposition H = QR. The basis set H is complex
LLL-reduced with parameter δ (1/2 < δ < 1), if the following
two conditions are satisfied [10], [11]:
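(a)
\[ \mu_{i,j} \triangleq \frac{r_{i,j}}{r_{i,i}}, \quad |\Re(\mu_{i,j})| \le \frac{1}{2} \ \text{and} \ |\Im(\mu_{i,j})| \le \frac{1}{2}, \quad 1 \le i < j \le m, \tag{9} \]
(b)
\[ \delta - \left|\frac{r_{i-1,i}}{r_{i-1,i-1}}\right|^{2} \le \frac{|r_{i,i}|^{2}}{|r_{i-1,i-1}|^{2}}, \quad 2 \le i \le m. \tag{10} \]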
The second condition in (10) is called the Lovász condition,
and the process to make the basis set satisfy (9) is called size
reduction. In the standard form of LLL algorithm considered
in the literature [8]–[15], size reduction applies only to one
column of H during a single iteration. Now, systolic arrays,
allowing simple parallel processing, are capable of updating
the whole matrix without introducing extra delays. Hence, our
proposed systolic array is first designed based on the LLL
algorithm in a different form, which we call “LLL algorithm
with full size reduction (FSR-LLL).”
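Conditions (9) and (10) can be checked directly on the upper-triangular factor R. The following is a small sketch with hypothetical function names (is_size_reduced, lovasz_holds) and an assumed numerical tolerance tol; it only tests the two conditions and does not perform any reduction.

```python
import numpy as np

def is_size_reduced(R, tol=1e-9):
    """Condition (9): |Re(mu_ij)| <= 1/2 and |Im(mu_ij)| <= 1/2 for all i < j."""
    m = R.shape[1]
    for j in range(1, m):
        for i in range(j):
            mu = complex(R[i, j] / R[i, i])
            if abs(mu.real) > 0.5 + tol or abs(mu.imag) > 0.5 + tol:
                return False
    return True

def lovasz_holds(R, delta, tol=1e-9):
    """Condition (10) for every pair of neighbouring columns."""
    m = R.shape[1]
    for i in range(1, m):
        lhs = delta - abs(R[i - 1, i] / R[i - 1, i - 1]) ** 2
        rhs = abs(R[i, i]) ** 2 / abs(R[i - 1, i - 1]) ** 2
        if lhs > rhs + tol:
            return False
    return True
```

A basis is complex LLL-reduced with parameter δ when both functions return True for its R factor.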
A. LLL algorithm with Full Size Reduction (FSR-LLL)
Table I shows the LLL algorithm with full size reduction.
In the following discussion, we refer to the lines in Table
I. There are three main differences between FSR-LLL and
the conventional complex LLL algorithm1, although the lattice
reduced bases from both algorithms are still the same. First,
the full size reduction (lines 4~10) is executed in each iteration
of the while loop (line 3), which means that all columns of
R and T are size-reduced at the beginning of each iteration.
The advantage here is that, once condition (10) is also fulfilled
after full size reduction (i.e., no k′ is found in line 11), then
the FSR-LLL can immediately end the process (line 20). For
example, suppose that k equals 3 at current iteration. Since all
columns in R and T are size-reduced after full size reduction,
if no k′ can be found in line 11 (a search that a systolic array
can make in parallel), then no further processing is needed
in FSR-LLL. However, in the conventional LLL format, the
process will not end until columns 3 to m have been sequentially
size-reduced. With a systolic-array implementation, FSR-LLL is
faster, and its efficiency is especially apparent when m is large.
The second difference is that the Givens rotation (lines 13~16)
is executed before the column swap (line 17). This is because
the Givens rotation process can work in parallel with full size
reduction, whereas the column swap cannot. This point will
be made clear in Section IV-A. Third, the QR decomposition
QHH = R is considered as the input of the algorithm, instead
1For comparison, interested readers can refer to Table I in [11] for
the conventional complex LLL algorithm. Tables I and II in this paper are
presented in a format similar to that of [11]. All the simulation results
related to the conventional LLL in this paper are also based on that table.
TABLE I
LLL ALGORITHM WITH FULL SIZE REDUCTION
INPUT: Q^H, R    OUTPUT: Q˜^H, R˜, T
(1)   Initialization: Q˜^H = Q^H, R˜ = R, T = I_m
(2)   k = 2
(3)   While k ≤ m
        [Full Size Reduction]
(4)     for j = m, ..., 2
(5)       for i = j−1, ..., 1
(6)         µ_{i,j} = ⌈ r˜_{i,j} / r˜_{i,i} ⌋
(7)         R˜_{1:i,j} := R˜_{1:i,j} − µ_{i,j} R˜_{1:i,i}
(8)         T_{1:m,j} := T_{1:m,j} − µ_{i,j} T_{1:m,i}
(9)       end
(10)    end
(11)    Find the smallest k′ between k ~ m such that
          δ − |r˜_{k′−1,k′} / r˜_{k′−1,k′−1}|^2 > |r˜_{k′,k′}|^2 / |r˜_{k′−1,k′−1}|^2
(12)    If k′ exists
          [Givens Rotation]
(13)      α_1 = r˜_{k′−1,k′} / ‖R˜_{k′−1:k′,k′}‖
(14)      α_2 = r˜_{k′,k′} / ‖R˜_{k′−1:k′,k′}‖
(15)      G = [ α_1*  α_2 ; −α_2  α_1 ]
(16)      R˜_{k′−1:k′,k′−1:m} := G R˜_{k′−1:k′,k′−1:m},  Q˜^H_{k′−1:k′,1:n} := G Q˜^H_{k′−1:k′,1:n}
          [Column Swap]
(17)      Swap columns k′−1 and k′ in R˜ and T
(18)      k := max{k′−1, 2}
(19)    else
(20)      k := m+1
(21)    end
(22)  end
of H = QR. From line 16, the Givens rotation matrix G
applies to the same two rows of QH and R, which simplifies
the design of the systolic array. Additionally, after FSR-LLL,
Q˜H is ready for calculating the pseudoinverse of H˜ for linear
detection.
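The flow of FSR-LLL (full size reduction, search for the first violating column, Givens rotation, column swap) can be sketched in software as follows. This is a sequential illustration under assumptions, not the systolic implementation: the names fsr_lll, full_size_reduction, and lovasz_violated are hypothetical, indices are 0-based, NumPy's QR decomposition provides the input factors, and the Givens matrix uses a conjugated second entry so that it also works when the diagonal of R is not real (with a real diagonal it coincides with the usual real-β form).

```python
import numpy as np

def cround(z):
    """Nearest Gaussian integer of a complex scalar."""
    return np.round(z.real) + 1j * np.round(z.imag)

def full_size_reduction(R, T):
    """Size-reduce every column of R in place and track the operations in T."""
    m = R.shape[1]
    for j in range(m - 1, 0, -1):
        for i in range(j - 1, -1, -1):
            mu = cround(R[i, j] / R[i, i])
            R[: i + 1, j] -= mu * R[: i + 1, i]
            T[:, j] -= mu * T[:, i]

def lovasz_violated(R, k, delta):
    """True if condition (10) fails for the column pair (k-1, k), 0-based."""
    return delta * abs(R[k - 1, k - 1]) ** 2 > abs(R[k, k]) ** 2 + abs(R[k - 1, k]) ** 2

def fsr_lll(H, delta=0.75):
    """Return (QH, R, T) with QH^H R = H T and R LLL-reduced (Lovasz condition)."""
    m = H.shape[1]
    Q, R = np.linalg.qr(H.astype(complex))
    QH = Q.conj().T.copy()
    T = np.eye(m, dtype=complex)
    k = 1                                      # 0-based index of column 2
    while k < m:
        full_size_reduction(R, T)
        kp = next((c for c in range(k, m) if lovasz_violated(R, c, delta)), None)
        if kp is None:                         # nothing left to swap: done
            break
        a, b = R[kp - 1, kp], R[kp, kp]
        G = np.array([[a.conjugate(), b.conjugate()], [-b, a]]) / np.hypot(abs(a), abs(b))
        R[kp - 1:kp + 1, kp - 1:] = G @ R[kp - 1:kp + 1, kp - 1:]   # rotate first ...
        QH[kp - 1:kp + 1, :] = G @ QH[kp - 1:kp + 1, :]
        R[:, [kp - 1, kp]] = R[:, [kp, kp - 1]]                      # ... then swap
        T[:, [kp - 1, kp]] = T[:, [kp, kp - 1]]
        R[kp, kp - 1] = 0.0                    # remove rounding residue below the diagonal
        k = max(kp - 1, 1)
    return QH, R, T
```

Rotating before swapping works because G is built from column k′, so that after the swap the new column k′−1 has a zero below its diagonal entry and R stays upper triangular.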
B. All-Swap Lattice Reduction (ASLR) Algorithm
The ASLR algorithm is a variant of the LLL algorithm, and
was first proposed for real number lattices only [30]. Table II
describes its extension to a complex version. One significant
difference between FSR-LLL and ASLR is that every pair
of columns k and k − 1 with even (or odd) index k could
be swapped simultaneously. The algorithm begins with full
size reduction, which is the same as in FSR-LLL. The Givens-rotation
and column-swap operations (same as in Table I, lines 13~17)
are then executed for all even (odd) k that violate
condition (10), after which another iteration starts with the
indicator variable “order” set to odd (even). If condition (10)
holds for all even (odd) k, no Givens rotation or column swap
is executed, and we can immediately check
all odd (even) k instead. In that case, matrix R is already full-size
reduced, with no need to start the next iteration with full size
TABLE II
ALL SWAP LATTICE REDUCTION ALGORITHM
INPUT: Q^H, R    OUTPUT: Q˜^H, R˜, T
(1)   Initialization: Q˜^H = Q^H, R˜ = R, T = I_m
(2)   order = EVEN
(3)   While (any swap is possible in lines (9) or (16))
        [Full Size Reduction]
(4)     Execute lines 4 ~ 10 in Table I
(5)     If order = EVEN
(6)       If δ − |r˜_{k−1,k} / r˜_{k−1,k−1}|^2 ≤ |r˜_{k,k}|^2 / |r˜_{k−1,k−1}|^2 for all even k
(7)         go to line (13)
(8)       else
            [Givens Rotation and Column Swap]
(9)         Execute lines 13~17 in Table I
              for all even k between 2 ~ m
              such that |r˜_{k−1,k}|^2 + |r˜_{k,k}|^2 < δ |r˜_{k−1,k−1}|^2
(10)        order = ODD
(11)      end
(12)    else
(13)      If δ − |r˜_{k−1,k} / r˜_{k−1,k−1}|^2 ≤ |r˜_{k,k}|^2 / |r˜_{k−1,k−1}|^2 for all odd k
(14)        go to line (6)
(15)      else
(16)        Execute lines 13~17 in Table I
              for all odd k between 2 ~ m
              such that |r˜_{k−1,k}|^2 + |r˜_{k,k}|^2 < δ |r˜_{k−1,k−1}|^2
(17)        order = EVEN
(18)      end
(19)    end
(20)  end

TABLE III
FPGA IMPLEMENTATION RESULTS
Target algorithm            ASLR                       FSR-LLL                    CLLL [11]
Device                      Virtex 5     Virtex 6      Virtex 5     Virtex 6      Virtex 4     Virtex 5
Slices                      2322/20480   1812/20000    2335/20480   1798/20000    3617/67584   1712/17280
Clock frequency             160 MHz      249 MHz       155 MHz      247 MHz       140 MHz      163 MHz
Avg. cycles, SQRD           80           80            84           84            130          130
Avg. time, SQRD             500.0 ns     321.3 ns      541.9 ns     340.1 ns      928.6 ns     797.5 ns
Avg. cycles, QRD            146          146           164          164           (not given)  (not given)
Avg. time, QRD              912.5 ns     586.3 ns      1058.1 ns    664.0 ns      (not given)  (not given)
Average cycles and times are per channel matrix. Part numbers for the ASLR and FSR-LLL designs: XC5VFX130T (Virtex 5) and XC6VLX130T (Virtex 6).
reduction (Table II, line 7 or 14). If neither an even nor odd
k violates condition (10) after full size reduction, the ASLR
process ends.
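The even/odd alternation of ASLR can be sketched in software as follows. This is a sequential illustration under assumptions rather than the parallel array operation: the names aslr, full_size_reduction, violated, and rotate_and_swap are hypothetical, indices are 0-based (so 1-based even k corresponds to 0-based 1, 3, ...), and the simultaneous swaps are emulated by processing the violating pairs one after another, which gives the same result because pairs of equal parity do not overlap. When the current parity has no violation the sketch simply switches parity without modifying R, mirroring the jumps in lines 7 and 14 of Table II.

```python
import numpy as np

def cround(z):
    """Nearest Gaussian integer of a complex scalar."""
    return np.round(z.real) + 1j * np.round(z.imag)

def full_size_reduction(R, T):
    """Size-reduce every column of R in place and track the operations in T."""
    m = R.shape[1]
    for j in range(m - 1, 0, -1):
        for i in range(j - 1, -1, -1):
            mu = cround(R[i, j] / R[i, i])
            R[: i + 1, j] -= mu * R[: i + 1, i]
            T[:, j] -= mu * T[:, i]

def violated(R, k, delta):
    """True if condition (10) fails for the column pair (k-1, k), 0-based."""
    return delta * abs(R[k - 1, k - 1]) ** 2 > abs(R[k, k]) ** 2 + abs(R[k - 1, k]) ** 2

def rotate_and_swap(QH, R, T, k):
    """Givens rotation on rows k-1, k followed by a swap of columns k-1, k."""
    a, b = R[k - 1, k], R[k, k]
    G = np.array([[a.conjugate(), b.conjugate()], [-b, a]]) / np.hypot(abs(a), abs(b))
    R[k - 1:k + 1, k - 1:] = G @ R[k - 1:k + 1, k - 1:]
    QH[k - 1:k + 1, :] = G @ QH[k - 1:k + 1, :]
    R[:, [k - 1, k]] = R[:, [k, k - 1]]
    T[:, [k - 1, k]] = T[:, [k, k - 1]]
    R[k, k - 1] = 0.0

def aslr(H, delta=0.75):
    """Return (QH, R, T) with QH^H R = H T after all-swap lattice reduction."""
    m = H.shape[1]
    Q, R = np.linalg.qr(H.astype(complex))
    QH = Q.conj().T.copy()
    T = np.eye(m, dtype=complex)
    start = 1                 # 1 means order = EVEN (1-based k = 2, 4, ...); 2 means ODD
    while True:
        full_size_reduction(R, T)
        todo = [k for k in range(start, m, 2) if violated(R, k, delta)]
        if not todo:          # nothing to swap for this parity: check the other one
            start = 3 - start
            todo = [k for k in range(start, m, 2) if violated(R, k, delta)]
            if not todo:      # neither parity violates (10): finished
                break
        for k in todo:        # disjoint pairs, so the order of processing is irrelevant
            rotate_and_swap(QH, R, T, k)
        start = 3 - start
    return QH, R, T
```

In hardware the rotations for all violating pairs of one parity would be carried out at the same time, which is where the throughput advantage over FSR-LLL is expected to come from.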
C. Replacing Lovász condition with Siegel condition
From the previous discussion, it is clear that all basis vectors
are size reduced within one processing iteration of full size
reduction. Additionally, according to line 11 in Table I and
lines 6 and 13 in Table II, the lattices processed by FSR-LLL
and ASLR both satisfy the Lovász condition in (10). Therefore,
we can conclude that these two algorithms also generate
LLL-reduced lattices. Consequently, like the conventional LLL,
FSR-LLL-aided and ASLR-aided detection also achieve full
receive diversity in MIMO systems [11], [12].
The Lovász condition involves two diagonal elements and
one off-diagonal element in the matrix R. In order to simplify
the data communication between processing elements in the
systolic array, we relax the Lovász condition by replacing it
with
\[
\delta - \frac{1}{2} \le \frac{|r_{i,i}|^2}{|r_{i-1,i-1}|^2}, \quad 2 \le i \le m, \tag{11}
\]
where δ lies in the range (1/2, 1), the same as for the Lovász
condition. Condition (11) is also called the Siegel condition [31],
and it is weaker than the Lovász condition because
\[
\delta - \frac{1}{2} \le \delta - \left| \frac{r_{i-1,i}}{r_{i-1,i-1}} \right|^2 \le \frac{|r_{i,i}|^2}{|r_{i-1,i-1}|^2}, \quad 2 \le i \le m. \tag{12}
\]
The first inequality follows from (9). A similar approximation
to (11) can be found in [34]. The advantage of this new
condition is that it involves only two neighboring diagonal
elements of R. The impact of this condition on the design of
the systolic array is discussed further in Section IV. Another
advantage is that the check can be performed after taking the
square root of both sides of (11). In a hardware implementation,
this means that we can save precision bits by storing
$|r_{i,i}|/|r_{i-1,i-1}|$ rather than $|r_{i,i}|^2/|r_{i-1,i-1}|^2$.
Additionally, the condition check can be done without a division,
simply by comparing $|r_{i,i}|$ with $\sqrt{\delta - 1/2}\,|r_{i-1,i-1}|$,
where $\sqrt{\delta - 1/2}$ is a constant that is precomputed
once δ is determined. In the balance of this paper, when we
refer to FSR-LLL and ASLR we mean FSR-LLL and ASLR with
the Siegel condition.
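The division-free check described above can be illustrated with a minimal sketch. The function names `siegel_holds` and `swap_needed` are hypothetical and not part of the original design; floating-point numbers stand in for the fixed-point values used in hardware.

```python
import math

def siegel_holds(r_prev, r_cur, delta=0.99):
    """Return True if condition (11) holds for the neighboring diagonal
    entries r_{i-1,i-1} (r_prev) and r_{i,i} (r_cur)."""
    # sqrt(delta - 1/2) depends only on delta, so it can be stored once.
    threshold = math.sqrt(delta - 0.5)
    # Division-free form of (11): |r_ii| >= sqrt(delta - 1/2) * |r_{i-1,i-1}|
    return abs(r_cur) >= threshold * abs(r_prev)

def swap_needed(r_diag, delta=0.99):
    """List the 1-based indices i (2..m) at which condition (11) fails.
    r_diag[i-1] holds the diagonal entry r_{i,i}."""
    return [i for i in range(2, len(r_diag) + 1)
            if not siegel_holds(r_diag[i - 2], r_diag[i - 1], delta)]
```

With δ = 0.99 the stored constant is approximately 0.7, so a diagonal entry passes the check when its magnitude is at least about 70% of that of its predecessor.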
Since Siegel condition is weaker than Lovász condition,
one might expect the performance of the lattice reduction
algorithm with condition (11) to be worsened. Yet, by a
proof similar to that in [11], [12] we can show that the
LLL algorithm with Siegel condition also achieves maximum
receive diversity in MIMO systems. In the proof of LLL-aided
detection achieving full diversity [11], [12], the key step and
the only step involving the LLL-reduced conditions is that the
orthogonality defect κ (κ ≥ 1) of the LLL-reduced basis set
H is upper bounded by
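\[
\kappa \triangleq \frac{\prod_{i=1}^{m} \|\mathbf{h}_i\|^2}{\det(\mathbf{H}^H \mathbf{H})} \le 2^{-m} \left( \frac{2}{2\delta - 1} \right)^{\frac{m(m+1)}{2}}, \tag{13}
\]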
where the h_i are the columns of H. In particular, (13) also
holds for lattices reduced by the LLL algorithm with the Siegel
condition. This can be justified by the same proof as in [11,
Appendix B]; the details are omitted in this paper.
Hence, the LLL algorithm with the Lovász condition replaced
by the Siegel condition also achieves maximum diversity in
MIMO systems. However, achieving maximum receive diversity
does not automatically imply that the bit-error-rate (BER)
performance is as good as with the conventional LLL algorithm.
One can easily observe that if δ is very close to 1/2,
condition (11) is almost always true. Thus, the Givens rotation
and column swap steps in the reduction algorithm would
seldom be performed, which makes the BER performance
much worse than with the conventional LLL. Conversely,
as δ approaches 1, one can expect the performance of FSR-LLL
and ASLR to be closer to that of the conventional LLL. In Fig. 3,
we show the empirical cumulative probability functions of the
orthogonality defect κ for 4 × 4 channel matrices under three
different reduction algorithms. The results of FSR-LLL and
ASLR overlap for all three values of δ, which implies that the
effects of these two methods on lattice reduction are almost the
same. For δ = 0.99, FSR-LLL and ASLR give a result close
to that of LLL with δ = 0.75, which is a very common setting
in previous works [8], [9], [12]. For δ = 0.51
and 0.75, the gap between LLL and FSR-LLL (ASLR) is much
Fig. 3. The empirical cumulative probability functions of the orthogonality
defect κ for the 4 × 4 channel matrices under three different reduction
algorithms. [Plot: empirical cumulative probability function versus
orthogonality defect κ; curves for LLL, ASLR, FSR-LLL, and no reduction,
with δ = 0.51, 0.75, and 0.99.]
larger than for δ = 0.99. In Section IV-C, we will show that
for δ equal to 0.99, the BER performance of LR-aided linear
detection using FSR-LLL and ASLR is not worse than that
obtained using the conventional LLL with the same δ value. Based on
these results, in our systolic array design we choose δ = 0.99.
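For readers who wish to reproduce curves such as those in Fig. 3, the orthogonality defect κ defined in (13) can be evaluated as in the following sketch. It assumes linearly independent columns and obtains det(H^H H) as the product of the squared norms of the Gram-Schmidt vectors; the function name `orthogonality_defect` is hypothetical.

```python
def orthogonality_defect(columns):
    """kappa = prod_i ||h_i||^2 / det(H^H H) for a list of column vectors.

    det(H^H H) equals the product of the squared norms of the Gram-Schmidt
    vectors, which avoids a general determinant routine.
    """
    def inner(u, v):                      # <u, v> = sum_k conj(u_k) * v_k
        return sum(a.conjugate() * b for a, b in zip(u, v))

    ortho = []                            # Gram-Schmidt vectors
    norm_prod = 1.0                       # running product of ||h_i||^2
    gram_det = 1.0                        # running product of ||h_i^*||^2
    for h in columns:
        h = [complex(x) for x in h]
        norm_prod *= inner(h, h).real
        g = list(h)
        for q in ortho:                   # remove components along earlier vectors
            coeff = inner(q, h) / inner(q, q).real
            g = [gk - coeff * qk for gk, qk in zip(g, q)]
        gram_det *= inner(g, g).real
        ortho.append(g)
    return norm_prod / gram_det
```

An orthogonal basis gives κ = 1, and κ grows as the basis vectors become more nearly parallel, which is why smaller values indicate a better reduced basis.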
IV. SYSTOLIC ARRAY FOR TWO LATTICE-REDUCTION
ALGORITHMS
From Fig. 2, the whole process of LRAD can be viewed
as taking two steps: lattice reduction for the channel matrix,
and detection. In this section, we exhibit our systolic array
design for the LLL lattice reduction algorithm. The ensuing linear
detection or SIC on the systolic array will be discussed in Section
V. In the following discussion, we assume that the channel
matrix has been QR decomposed. It is known that QRD
can be implemented in a systolic array based on a series of
Givens rotations, since Givens rotations can be executed in
a parallel manner [20]–[22]. Since the conventional systolic
array for QRD usually contains square root operations, which
are computationally intensive in hardware implementation,
a square-root-free systolic QRD based on squared Givens
rotations (SGR) can be used (interested readers can refer
to [29], [35]). In [8], it is also shown that the sorted QRD
(SQRD) can reduce the number of column swaps in the LLL
algorithm, and hence leads to less processing time. However,
SQRD also requires higher hardware complexity and latency
than the conventional QRD [36].
A. Systolic Array for FSR-LLL
In the following, we assume a 4 × 4 MIMO system (i.e.,
m = 4, n = 4) and illustrate the proposed systolic algorithm
in three parts: full size reduction, Givens rotation, and column
swap.
1) Full Size Reduction: The systolic array for the remaining
parts of LRAD is shown in Fig. 4(a). Four different kinds of
PEs are used, viz., diagonal cells, off-diagonal cells, vectoring
cells, and rotation cells. For the full size reduction part, only
diagonal and off-diagonal cells are needed; the operations
of these two types of PEs are shown in detail in Fig. 4(b).
The vectoring and rotation cells will be introduced with
the Givens rotation description. There is a slight difference
between the off-diagonal cells in the upper-triangular part and
those in the lower-triangular part. Fig. 4(b) shows only the
off-diagonal cell in the upper-triangular part. The off-diagonal
cells in the lower-triangular part have y_in and c_in come from
the top, while c_out leaves from the bottom. Except for this
minor difference in the data interface, their operations are
the same as those of the off-diagonal cells in the upper-triangular part.
Additionally, in Fig. 4(b) the dotted lines represent the logic
control signals transmitted between cells, and the solid lines
represent the data transmitted. To initialize the process, each
element of the matrices R and QH (denoted as r and q,
respectively, in Fig. 4(b)) from the QR decomposition is stored
in the PE at the corresponding position. For example, qi,i and
ri,i are stored in the corresponding diagonal cell Dii. The
off-diagonal elements qi,j and ri,j are stored in the off-diagonal
cell Oij. Additionally, the elements of the unimodular matrix
T (denoted as t in Fig. 4(b)) are also stored in the array, with
T initially set to the identity matrix.
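As a point of reference for what the array computes in this stage, the following sequential sketch performs a full size reduction on R and applies the same column operations to T. It is not a model of the systolic data flow. It assumes that condition (9) bounds the real and imaginary parts of r_{i,j}/r_{i,i} by 1/2 in magnitude, that µ is obtained by rounding both parts to the nearest integers, and that columns are visited from j = m down to 2; the function names are hypothetical.

```python
def complex_round(z):
    """Round the real and imaginary parts separately to the nearest integers."""
    return complex(round(z.real), round(z.imag))

def full_size_reduction(R, T):
    """Size-reduce every column of the upper-triangular matrix R in place.

    R and T are m x m lists of lists of complex numbers. T accumulates the
    column operations, so the reduced matrix equals (original R) * T.
    """
    m = len(R)
    for j in range(m - 1, 0, -1):          # columns m, ..., 2 (0-based index j)
        for i in range(j - 1, -1, -1):     # reduce against columns j-1, ..., 1
            mu = complex_round(R[i][j] / R[i][i])
            if mu == 0:
                continue
            for row in range(i + 1):       # column i of R is zero below row i
                R[row][j] -= mu * R[row][i]
            for row in range(m):
                T[row][j] -= mu * T[row][i]
    return R, T

def is_size_reduced(R, tol=1e-9):
    """Check |Re(r_ij / r_ii)| <= 1/2 and |Im(r_ij / r_ii)| <= 1/2 for i < j."""
    m = len(R)
    return all(abs((R[i][j] / R[i][i]).real) <= 0.5 + tol and
               abs((R[i][j] / R[i][i]).imag) <= 0.5 + tol
               for j in range(m) for i in range(j))
```

Because the operation on column j only subtracts multiples of columns with smaller index, R stays upper triangular throughout.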
Fig. 5 shows the overall process of the full size reduction
in the systolic array. In this stage, two major processing modes
are defined in each diagonal and off-diagonal cell, the size
reduction mode and the data mode, as detailed in Fig. 4(b). In
the size reduction mode, the objective of each cell is to make
condition (9) valid. In the data mode, on the other hand, the
cell only performs data propagation. The cell selects its mode
depending on the occurrence of the logic control
signal “#”. For simplicity, we assume the cells execute all
operations in the data mode or the size reduction mode in one
normalized cycle.² At T = 0, the external controller sends
the logic control signal “#” to cell D33 through cell D44. At
T = 1, cell D33 works in the data mode due to the control
signal “#” and spreads the “#” logic control signal to
the 3 neighboring cells. Meanwhile, D33 sends out the data
(r3,3, t3,3)(∗) to cell O34. Note that the superscript (∗) is a tag
bit attached to the data, which indicates that the data are sent
out by a diagonal cell. The occurrence of a tag bit (∗) drives
the off-diagonal cell to compute µ and to use µ to update the data
stored in that cell. As a result, at T = 2, cell O34 sends out
the newly computed µ to the two neighboring cells O24 and
D44. At the next time instant (T = 3), the µ signal generated by
O34 meets the data coming from cell O23 (O43) inside cell
O24 (D44), and the size reduction update is executed. At the same
time instant, data (r2,2, t2,2)(∗) enter cell O23. As cell O34
did at T = 2, cell O23 computes µ, updates (r2,3, t2,3), and
sends out µ to the neighboring cells O13 and D33. The most
important fact here is that cell O23 also propagates the data
(r2,2, t2,2)(∗) to cell O24, and thus starts the column operations
between column 2 and column 4 at T = 4. Similarly, the
column operations between column 1 and column 4 begin at
T = 6 as (r1,1, t1,1)(∗) enter cell O14. Essentially, full size
reduction is a series of column operations between column j
and columns j − 1, j − 2, · · · , 1, for all 2 ≤ j ≤ m, and we
² The real hardware cycle counts could be multiples of the normalized cycle.
Fig. 4. (a) The systolic array for the linear LRAD of a 4 × 4 MIMO system.
(b) The operations of diagonal and off-diagonal cells in the systolic array.
(“*” is an indicator bit used to control the flow of the algorithm, as explained
in Section IV-A.)
can conclude the following facts for an m×m MIMO system:
[Fact 1] In this systolic flow, the column operation between
column j and column i (i < j) begins at T = m + j − 2i, as
(ri,i, ti,i)(∗) enters cell Oij.
Proof: Data (ri,i, ti,i)(∗) leave cell Dii at T = m − i,
and it takes j − i cycles for (ri,i, ti,i)(∗) to propagate from
cell Dii to cell Oij.
[Fact 2] All column operations on column j end at T =
2m + j − 3 in cell Omj.
Proof: In this systolic flow, the last column operation on
column j is always between column j and column 1, which
starts at T = m + j − 2 in cell O1j according to Fact 1. It
takes m − 1 more cycles to propagate µ from cell O1j to cell
Omj and finish the column operation.
Fig. 5. Flow chart of the full size reduction operations in the systolic array.
[Snapshots of the array at T = 0 through T = 9, showing the tagged data
(ri,i, ti,i)(∗), the data (ri,j, ti,j), the “#” signal, µ, and the cells in
data mode and in size reduction mode.]
[Fact 3] The full size reduction ends at T = 3m − 3, when
all updates on column m are done.
Proof: The full size reduction ends when column m finishes
all its column operations. Therefore, it follows from
Fact 2 with j = m that the last step is at T = 3m − 3.
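The three facts can be checked against the 4 × 4 walkthrough of Fig. 5 with a few lines of code. The sketch below simply encodes the closed-form expressions stated in Facts 1 to 3; the function names are hypothetical.

```python
def column_op_start(m, i, j):
    """Fact 1: cycle at which (r_ii, t_ii)^(*) enters cell O_ij, for i < j."""
    return m + j - 2 * i

def column_done(m, j):
    """Fact 2: cycle at which all column operations on column j have ended."""
    return 2 * m + j - 3

def full_size_reduction_cycles(m):
    """Fact 3: cycle at which the full size reduction ends."""
    return 3 * m - 3

# 4 x 4 example from the text: the operation between columns 3 and 4 starts
# at T = 2, between columns 2 and 3 at T = 3, between columns 2 and 4 at
# T = 4, and between columns 1 and 4 at T = 6; the reduction ends at T = 9.
schedule = {(i, j): column_op_start(4, i, j)
            for j in range(2, 5) for i in range(1, j)}
```

For any m, the last column operation on column j starts at `column_op_start(m, 1, j)`, and adding the m − 1 propagation cycles of Fact 2 gives `column_done(m, j)`.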
Referring back to the example mentioned in Section III-A,
we can now see more concretely the advantage of
FSR-LLL over the conventional LLL form when a systolic
array is used. If FSR-LLL is applied, the systolic array takes
a total of 3m − 3 cycles to complete all the processes. For
non-systolic LLL, however, it takes 2m + j − 3 cycles to process column
j, and the column operations cannot be done in parallel. So
the total time to perform size reduction in non-systolic LLL
would be $\sum_{j=3}^{m} (2m + j - 3) = 2.5m^2 - 6.5m + 3$ cycles
in that example. In this case, as m increases beyond 3, the
advantage of FSR-LLL over the conventional format becomes
significant.
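For completeness, the closed form of the sum quoted above can be obtained in a few steps, and a numerical instance for m = 4 (a value chosen here purely for illustration) shows the size of the gap:

```latex
\begin{align*}
\sum_{j=3}^{m} (2m + j - 3)
  &= (m-2)(2m-3) + \sum_{j=3}^{m} j \\
  &= \left(2m^2 - 7m + 6\right) + \left(\frac{m(m+1)}{2} - 3\right) \\
  &= 2.5m^2 - 6.5m + 3 .
\end{align*}
% Example, m = 4: the sum is 8 + 9 = 17 cycles, and 2.5(16) - 6.5(4) + 3 = 17,
% compared with 3m - 3 = 9 cycles for the systolic FSR-LLL.
% For m = 3 both expressions equal 6, which is why the gap opens only beyond m = 3.
```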
2) Givens Rotation: As mentioned in Section III-C, we
use Siegel condition in the lattice reduction algorithm, which
only relates two r elements in the neighboring diagonal
cells. Hence, this condition can be checked during a full
size reduction step. For example, in Fig. 5 at T = 1,
Fig. 6. The operations of vectoring cells and rotation cells in the systolic
array.
cell D33 sends data r3,3 to cell D22 along with the “#”
signal. At the next time instant, cell D22 checks this
condition based on $|r_{3,3}|^2/|r_{2,2}|^2$, and also generates the logic
control signal “swap” (see Fig. 4(b)). If δ − 1/2 is greater
than $|r_{i,i}|^2/|r_{i-1,i-1}|^2$, then “swap” is “true” and drives
the vectoring cell to work. The operations of vectoring and
rotation cells are shown in Fig. 6. The vectoring cell zeros out
the input data β by the Givens rotation matrix G, which is
calculated based on Table I, lines 13 to 15. The rotation cell
simply rotates the input data by the angle Θ given by the
neighboring vectoring cell. Hence, the vectoring and rotation
cells also work in a systolic way, with the rotation angle
Θ propagating between cells. As shown in Fig. 4(a), there
are 3 rotation cells and 1 vectoring cell between every two
consecutive rows of the systolic array. These cells perform the
Givens rotation on the R and QH data in those two rows. The
vectoring cell is located between cells Dii and Oi−1,i because
the Givens rotation step is executed prior to the column-swap
step in FSR-LLL, and the data ri,i must be zeroed so that the
matrix R is still upper triangular after the column swap.
Note that Givens rotation applies only to rows k′ and k′ − 1
during one iteration of FSR-LLL, if k′ exists (lines 13~16
in Table I). However, every Dii (i = 1, · · · , m − 1) could
generate the “swap” signal during the full size reduction step.
Therefore, we need direct access from the external controller
to each diagonal cell in order to control the data path between
the diagonal cell and the vectoring cell. Namely, only cell
Dk′k′ can pass the “swap” signal to the vectoring cell and
have the Givens rotation performed on rows k′ and k′ − 1. In Fig. 4(a),
we use a “switch” symbol between each pair of a diagonal cell
and a vectoring cell to represent the control by the external
controller. Only one switch is turned on during one iteration.
Additionally, a Givens rotation on rows k′ and k′ − 1
can begin right after rk′−1,k′ is updated during the full size
reduction step. For example, r3,4 is updated at T = 2 as shown
in Fig. 5, and the Givens rotation on rows 3 and 4 could start as
early as T = 3 without any interference with the remaining
operations of full size reduction. In this way, the time necessary
to perform Givens rotations can be partially hidden by the
full size reduction, and this is the reason why we want the
Givens rotation to occur prior to the column swap in our design.
For hardware implementation, one could consider using only
one rotation cell between every two neighboring rows of the
systolic array to reduce the hardware complexity. This will not
lead to a significant increase in time if Givens rotation and
full size reduction are performed in parallel.
3) Column swap: The columns k′ and k′ − 1 of R (and T)
should be swapped after the Givens rotation is done. However,
the column swap can be partially overlapped in
time with size reduction and Givens rotation. For example, the
column swap could begin after R has been rotated but before
QH is updated, since there is no need to swap columns of
QH.
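The rotate-then-swap order can be illustrated with a small sequential sketch. It uses a generic unitary Givens rotation that zeros the diagonal entry of column k and then swaps the two columns; the exact form of G in Table I, lines 13 to 15, may differ from the one assumed here, and the function name `rotate_and_swap` is hypothetical. Only R is handled; in the array the same rotation is applied to the rows of QH and the same swap to the columns of T.

```python
import math

def rotate_and_swap(R, k):
    """Zero R[k][k] by rotating rows k-1 and k, then swap columns k-1 and k.

    R is a list of lists of complex numbers, upper triangular on entry, and
    k is a 0-based column index with k >= 1. R is modified in place.
    """
    a, b = R[k - 1][k], R[k][k]
    norm = math.hypot(abs(a), abs(b))
    c, s = a / norm, b / norm
    # G = [[conj(c), conj(s)], [-s, c]] is unitary and maps (a, b) to (norm, 0).
    for col in range(len(R[0])):
        top, bottom = R[k - 1][col], R[k][col]
        R[k - 1][col] = c.conjugate() * top + s.conjugate() * bottom
        R[k][col] = -s * top + c * bottom
    # After the swap, the zeroed entry sits below the diagonal of column k-1,
    # so R is upper triangular again.
    for row in R:
        row[k - 1], row[k] = row[k], row[k - 1]
    return R
```

Because the zero is created before the swap, no rotation is needed afterwards, which is consistent with the reason given above for placing the vectoring cell between Dii and Oi−1,i.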
The FSR-LLL stops when no column swap is possible,
i.e., when a k′ in Table I, line 11, does not exist. The system flow
(lines 3, 18 and 20 in Table I) is controlled by the external
processor. The lattice-reduced matrices $\tilde{R}$ and $\tilde{Q}^H$ and the
unimodular matrix T stay in the PEs. The systolic array
along with these matrices will be used for linear detection,
as described in Section V below.
B. All-Swap Lattice Reduction (ASLR) Algorithm
The ASLR algorithm can also be performed by the systolic
array shown in Fig. 4(a). The process of full size reduction
is the same as in Fig. 5. During full size reduction,
the Siegel condition is also checked in each diagonal cell
D11~Dm−1,m−1. If the current value of “order” is even (odd),
then the “switch” between each cell Dk−1,k−1 with even (odd)
index k and the vectoring cell is turned on by the external
controller. Consequently, for every even (odd) index k, the Givens
rotation between rows k − 1 and k can be executed if
needed. As for the column swap step, more than one pair
of columns could be swapped during one iteration, but all
these pairs are swapped in parallel. Hence, the time spent
on column swaps is the same as that spent on swapping a single pair
of columns. Based on this observation, we can expect the
systolic ASLR to work more efficiently than the systolic FSR-LLL.
Comparisons between these two algorithms in terms of
bit-error-rate performance and of efficiency in execution time
are deferred to the next subsection.
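The role of the external controller in ASLR can be sketched as a selection of the indices whose switches are closed in one iteration. The code below is only an illustration under the Siegel condition (11); the function names `aslr_switches` and `next_order` and the string values "EVEN" and "ODD" are hypothetical.

```python
def aslr_switches(r_diag, order, delta=0.99):
    """Return the 1-based indices k whose switch is closed in this iteration.

    r_diag[k-1] holds r_{k,k}, and order is "EVEN" or "ODD". An index k in
    2..m is selected when it has the requested parity and condition (11)
    fails for the pair (r_{k-1,k-1}, r_{k,k}).
    """
    parity = 0 if order == "EVEN" else 1
    bound = delta - 0.5
    return [k for k in range(2, len(r_diag) + 1)
            if k % 2 == parity
            and abs(r_diag[k - 1]) ** 2 < bound * abs(r_diag[k - 2]) ** 2]

def next_order(order):
    """Alternate between the even and the odd phase."""
    return "ODD" if order == "EVEN" else "EVEN"
```

Indices of equal parity select disjoint column pairs (for example, columns 1 and 2 together with columns 3 and 4), so the selected rotations and swaps do not touch the same data and can proceed in parallel.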
Note that in our description we limit the application of
this systolic array to an m × m MIMO system. For
m × m MMSE-LRAD, although the matrix QH is m × 2m
(the extended channel model in (5)), we can treat the submatrix
$Q^H_{1:m,(m+1):2m}$ as another square matrix, and store each
element of $Q^H_{1:m,(m+1):2m}$ in the PE at the corresponding
position. Namely, qi,j and qi,j+m are stored in the same
PE, which keeps the systolic array square.
C. Comparison between FSR-LLL and ASLR algorithm
First, we compare the two algorithms in terms of bit-error-rate
(BER) performance, and also compare them with the
conventional LLL algorithm. In our simulation, 4-QAM is
assumed for the transmitted symbols. The constant δ is set to
0.99 in all algorithms for a fair comparison. Let Eb be defined
as the equivalent energy per bit at the receiver, so that
$E_b/N_0 = m/(\sigma^2 \log_2 M)$. Fig. 7(a) shows the BER results
of minimum mean-square-error (MMSE) LRAD (in 4 × 4 and 8 × 8
MIMO systems) based on FSR-LLL (denoted as MMSE-FSR),
the ASLR algorithm (denoted as MMSE-ASLR), and the LLL
algorithm (denoted as MMSE-LLL). The BER results for ML
detection and MMSE without lattice reduction are also shown
for comparison. With δ = 0.99, FSR-LLL and ASLR work as
Fig. 7. BER performance of FSR-LLL- and ASLR-based MMSE LRAD.
(a) Linear detection (4 × 4 and 8 × 8 MIMO systems). (b) SIC (a 4 × 4 MIMO
system). [Plots: bit-error-rate (BER) versus Eb/N0 (dB).]
well as the LLL algorithm, and even slightly better in the case of
m = 8. This clearly shows that using the insignificantly weaker
Siegel condition does not deteriorate the BER performance
of linear detection in a MIMO system as compared to the
conventional LLL. In Fig. 7(b), the BER performance of a
4 × 4 MIMO system using LR-aided MMSE SIC based on
different lattice reduction algorithms is shown. Unlike in the
linear detection case, the LLL-aided SIC works better than the
other two algorithms. Since the detection of the first layer in
SIC dominates the overall performance, this implies that, due to
the Siegel condition, the FSR-LLL-reduced or ASLR-reduced
channel provides a lower SNR for the first layer in SIC than
the one given by the conventional LLL. Additionally, FSR-LLL
and ASLR lead to almost the same results in all three
MIMO systems, which is consistent with the results in Fig. 3.
Hence, we can conclude that although FSR-LLL and ASLR
give different lattice-reduced matrices, the LRAD schemes based on
these two algorithms have very similar BER performance.
Next, we compare the efficiency of the systolic array for
both algorithms. It is known that the number of iterations
of FSR-LLL and ASLR depends on the condition number of
the channel matrix. If H is well-conditioned, lattice reduction
Fig. 8. The average number of column swaps in FSR-LLL, ASLR and LLL-aided
MMSE detection in m × m MIMO system with Eb/N0 fixed at 20 dB.
[Plot: average number of column swaps versus m; curves for FSR-LLL (δ = 0.99),
ASLR (δ = 0.99), LLL (δ = 0.99), and LLL (δ = 0.75).]
Fig. 9. The average number of floating point operations in FSR-LLL, ASLR
and LLL-aided MMSE detection in m × m MIMO system with Eb/N0 fixed
at 20 dB. [Plot: average number of flops versus m; same four curves as in Fig. 8.]
takes fewer iterations, and thus fewer cycles in the systolic array.
Since both algorithms begin with full size reduction, the total
execution time is fully determined by the number of column
swaps in the overall process. Fewer column swaps imply
fewer iterations. Fig. 8 shows the average number of column
swaps in FSR-LLL- and ASLR-aided MMSE detection (with
Eb/N0 fixed at 20 dB) in m × m MIMO systems (m = 4~16).
Note that for ASLR we count all the even or odd column
swaps during one iteration as only one swap, since they are
executed in parallel. In a 4 × 4 MIMO system, the difference between
the two algorithms is almost negligible. However, as the
number of antennas grows, the advantage of ASLR becomes
significant. For m ≥ 8, ASLR needs less than 65% of the column
swaps of FSR-LLL. Based on the BER performance
and time-efficiency comparisons, ASLR should be the better
algorithm to apply on our systolic array, especially with
a large number of antennas.
For comparison, the results of the conventional LLL with
δ = 0.99 and 0.75 are also shown in Fig. 8. As expected,
Fig. 10. Comparison between the fixed-point and floating-point lattice
reduction algorithms using ZF-SIC in a 4 × 4 MIMO system. [Plot:
bit-error-rate (BER) versus Eb/N0; curves for ASLR and FSR-LLL
(floating-point), ASLR (fixed-point), and FSR-LLL (fixed-point).]
LLL with δ = 0.99 has a higher complexity than LLL with
δ = 0.75. Furthermore, the conventional LLL has a much
higher average number of column swaps than FSR-LLL and
ASLR in the higher-dimensional MIMO systems (m ≥ 8).
However, it is not fair to conclude that the complexities of
FSR-LLL and ASLR are much lower than that of the conventional
LLL; in fact, full size reductions are performed in the former
two algorithms, and full size reduction needs more computational
effort than the conventional size reduction in LLL. In
Fig. 9, we compare the number of floating point operations
(flops) in LLL, FSR-LLL, and ASLR using the same settings
as in Fig. 8. The flops are counted in terms of the number of
real additions and real multiplications. One complex addition
is counted as two flops (two real additions) and one complex
multiplication is counted as six flops (four real multiplications
and two real additions). The complexity of QR decomposition
is neglected, since it is done only once at the beginning
of each of the three algorithms. Fig. 9 shows that LLL with δ = 0.99
has the highest complexity among the three. Under the same
δ (= 0.99) setting, FSR-LLL and ASLR have a much lower
computational complexity than LLL. On the other hand, the
complexity of LLL with δ = 0.75 is just slightly higher than
that of FSR-LLL and ASLR, even though the average number of
column swaps of LLL with δ = 0.75 is more than twice
that of ASLR for m ≥ 10. This implies that
the process of full size reduction introduces some additional
complexity. However, thanks to the (insignificantly) weaker
Siegel condition, the complexities of ASLR and FSR-LLL for
m ≥ 10 are less than 50% of the complexity of LLL with the
same δ setting.
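The flop-counting convention stated above is easy to encode. In the sketch below only the two weights (2 and 6) come from the text; the helper `column_update_flops` and its assumption of one complex multiplication and one complex subtraction per vector entry, with a subtraction counted like an addition, are illustrative choices and not the accounting used for Fig. 9.

```python
REAL_FLOPS_PER_COMPLEX_ADD = 2   # two real additions
REAL_FLOPS_PER_COMPLEX_MUL = 6   # four real multiplications and two real additions

def flops(complex_adds, complex_muls):
    """Total real flops for the given numbers of complex operations."""
    return (REAL_FLOPS_PER_COMPLEX_ADD * complex_adds
            + REAL_FLOPS_PER_COMPLEX_MUL * complex_muls)

def column_update_flops(length):
    """Flops for v := v - mu * w on complex vectors of the given length,
    assuming one complex multiplication and one complex subtraction per
    entry (illustrative assumption)."""
    return flops(complex_adds=length, complex_muls=length)
```

Under these assumptions a column update on vectors of length 4 costs 32 flops, which gives a feel for why a full size reduction adds noticeable work per iteration.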
To further explore the advantage of using a systolic array,
we implement our proposed architecture for a 4 × 4 MIMO
system on an FPGA. We created our design using the Xilinx
System Generator 11.5 (XSG) block-set in the Simulink design
environment. Verilog hardware description language
(HDL) code is then generated automatically by XSG and
synthesized by Xilinx XST. Place and route is done by
Xilinx ISE 11.5. The word-lengths of R, QH, T and µ are set
to (18,13), (14,13), (8,0) and (3,0), respectively. As mentioned
TABLE III
FPGA IMPLEMENTATION RESULTS

Algorithm            ASLR          ASLR          FSR-LLL       FSR-LLL       CLLL [14]     CLLL [14]
Device               Virtex 5 (1)  Virtex 6 (2)  Virtex 5 (1)  Virtex 6 (2)  Virtex 4      Virtex 5
Slices               2322/20480    1812/20000    2335/20480    1798/20000    3617/67584    1712/17280
Clock frequency      160 MHz       249 MHz       155 MHz       247 MHz       140 MHz       163 MHz
Avg. cycles (SQRD)   80            80            84            84            130           130
Avg. time (SQRD)     500.0 ns      321.3 ns      541.9 ns      340.1 ns      928.6 ns      797.5 ns
Avg. cycles (QRD)    146           146           164           164           -             -
Avg. time (QRD)      912.5 ns      586.3 ns      1058.1 ns     664.0 ns      -             -

Average cycles and times are per channel matrix.
(1) Part number: XC5VFX130T. (2) Part number: XC6VLX130T.
in Section III-C, the division in the Siegel condition check can be
avoided by using a comparator. The divisions in the Givens
rotation are implemented by the Newton-Raphson iterative
algorithm [37]. As for µ, simulation shows that |µ| is 0, 1, or 2
more than 99.7% of the time. Hence, a set of comparators can
determine the value of µ without a division. Values of |µ|
greater than 2, which occur rarely, are saturated to 2. The BER
performance of the fixed-point systolic implementation for a
4 × 4 MIMO system is shown in Fig. 10, where 16-QAM modulation
and ZF-SIC detection are applied. The implementation results are
shown in Table III. We consider both QRD and SQRD as
pre-processing for the lattice reduction algorithms. The results
show that ASLR is superior to FSR-LLL in average processing
time, and that this advantage is significant when QRD is applied.
The hardware complexity of ASLR and FSR-LLL is about the same,
since the two differ only in their external controllers. It is
also clear that SQRD reduces the average processing time by over
45% compared with the normal QRD, at the cost of a higher
computational effort for SQRD itself.
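The comparator-based choice of µ described above can be illustrated with a short Python sketch. It assumes that the diagonal entry r̃(i,i) is real and positive and that the real and imaginary parts of µ are decided separately; both function names are hypothetical, and the tie-breaking rule (round half away from zero) is a choice made for the example.

```python
def mu_by_comparators(num: float, den: float) -> int:
    """Return round(num / den) saturated to {-2, ..., 2} without dividing.

    Assumes den > 0. Only comparisons against den and 3 * den are used.
    """
    mag = abs(num)
    if 2 * mag < den:            # |num / den| < 0.5
        level = 0
    elif 2 * mag < 3 * den:      # 0.5 <= |num / den| < 1.5
        level = 1
    else:                        # 1.5 or larger: saturate to 2
        level = 2
    return level if num >= 0 else -level


def complex_mu(r_ij: complex, r_ii: float) -> complex:
    """Apply the comparator rule to the real and imaginary parts separately."""
    return complex(mu_by_comparators(r_ij.real, r_ii),
                   mu_by_comparators(r_ij.imag, r_ii))
```

In hardware the factors 2 and 3 amount to shifts and one addition, so the rule needs no divider.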
In Table III, the FPGA implementation result for the conventional
complex LLL (CLLL) [14] is also listed for comparison.
On Virtex 5 with SQRD, the systolic ASLR runs at a slightly
lower clock frequency than CLLL; however, our design requires
only 61.5% of its average clock cycles. As a result, ASLR is on
average faster than CLLL by a factor of 1.6. This verifies the
high-throughput advantage of systolic arrays. On the other hand,
a systolic-array implementation may have higher hardware
complexity, since it requires several processing elements to
work in parallel. The results in Table III show that our designs
occupy 36~38% more FPGA slices than CLLL. However, given the
rapid advance of FPGA technology and semiconductor processing,
one may consider trading some area for a faster processing
speed. As shown in Table III, on the latest Xilinx Virtex 6 FPGA
device our systolic designs run at up to 249 MHz and require
less than 10% of the total FPGA slices.
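The timing figures quoted above follow from Table III, since the average time per channel matrix is the cycle count divided by the clock frequency. The following Python lines redo that arithmetic for the Virtex 5, SQRD case; the variable and function names are illustrative only.

```python
# Values copied from Table III (Virtex 5 devices, SQRD pre-processing).
ASLR_CYCLES, ASLR_MHZ = 80, 160
CLLL_CYCLES, CLLL_MHZ = 130, 163


def time_ns(cycles, clock_mhz):
    """Average processing time in nanoseconds for one channel matrix."""
    return cycles / clock_mhz * 1e3


aslr_ns = time_ns(ASLR_CYCLES, ASLR_MHZ)
clll_ns = time_ns(CLLL_CYCLES, CLLL_MHZ)
cycle_ratio = ASLR_CYCLES / CLLL_CYCLES      # share of CLLL's cycle count
speedup = clll_ns / aslr_ns                  # ratio of processing times
virtex6_utilization = 1812 / 20000           # ASLR slices on Virtex 6

print(f"ASLR {aslr_ns:.1f} ns, CLLL {clll_ns:.1f} ns")
print(f"cycle ratio {cycle_ratio:.1%}, speed-up {speedup:.2f}x")
print(f"Virtex 6 slice utilization {virtex6_utilization:.1%}")
```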
V. SYSTOLIC ARRAY FOR DETECTION METHODS
A. Linear Detection in Systolic Array
After lattice reduction, the matrices $\tilde{\mathbf{Q}}^H$ and $\tilde{\mathbf{R}}$, along with
the unimodular matrix $\mathbf{T}$, are stored in the systolic array. As
shown in Fig. 2, the first step of linear detection consists
of premultiplying the received signal vector $\mathbf{y}$ by $\tilde{\mathbf{H}}^{\dagger}$, which
yields $\hat{\mathbf{x}} = \tilde{\mathbf{H}}^{\dagger}\mathbf{y} = \tilde{\mathbf{R}}^{-1}\tilde{\mathbf{Q}}^H\mathbf{y}$. Second, the result of this
matrix–vector multiplication is rounded element-wise. The
final step is to multiply the rounded result by the unimodular
matrix $\mathbf{T}$ and constrain all results to lie within the constellation
boundary. If $\hat{\mathbf{x}}_q$ denotes the element-wise-rounded $\hat{\mathbf{x}}$, the final
decision of the LRAD is $\hat{\mathbf{x}}_{LR} = Q(\mathbf{T}\cdot\hat{\mathbf{x}}_q)$, as described in
Section II-C.
In the following discussion, we assume a 4 × 4 MIMO
system and consider zero-forcing detection first. The first
and last steps of linear detection can be implemented by the
same systolic array of Fig. 4 without using extra cells. The
rounding and the final constellation boundary check
should be done outside the systolic array (they are not shown
in Fig. 11). To execute $\hat{\mathbf{x}} = \tilde{\mathbf{R}}^{-1}\tilde{\mathbf{Q}}^H\mathbf{y}$ in the systolic array, we
separate it into two matrix–vector multiplications, $\mathbf{v} = \tilde{\mathbf{Q}}^H\mathbf{y}$
followed by $\hat{\mathbf{x}} = \tilde{\mathbf{R}}^{-1}\mathbf{v}$. Since $\tilde{\mathbf{Q}}^H$ stays in the systolic array
after the lattice reduction ends, the received signal vector $\mathbf{y}$ can
be fed to the array from the top in a skewed manner,
as shown in Fig. 11(a). The vector $\tilde{\mathbf{Q}}^H\mathbf{y}$ is pumped out from
the rightmost column of the array. Both diagonal and off-diagonal
cells are needed at this stage, and the operations of the cells
are shown in Fig. 12(a). Every cell performs a multiply-and-add
operation. If MMSE is chosen, the input vector should be
changed to a 2m × 1 vector $\mathbf{y}$ according to the extended
model (5). Let $\mathbf{y} = [\mathbf{y}_1^T\ \mathbf{y}_2^T]^T$ and $\tilde{\mathbf{Q}}^H = [\mathbf{Q}_1\ \mathbf{Q}_2]$, where
$\mathbf{y}_1$, $\mathbf{y}_2$ are m × 1 vectors and $\mathbf{Q}_1$, $\mathbf{Q}_2$ are m × m matrices.
As mentioned in Section IV-B, the elements of $\mathbf{Q}_1$ and $\mathbf{Q}_2$
are stored in the same PEs. To compute $\mathbf{v} = \tilde{\mathbf{Q}}^H\mathbf{y}$ using
the systolic array, we first let $\mathbf{y}_1$ enter the array from the
top and multiply it by $\mathbf{Q}_1$, exactly as shown in
Fig. 11(a). Then $\mathbf{y}_2$ enters the array right after $\mathbf{y}_1$, also in a
skewed manner, and is multiplied by $\mathbf{Q}_2$. Hence, for MMSE
we need an extra operation at the output of the array, namely
the addition $\mathbf{v} = \mathbf{Q}_1\mathbf{y}_1 + \mathbf{Q}_2\mathbf{y}_2$. For the remaining operations in the
systolic array, there is no difference between ZF and MMSE
detection.
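The skewed data flow for the matrix–vector product can be modelled cycle by cycle. The Python sketch below is a simplified illustration, not the authors' design: it assumes one register on each cell output, inputs moving downwards, partial sums moving to the right, and zero padding outside the skewed input; the names `systolic_matvec` and `mmse_front_end` are hypothetical. For brevity the MMSE case is modelled as two separate passes, whereas in the text the second half of the input follows the first without a gap.

```python
def systolic_matvec(W, y):
    """Cycle-by-cycle model of v = W y on a grid of multiply-and-add cells."""
    rows, cols = len(W), len(W[0])
    x_reg = [[0j] * cols for _ in range(rows)]   # values moving downwards
    s_reg = [[0j] * cols for _ in range(rows)]   # partial sums moving rightwards
    v = [0j] * rows
    for t in range(rows + cols - 1):
        new_x = [[0j] * cols for _ in range(rows)]
        new_s = [[0j] * cols for _ in range(rows)]
        for i in range(rows):
            for j in range(cols):
                if i == 0:
                    x_in = y[j] if t == j else 0j    # skew: y[j] enters at cycle j
                else:
                    x_in = x_reg[i - 1][j]           # value from the cell above
                s_in = s_reg[i][j - 1] if j > 0 else 0j
                new_x[i][j] = x_in                   # pass the input on unchanged
                new_s[i][j] = s_in + W[i][j] * x_in  # multiply-and-add
        x_reg, s_reg = new_x, new_s
        i_out = t - (cols - 1)                       # row finishing in this cycle
        if 0 <= i_out < rows:
            v[i_out] = s_reg[i_out][cols - 1]        # read the rightmost column
    return v


def mmse_front_end(Q1, Q2, y1, y2):
    """v = Q1 y1 + Q2 y2: two passes through the same cells plus one adder."""
    v1 = systolic_matvec(Q1, y1)
    v2 = systolic_matvec(Q2, y2)
    return [a + b for a, b in zip(v1, v2)]
```

In this model row i delivers its result at cycle i + cols - 1, so an m × m product completes after 2m - 1 cycles.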
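The second stage consists of computing $\hat{\mathbf{x}} = \tilde{\mathbf{R}}^{-1}\mathbf{v}$.
Instead of computing $\tilde{\mathbf{R}}^{-1}$ directly, the following recursive
equation [38] is considered for the systolic design:

$$\hat{x}_j = \frac{1}{\tilde{r}_{j,j}}\left(v_j - \sum_{i=j+1}^{m}\tilde{r}_{j,i}\,\hat{x}_i\right), \quad j = m, m-1, \ldots, 1. \tag{14}$$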
According to (14), $\tilde{\mathbf{R}}^{-1}\mathbf{v}$ can be computed
directly from the components of $\tilde{\mathbf{R}}$ without computing $\tilde{\mathbf{R}}^{-1}$.
Additionally, it can be implemented by the upper-triangular part
of the systolic array, where the matrix $\tilde{\mathbf{R}}$ has already been stored.
As shown in Fig. 11(b), the vector $\mathbf{v} = \tilde{\mathbf{Q}}^H\mathbf{y}$ enters the array
from the right, and $\hat{\mathbf{x}} = \tilde{\mathbf{R}}^{-1}\mathbf{v}$ is computed by the triangular
array with the cell operations shown in Fig. 12(b). The output
vector $\hat{\mathbf{x}}$ is then rounded element-wise outside the systolic
array. The final step consists of multiplying the quantized
vector $\hat{\mathbf{x}}_q$ by the unimodular matrix $\mathbf{T}$, which is also stored
in the array. As in the first step of linear detection, this
is a matrix–vector multiplication, here between $\mathbf{T}$ and $\hat{\mathbf{x}}_q$. Hence,
the data flow in Fig. 11(c) is the same as that in Fig. 11(a). The cell
operations for $\mathbf{T}\cdot\hat{\mathbf{x}}_q$ are shown in Fig. 12(a), and the array
output, quantized to the closest constellation point, is the
final result $\hat{\mathbf{x}}_{LR}$ of the linear LRAD.

Fig. 11. The linear detection operations in the systolic array. (a) $\mathbf{v} = \tilde{\mathbf{Q}}^H\mathbf{y}$ (b) $\hat{\mathbf{x}} = \tilde{\mathbf{R}}^{-1}\mathbf{v}$ (c) $\hat{\mathbf{x}}_{LR} = Q(\mathbf{T}\cdot\hat{\mathbf{x}}_q)$.

Fig. 12. The detailed operations of the diagonal cells and off-diagonal cells in the systolic array at different stages. (a) $\tilde{\mathbf{Q}}^H\mathbf{y}$ and $\mathbf{T}\cdot\hat{\mathbf{x}}_q$ (b) $\tilde{\mathbf{R}}^{-1}\mathbf{v}$.
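The three steps of linear LRAD described in this subsection can be put together in a few lines of Python. This is a functional sketch under stated assumptions, not a model of the array timing: the symbols are assumed to be already shifted and scaled to Gaussian integers, the constellation boundary is taken to be a box with real and imaginary parts in [lo, hi], and all function names are hypothetical. `back_substitute` follows the recursion in (14), with the inner update playing the role of an off-diagonal cell and the final division that of a diagonal cell.

```python
def matvec(A, x):
    """Plain matrix-vector product on lists of lists."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def back_substitute(R, v):
    """Eq. (14): solve R x = v for upper-triangular R, from the last row up."""
    m = len(v)
    x = [0j] * m
    for j in range(m - 1, -1, -1):
        acc = v[j]
        for i in range(j + 1, m):
            acc -= R[j][i] * x[i]      # off-diagonal cell: multiply and subtract
        x[j] = acc / R[j][j]           # diagonal cell: one division
    return x


def round_gaussian(z):
    """Element-wise rounding (ties follow Python's round)."""
    return complex(round(z.real), round(z.imag))


def clamp_to_box(z, lo, hi):
    """Assumed boundary check: clip real and imaginary parts to [lo, hi]."""
    return complex(min(max(z.real, lo), hi), min(max(z.imag, lo), hi))


def linear_lrad(QH, R, T, y, lo, hi):
    """v = QH y, x = R^{-1} v, round, multiply by T, apply the boundary check."""
    v = matvec(QH, y)
    x_hat = back_substitute(R, v)
    x_q = [round_gaussian(z) for z in x_hat]
    return [clamp_to_box(z, lo, hi) for z in matvec(T, x_q)]
```

Only the first two functions and the product with T correspond to work done inside the array; the rounding and the boundary check are the parts the text places outside it.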
B. Spatial-Interference Cancellation in Systolic Array
Successive spatial-interference cancellation (SIC) can
also be performed on this systolic array with some modifications
to the PEs. From the first step of LR-aided
SIC shown in (7), it should be apparent that $\tilde{\mathbf{Q}}^H\mathbf{y}$ can
be computed in the systolic array in the same fashion as in
Fig. 11(a) and Fig. 12(a). The second step (8) of LR-aided
SIC can be done in the systolic array as shown in Fig. 13. The
operations are almost the same as those in Fig. 12(b), except
that a rounding is required in the off-diagonal cells $O_{ij}$
at the superdiagonal positions (j = i + 1). These rounding
operations produce the decision on each $\hat{z}_i$. As in the linear
LRAD, the final step of LR-aided SIC is to multiply $\hat{\mathbf{z}}$ by the
unimodular matrix $\mathbf{T}$ and bound all outputs within the QAM
constellation. This can be done in the same way as in Fig. 11(c)
and Fig. 12(a), with $\hat{\mathbf{x}}_q$ replaced by $\hat{\mathbf{z}}$.
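The difference between the interference-cancellation step and plain back-substitution is only where the rounding happens. The Python sketch below assumes the usual form of the recursion, in which each decision is rounded before it is fed back; equation (8) itself is not reproduced here, and the function name `lr_sic` is hypothetical. The sketch rounds right after the division, whereas the array in the text places that rounding in the superdiagonal cells.

```python
def lr_sic(R, v):
    """Back-substitution on upper-triangular R with a decision at every step."""
    m = len(v)
    z = [0j] * m
    for j in range(m - 1, -1, -1):
        acc = v[j]
        for i in range(j + 1, m):
            acc -= R[j][i] * z[i]          # cancel already-decided symbols
        soft = acc / R[j][j]               # same division as in linear detection
        z[j] = complex(round(soft.real), round(soft.imag))   # decision fed back
    return z
```

For example, with R = [[1, 0.8], [0, 1]] and v = [0.7, 0.4], the last symbol is decided as 0, so nothing is subtracted from 0.7 and the first symbol is decided as 1; rounding only after the full back-substitution would instead round 0.7 - 0.8 · 0.4 = 0.38 to 0.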
Notice that lattice reduction and linear detection (or SIC) are
performed in the same systolic array, so it is hardware-efficient
to share the adders, multipliers, and dividers designed for
lattice-reduction processing. For instance, linear detection or
SIC, be it ZF or MMSE, requires one addition, one multiplication,
and one division in each diagonal cell, and one addition and one
multiplication in each off-diagonal cell. These operations are
already contained in each cell for the LLL lattice-reduction
stage. For SIC, it seems that extra rounding operations are
needed in the off-diagonal cells at the superdiagonal positions.
However, the same rounding operations are needed in the
off-diagonal cells during the full size reduction processing as
well. Hence, there is no extra hardware cost (adders or
multipliers) in each cell for linear detection or SIC. Only extra
control logic for the array is needed so that each PE works
correctly in its different modes.

Fig. 13. The data flow and the detailed operations of the cells in the systolic array for the interference-cancellation step of LR-aided SIC.
VI. CONCLUSION
In this paper, we have described a systolic array that performs
LLL-based lattice-reduction-aided detection for MIMO
receivers. Lattice reduction and the ensuing linear detection
or successive spatial-interference cancellation can be executed
by the same array, with minimum global access to each
processing element. The proposed systolic array, with an external
logic controller, can work with two different lattice-reduction
algorithms. One is the LLL algorithm with full size reduction
(FSR-LLL), which is a different form of the conventional LLL
algorithm and is more suitable for parallel processing. The
second is an all-swap complex lattice-reduction (ASLR) algorithm,
which generalizes the one originally proposed in [30] for real
lattices. Compared to FSR-LLL, ASLR operates on the whole matrix,
rather than on single columns, during the column-swap
and Givens-rotation steps. To reduce the complexity of data
communications between processing elements in the systolic
array, we replace the Lovász condition in the LLL algorithm by
the Siegel condition. Even though the Siegel condition is weaker
than the Lovász condition, the BER performance of LR-aided linear
detection based on our two algorithm versions appears to be
as good as with the conventional LLL, and the relaxation also
reduces the computational complexity. Based on BER
performance and time-efficiency comparisons, ASLR should
be preferred to FSR-LLL, especially for a MIMO system
with a large number of antennas. The FPGA implementation
results also show that our proposed systolic architecture for
lattice-reduction algorithms runs about 1.6× faster than the
conventional LLL, at the cost of a moderate increase in hardware
complexity. Additionally, owing to the high-throughput
property of systolic arrays, our design appears very promising
for high-data-rate systems, such as MIMO-OFDM systems.
REFERENCES
[1] G. J. Foschini and M. J. Gans, “On limits of wireless communications in
a fading environment when using multiple antennas,” Wireless Personal
Communications, vol. 6, pp. 311–335, 1998.
[2] E. Biglieri, R. Calderbank, A. Constantinides, A. Goldsmith, A. Paulraj,
and H. V. Poor, MIMO Wireless Communications. New York, NY,
USA: Cambridge University Press, 2007.
[3] Z. Guo and P. Nilsson, “A VLSI implementation of MIMO detection for
future wireless communications,” in Proc. IEEE Personal, Indoor and
Mobile Radio Communications, vol. 3, 2003, pp. 2852–2856.
[4] M. Myllyla, J. Hintikka, J. Cavallaro, M. Juntti, M. Limingoja, and
A. Byman, “Complexity analysis of MMSE detector architectures for
MIMO OFDM systems,” in Proc. the Thirty-Ninth Asilomar Conference
on Signals, Systems and Computers, 2005, pp. 75–81.
[5] M. Karkooti, J. Cavallaro, and C. Dick, “FPGA implementation of ma-
trix inversion using QRD-RLS algorithm,” in Proc. Asilomar Conference
on Signals, Systems and Computers, 2005, pp. 1625–1629.
[6] H. Yao and G. Wornell, “Lattice-reduction-aided detectors for MIMO
communication systems,” in IEEE Global Telecommunications Confer-
ence, GLOBECOM, vol. 1, 2002, pp. 424–428.
[7] D. Seethaler, G. Matz, and F. Hlawatsch, “Low-complexity MIMO data
detection using Seysen’s lattice reduction algorithm,” in Proc. IEEE
International Conference on Acoustics, Speech and Signal Processing,
ICASSP, vol. 3, 2007, pp. III–53–III–56.
[8] D. Wübben, R. Böhnke, V. Kühn, and K.-D. Kammeyer, “Near-
maximum-likelihood detection of MIMO systems using MMSE-based
lattice reduction,” in Proc. IEEE International Conference on Commu-
nications, vol. 2, 2004, pp. 798–802.
[9] A. K. Lenstra, H. W. Lenstra, and L. Lovász, “Factoring polynomials
with rational coefficients,” Mathematische Annalen, vol. 261, no. 4, pp.
515–534, 1982.
[10] Y. H. Gan, C. Ling, and W. H. Mow, “Complex lattice reduction
algorithm for low-complexity full-diversity MIMO detection,” IEEE
Trans. on Signal Processing, vol. 57, no. 7, pp. 2701–2710, 2009.
[11] X. Ma and W. Zhang, “Performance analysis for MIMO systems with
lattice-reduction aided linear equalization,” IEEE Trans. on Communi-
cations, vol. 56, no. 2, pp. 309–318, 2008.
[12] M. Taherzadeh, A. Mobasher, and A. Khandani, “LLL reduction
achieves the receive diversity in MIMO decoding,” IEEE Trans. on
Inform. Theory, vol. 53, no. 12, pp. 4801–4805, 2007.
[13] J. Jaldén, D. Seethaler, and G. Matz, “Worst- and average-case complex-
ity of LLL lattice reduction in MIMO wireless systems,” in Proc. IEEE
International Conference on Acoustics, Speech and Signal Processing,
ICASSP, 2008, pp. 2685–2688.
[14] B. Gestner, W. Zhang, X. Ma, and D. Anderson, “VLSI implementation
of a lattice reduction algorithm for low-complexity equalization,”
in Proc. IEEE International Conference on Circuits and Systems for
Communications, ICCSC, 2008, pp. 643–647.
[15] C. P. Schnorr and M. Euchner, “Lattice basis reduction: Improved
practical algorithms and solving subset sum problems,” Mathematical
Programming, vol. 66, no. 1-3, pp. 181–199, 1994.
[16] J. Jaldén and P. Elia, “DMT optimality of LR-Aided linear decoders for
a general class of channels, lattice designs, and system models,” IEEE
Trans. on Information Theory, vol. 56, no. 10, pp. 4765–4780, 2010.
[17] H. Vetter, V. Ponnampalam, M. Sandell, and P. Hoeher, “Fixed complex-
ity LLL algorithm,” IEEE Trans. on Signal Processing, vol. 57, no. 4,
pp. 1634–1637, 2009.
[18] H. T. Kung and C. E. Leiserson, “Algorithms for VLSI processor arrays,”
in Introduction to VLSI Systems. Addison-Wesley, 1980, p. 271.
[19] S. Y. Kung, “VLSI array processors,” IEEE ASSP Magazine, vol. 2,
no. 3, pp. 4–22, 1985.
[20] W. M. Gentleman and H. T. Kung, “Matrix triangulation by systolic
arrays,” in Proc. of SPIE: Real-time Signal Processing IV, vol. 298,
1981, pp. 19–26.
[21] A. El-Amawy and K. Dharmarajan, “Parallel VLSI algorithm for stable
inversion of dense matrices,” IEE Proc. Computers and Digital Tech-
niques, vol. 136, no. 6, pp. 575–580, 1989.
[22] C. Rader, “VLSI systolic arrays for adaptive nulling,” IEEE Signal
Processing Magazine, vol. 13, no. 4, pp. 29–49, 1996.
[23] K. Liu, S.-F. Hsieh, K. Yao, and C.-T. Chiu, “Dynamic range, stability,
and fault-tolerant capability of finite-precision RLS systolic array based
on Givens rotations,” IEEE Trans. on Circuits and Systems, vol. 38, no. 6,
pp. 625–636, 1991.
[24] D. Boppana, K. Dhanoa, and J. Kempa, “FPGA based embedded pro-
cessing architecture for the QRD-RLS algorithm,” in Proc. IEEE Sym-
posium on Field-Programmable Custom Computing Machines, vol. 0,
2004, pp. 330–331.
[25] K. Yao and F. Lorenzelli, “Systolic algorithms and architectures for
High-Throughput processing applications,” Journal of Signal Processing
Systems, vol. 53, no. 1-2, pp. 15–34, 2008.
[26] J. Wang and B. Daneshrad, “A universal systolic array for linear MIMO
detections,” in Proc. IEEE Wireless Communications and Networking
Conference, 2008, pp. 147–152.
[27] K. Seki, T. Kobori, J. Okello, and M. Ikekawa, “A CORDIC-based
reconfigurable systolic array processor for MIMO-OFDM wireless com-
munications,” in IEEE Workshop on Signal Processing Systems, 2007,
pp. 639–644.
[28] Y. Hu, “CORDIC-based VLSI architectures for digital signal process-
ing,” IEEE Signal Processing Magazine, vol. 9, no. 3, pp. 16–35, 1992.
[29] B. Cerato, G. Masera, and P. Nilsson, “Hardware architecture for matrix
factorization in MIMO receivers,” in Proc. ACM Great Lakes symposium
on VLSI, New York, NY, USA, 2007, p. 19699.
[30] C. Heckler and L. Thiele, “A parallel lattice basis reduction for mesh-
connected processor arrays and parallel complexity,” in Proc. IEEE
Symposium on Parallel and Distributed Processing, 1993, pp. 400–407.
[31] J. W. S. Cassels, Rational quadratic forms. London; New York:
Academic Press, 1978.
[32] B. Hassibi, “An efficient square-root algorithm for BLAST,” in Proc.
IEEE International Conference on Acoustics, Speech, and Signal Pro-
cessing, vol. 2, 2000, pp. II737–II740.
[33] E. Agrell, T. Eriksson, A. Vardy, and K. Zeger, “Closest point search in
lattices,” IEEE Trans. on Inform. Theory, vol. 48, no. 8, pp. 2201–2214,
2002.
[34] L. Babai, “On Lovász’ lattice reduction and the nearest lattice point
problem,” Combinatorica, vol. 6, no. 1, pp. 1–13, 1986.
[35] R. Döhler, “Squared Givens rotation,” IMA Journal of Numerical Anal-
ysis, vol. 11, no. 1, pp. 1–5, Jan. 1991.
[36] P. Luethi, A. Burg, S. Haene, D. Perels, N. Felber, and W. Fichtner,
“VLSI implementation of a high-speed iterative sorted MMSE QR
decomposition,” in Proc. IEEE International Symposium on Circuits and
Systems, 2007, pp. 1421–1424.
[37] C. V. Ramamoorthy, J. R. Goodman, and K. H. Kim, “Some properties
of iterative square-rooting methods using high-speed multiplication,”
IEEE Trans. on Computers, vol. C-21, no. 8, pp. 837–847, 1972.
[38] F. Lorenzelli, P. Hansen, T. Chan, and K. Yao, “A systolic implemen-
tation of the Chan/Foster RRQR algorithm,” IEEE Trans. on Signal
Processing, vol. 42, no. 8, pp. 2205–2208, 1994.
