High performance lattice reduction on heterogeneous computing platform by Jozsa, Csaba M et al.
Document downloaded from: http://hdl.handle.net/10251/49342
This paper must be cited as: Jozsa, CM.; Domene Oltra, F.; Vidal Maciá, AM.; Piñero Sipán, MG.; González Salvador, A. (2014). High performance lattice reduction on heterogeneous computing platform. Journal of Supercomputing. 70(2):772-785. doi:10.1007/s11227-014-1201-2.
The final publication is available at http://dx.doi.org/10.1007/s11227-014-1201-2
Copyright Springer Verlag (Germany)
High Performance Lattice Reduction on
Heterogeneous Computing Platform
Csaba M. Józsa · Fernando Domene ·
Antonio M. Vidal · Gema Piñero ·
Alberto González
Abstract The lattice reduction (LR) technique has become very important in
many engineering fields. However, its high complexity makes it difficult to use in
real-time applications, especially in applications that deal with large matrices. As
a solution, the Modified Block LLL (MB-LLL) algorithm was introduced in [10],
where several levels of parallelism were exploited: (i.) coarse-grained parallelism
was achieved by applying the block-reduction concept presented in [15] and (ii.)
fine-grained parallelism was achieved through the Cost Reduced All-Swap LLL
(CR-AS-LLL) algorithm introduced in [10].
In this paper, we present the Cost Reduced MB-LLL (CR-MB-LLL) algorithm,
which significantly reduces the computational complexity of the MB-LLL in two
ways: by relaxing the first LLL condition while executing the LR of submatrices,
which delays the update of the Gram-Schmidt (GS) coefficients, and by using less
costly procedures during the boundary checks. The effects of complexity reduction
and implementation details are analyzed and discussed for several architectures.
A mapping of the CR-MB-LLL on a heterogeneous platform is proposed and
compared with implementations running on a dynamic parallelism enabled
GPU and a multi-core CPU. The proposed mapping allows a dynamic scheduling
of kernels where the overhead introduced is hidden by the use of several CUDA
streams. Results show that the CR-MB-LLL algorithm on the heterogeneous
platform runs faster than on the multi-core CPU and that it is more efficient
than the CR-AS-LLL algorithm in case of large matrices.
Keywords Lattice Reduction · LLL · GPU · CUDA · OpenMP
Csaba M. Józsa
Faculty of Information Technology, Pázmány Péter Catholic University, Hungary
E-mail: jozsa.csaba@itk.ppke.hu
Fernando Domene, Gema Piñero, Alberto González
Inst. Telecommunications and Multimedia Applications (iTEAM), Universitat Politècnica de
València, Spain
E-mail: {ferdool,gpinyero,agonzal}@iteam.upv.es
Antonio M. Vidal
Dept. of Information Systems and Computation, Universitat Politècnica de València, Spain
E-mail: avidal@dsic.upv.es
1 Introduction
The application of lattice reduction (LR) as a preconditioner of various signal
processing algorithms plays a key role in several fields: communications,
cryptography, image processing, etc. Given a basis, LR consists of finding another basis
whose vectors are more orthogonal and shorter, in the sense of the Euclidean norm,
than the original ones. The Minkowski and Hermite-Korkine-Zolotareff reductions
are the techniques that obtain the best performance in terms of reduction, but also
the ones with the highest computational cost. Both techniques require the calculation
of the shortest lattice vector, which has been proved to be NP-hard (see [17] and
references therein).
In order to reduce the computational complexity of LR techniques, a polynomial
time algorithm was proposed by Lenstra, Lenstra and Lovász, known as
the LLL algorithm [11]. This algorithm can be seen as a relaxation of the
Hermite-Korkine-Zolotareff conditions [4] or an extension of Gauss reduction [17]. It
obtains the reduced basis by applying two different operations to the original basis:
size-reduction (linear combination between columns) and column swap. Although
further reduction techniques have been proposed since, the LLL algorithm is
the most widely used due to its good trade-off between performance and computational
complexity.
Regarding the hardware implementation of the LLL algorithm, several solutions
can be found in the literature. Implementations that make use of LR to
improve the detection performance of multiple antenna systems can be found
in [3, 8, 13, 16]. In [16], an LR-aided symbol detector for multiple-input
multiple-output (MIMO) and orthogonal frequency division multiple access (OFDMA) is
implemented using 65 nm ASIC technology. A field-programmable gate array
(FPGA) implementation of a variant of the LLL algorithm, Clarkson's algorithm,
is presented in [3]; its main benefit is a reduction of the computational complexity
without significant performance loss in MIMO detection. More recently, [8]
makes use of a Xilinx XC4VLX80-12 FPGA for implementing LR-aided detectors,
whereas [13] uses an efficient VLSI design based on a pipelined architecture.
In [10] low level, fine-grained parallelism was implemented by the CR-AS-LLL
algorithm, where the low processing time is assured by an efficient work distri-
bution, minimizing the idle time of the launched threads. Based on the parallel
block-reduction concept presented in [15], a higher level, coarse-grained parallelism
can be applied as an extra level of parallelism. The idea is to subdivide the original
lattice basis matrix into several smaller submatrices and perform an independent LR
on them followed by a boundary check between adjacent submatrices. The MB-
LLL algorithm proposed in [10] implements a parallel processing of the submatrices
by using the parallel CR-AS-LLL algorithm for the LR of every submatrix.
The implementations of the previous references make use of only one architec-
ture to calculate the LR of a basis. However, a better performance can be obtained
by combining different architectures, which is known as heterogeneous computing.
Among the different combinations, the use of CPU and GPU is probably the most
popular since both can be found in most computers.
In this paper, we present the Cost Reduced MB-LLL (CR-MB-LLL) algorithm
in order to further reduce the computational complexity of the MB-LLL
algorithm [10]. The main ideas behind the CR-MB-LLL algorithm are the relaxation
of the first LLL condition while executing the LR for the submatrices, which
delays the update of the Gram-Schmidt (GS) coefficients, and the use of less
costly procedures when performing the boundary checks. The effects of this
complexity reduction are evaluated on different architectures.
A mapping of the CR-MB-LLL algorithm on a heterogeneous platform con-
sisting of a CPU and a GPU is proposed and it is compared with implementations
running on a GPU with dynamic parallelism (DP) capability and a multi-core
CPU architecture. The proposed architecture allows a dynamic scheduling of ker-
nels where the overhead introduced by host-device communication is hidden by
the use of CUDA streams. Results show that the CR-MB-LLL algorithm executed
on the heterogeneous platform outperforms the DP-based GPU and multi-core
implementations.
Mapping the algorithm to different parallel architectures is very challenging,
since the number of processing cores, the latency and size of the different memory
hierarchies and the available cache all differ. The mapping details of the CR-AS-LLL
and CR-MB-LLL algorithms to different parallel architectures are also presented
with a special emphasis on the work distribution among the threads and on
efficient memory utilization.
The paper is organized as follows. In Section 2 a brief introduction to LR is
given. The architecture dependent mapping details of the CR-AS-LLL and CR-
MB-LLL are presented in Section 3, followed by their performance evaluations on
different platforms in Section 4. Finally, conclusions are stated in Section 5.
2 Problem description
A real-valued lattice $L = \left\{ \sum_{i=1}^{n} x_i b_i \;\middle|\; x_i \in \mathbb{Z},\ i = 1, \dots, n \right\}$ is a discrete additive
subgroup of $\mathbb{R}^n$, where $b_1, b_2, \dots, b_n \in \mathbb{R}^n$ are linearly independent vectors and $\mathbb{Z}$
denotes the set of integers. Let matrix $B = (b_1, \dots, b_n) \in \mathbb{R}^{n \times n}$ denote the full
(column) rank basis of the lattice. Let $B^* = (b_1^*, \dots, b_n^*) \in \mathbb{R}^{n \times n}$ denote the
associated orthogonal basis of $B$, calculated by the Gram-Schmidt (GS) orthogonalization
process as $b_1^* = b_1$ and $b_i^* = b_i - \sum_{j=1}^{i-1} \mu_{i,j} b_j^*$ for $2 \le i \le n$, where
$\mu_{i,j} = (b_i, b_j^*)/(b_j^*, b_j^*)$ for $1 \le j < i \le n$ are called the GS coefficients and $(\cdot, \cdot)$
denotes the ordinary dot product on $\mathbb{R}^n$. Thus, matrix $B$ can be expressed as
$B = B^* \cdot U$, where matrix $U$ is upper triangular with unit diagonal, and whose
element $(i, j)$ above the diagonal is given by the GS coefficient $\mu_{j,i}$.
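Definition 1 Given a lattice $L \subset \mathbb{R}^n$ with basis $B \in \mathbb{R}^{n \times n}$, associated orthogonal
basis $B^* \in \mathbb{R}^{n \times n}$, and GS coefficients $\mu_{i,j}$, $B$ is called LLL-reduced if the following
conditions are satisfied:
$|\mu_{i,j}| \le \frac{1}{2}$ for $1 \le j < i \le n$ (1)
$\|b_i^* + \mu_{i,i-1} b_{i-1}^*\|^2 \ge \delta \|b_{i-1}^*\|^2$ for $1 < i \le n$, $\frac{3}{4} \le \delta < 1$ (2)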
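The definitions above can be made concrete with a small sketch. The Python code below is only an illustration, not the implementation evaluated in this paper: it computes the GS vectors and coefficients with exact rational arithmetic and then tests conditions (1) and (2), the latter in its squared-norm form $\|b_i^*\|^2 \ge (\delta - \mu_{i,i-1}^2)\|b_{i-1}^*\|^2$. The function names `dot`, `gram_schmidt` and `is_lll_reduced` are assumptions made for this example, and the basis is passed as a list of column vectors.

```python
from fractions import Fraction


def dot(u, v):
    """Ordinary dot product, evaluated exactly."""
    return sum(Fraction(a) * Fraction(b) for a, b in zip(u, v))


def gram_schmidt(basis):
    """Return (b_star, mu) for a list of linearly independent basis vectors."""
    n = len(basis)
    b_star = []
    mu = [[Fraction(0)] * n for _ in range(n)]
    for i in range(n):
        v = [Fraction(x) for x in basis[i]]
        for j in range(i):
            # mu[i][j] = (b_i, b*_j) / (b*_j, b*_j)
            mu[i][j] = dot(basis[i], b_star[j]) / dot(b_star[j], b_star[j])
            v = [a - mu[i][j] * b for a, b in zip(v, b_star[j])]
        b_star.append(v)
    return b_star, mu


def is_lll_reduced(basis, delta=Fraction(3, 4)):
    """Check conditions (1) and (2) of Definition 1 (0-based indices)."""
    b_star, mu = gram_schmidt(basis)
    n = len(basis)
    # Condition (1): every GS coefficient is at most 1/2 in absolute value.
    for i in range(n):
        for j in range(i):
            if abs(mu[i][j]) > Fraction(1, 2):
                return False
    # Condition (2), squared-norm form.
    for i in range(1, n):
        lhs = dot(b_star[i], b_star[i])
        rhs = (delta - mu[i][i - 1] ** 2) * dot(b_star[i - 1], b_star[i - 1])
        if lhs < rhs:
            return False
    return True
```

For instance, the basis with columns (1, 0) and (3, 1) violates condition (1) because $\mu_{2,1} = 3$, while the basis with columns (2, 0) and (0, 1) satisfies condition (1) but violates condition (2) for $\delta = 3/4$.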
3 Parallel Lattice-Reduction algorithms and their mapping to parallel
architectures
Since the LLL algorithm shows a highly sequential behavior, multiple levels of
parallelism have to be identified and exploited in order to efficiently parallelize
this algorithm. Dividing the problem into several sub-problems that can be
executed concurrently can be regarded as one level of parallelism. In addition, if a
sub-problem can benefit from a multi-threaded environment, this can be regarded
as a second level of parallelism. Previous parallel LR implementations, such as the
ones presented in [1, 2, 12], have focused only on multi-core architectures. The
main drawback of the low number of threads offered by modern CPUs (compared
to GPUs) is that low level parallelism cannot be efficiently exploited. As a result,
low level parallelism is usually omitted during algorithm design and the number of
parallelism levels is restricted. In case of GPUs, the high number of CUDA cores makes
possible the parallel execution of a high number of threads, leading to significant
performance improvements.
3.1 Mapping details of the CR-AS-LLL algorithm
Basically, LR consists of a succession of swaps between vectors of the basis and
some operations to decrease their norms. The order in which the swaps are applied
in the LLL algorithm is limiting in a parallel framework. Villard in [14] introduced
the any swap reduction concept, that enables simultaneous basis swaps and served
as a basis for future parallel implementations. In [12], the concept of delaying the
size reductions was introduced. In the CR-AS-LLL algorithm, further computa-
tional cost is saved by rearranging and delaying the frequently used size reduction
procedure.
Procedures SimpleSizeReduce, SimpleSwap and Swap are defined in order
to give an accurate description of the CR-AS-LLL algorithm.
Procedure 1 (SimpleSizeReduce($B_i$, $k$, $l$)) Given a lattice generator matrix $B$
and the associated GS coefficients matrix $U$, if the condition (1) is not satisfied,
i.e. $|\mu_{k,l}| > \frac{1}{2}$, the following updates have to be applied:
• $\mu = \lceil \mu_{k,l} \rfloor$, $\mu_{k,l} = \mu_{k,l} - \mu$, $b_k = b_k - \mu b_l$.
Procedure 2 (SimpleSwap($B_i$, $k$)) Given a lattice generator matrix $B$, the
associated orthogonal basis $B^*$ and GS coefficients matrix $U$, if the condition (2) is
not satisfied, or equivalently $\|b_k^*\|^2 < (\delta - \mu_{k,k-1}^2)\|b_{k-1}^*\|^2$, the following updates
have to be applied:
• swap $b_k$ with $b_{k-1}$
• $b_{k-1}^{*p} = b_k^* + \mu_{k,k-1} b_{k-1}^*$, $\mu_{k,k-1}^{p} = (b_{k-1}^*, b_{k-1}^{*p})/\|b_{k-1}^{*p}\|^2$,
$b_k^{*p} = b_{k-1}^* - \mu_{k,k-1}^{p} b_{k-1}^{*p}$
• $b_{k-1}^* = b_{k-1}^{*p}$, $b_k^* = b_k^{*p}$, $\mu_{k,k-1} = \mu_{k,k-1}^{p}$
Procedure 3 (Swap($B_i$, $k$)) Given a lattice generator matrix $B$, the associated
orthogonal basis $B^*$ and GS coefficients matrix $U$, if the condition (2) is not
satisfied, or equivalently $\|b_k^*\|^2 < (\delta - \mu_{k,k-1}^2)\|b_{k-1}^*\|^2$, the following updates have
to be applied:
• perform SimpleSwap($k$)
• swap $\mu_{k,j}$ with $\mu_{k-1,j}$, for $1 \le j < k - 1$,
• $\begin{pmatrix} \mu_{i,k-1} \\ \mu_{i,k} \end{pmatrix} = \begin{pmatrix} \mu_{i,k-1}\,\mu_{k,k-1}^{p} + \mu_{i,k}\,\|b_k^*\|^2/\|b_{k-1}^{*p}\|^2 \\ \mu_{i,k-1} - \mu_{i,k}\,\mu_{k,k-1} \end{pmatrix}$ for $k + 1 \le i < n$.
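To illustrate Procedures 1 and 2, the following Python sketch applies their update rules to small exact-arithmetic data structures. It is a minimal illustration under stated assumptions, not the authors' code: the helper names are hypothetical, indices are 0-based, and `simple_swap` applies its updates unconditionally, leaving the test of condition (2) to the caller. As in Procedure 1, `simple_size_reduce` updates only $\mu_{k,l}$; the remaining coefficients of row $k$ are left for a later update.

```python
from fractions import Fraction


def dot(u, v):
    return sum(Fraction(a) * Fraction(b) for a, b in zip(u, v))


def gram_schmidt(basis):
    """Reference GS orthogonalization, used here only to check the updates."""
    n = len(basis)
    b_star, mu = [], [[Fraction(0)] * n for _ in range(n)]
    for i in range(n):
        v = [Fraction(x) for x in basis[i]]
        for j in range(i):
            mu[i][j] = dot(basis[i], b_star[j]) / dot(b_star[j], b_star[j])
            v = [a - mu[i][j] * b for a, b in zip(v, b_star[j])]
        b_star.append(v)
    return b_star, mu


def simple_size_reduce(basis, mu, k, l):
    """Procedure 1: reduce b_k by an integer multiple of b_l if |mu[k][l]| > 1/2."""
    if abs(mu[k][l]) > Fraction(1, 2):
        r = round(mu[k][l])          # nearest integer to mu[k][l]
        mu[k][l] -= r
        basis[k] = [a - r * b for a, b in zip(basis[k], basis[l])]


def simple_swap(basis, b_star, mu, k):
    """Procedure 2: swap b_k and b_(k-1) and patch b*_(k-1), b*_k, mu[k][k-1]."""
    basis[k], basis[k - 1] = basis[k - 1], basis[k]
    m = mu[k][k - 1]
    bp_prev = [a + m * b for a, b in zip(b_star[k], b_star[k - 1])]
    mu_p = dot(b_star[k - 1], bp_prev) / dot(bp_prev, bp_prev)
    bp_k = [a - mu_p * b for a, b in zip(b_star[k - 1], bp_prev)]
    b_star[k - 1], b_star[k], mu[k][k - 1] = bp_prev, bp_k, mu_p
```

A simple way to gain confidence in the update rules is to compare the patched `b_star[k - 1]`, `b_star[k]` and `mu[k][k - 1]` with a fresh call to `gram_schmidt` on the swapped basis; the three quantities should agree exactly.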
Algorithm 1 CR-AS-LLL OpenMP pseudocode
Input: $[B_1, B_2, \dots, B_m]$, $[B_1^*, B_2^*, \dots, B_m^*]$, $[U_1, U_2, \dots, U_m]$, $\delta$
Output: $[B_1, B_2, \dots, B_m]$ as LLL reduced basis
maxT ← set the maximum number of available OpenMP threads
simMat ← set the number of matrices processed simultaneously
TPM = maxT/simMat ▷ The number of threads for parallel processing one matrix
MPG = m/simMat ▷ The number of matrices processed by one group of threads
#pragma omp parallel num_threads(simMat) {
    grp ← set current thread id
    odd = true, even = true, off = 0, i = grp · MPG
    #pragma omp parallel num_threads(TPM) shared(odd, even, off) firstprivate(grp) {
        while (i < (grp + 1) · MPG) do
            while (odd or even) do
                #pragma omp single {
                    if off == 0 then odd = false, off = 1 else even = false, off = 0 end if
                }
                #pragma omp for reduction(||:odd,even)
                for k = 2 + off to n step 2 ▷ Embarrassingly parallel for all k
                    Update GS coefficient $\mu_{k,k-1}$ and SimpleSizeReduce($B_i$, $k$, $k - 1$)
                    if $\|b_k^*\|^2 < (\delta - \mu_{k,k-1}^2)\|b_{k-1}^*\|^2$ then
                        Perform SimpleSwap($B_i$, $k$)
                        if (off == 0) then even = true else odd = true end if
                    end if
                end for
            end while
            #pragma omp barrier
            Update all of the GS coefficients of $B_i$ and perform SimpleSizeReduce if necessary ▷ Highly parallel
            #pragma omp single {i ← i + 1, odd = true, even = true}
        end while
    }
}
In Alg. 1, the OpenMP implementation of the CR-AS-LLL is presented. Two-
level parallelism is implemented based on a nested parallelism construct. The outer
level parallelism starts the concurrent processing of simMat lattice
bases and the inner parallel construct is responsible for the parallel LR of a basis
with TPM number of threads. As the outer level parallelism is expanded, namely
simMat is increased, the threads available for the parallel LR are decreased.
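The control flow of the inner loop of Alg. 1 can be illustrated with a sequential Python emulation. This is a sketch under stated assumptions rather than the OpenMP code itself: the loop over `k` inside one phase is executed serially here, although its iterations touch disjoint column pairs and are therefore the part that the `omp for` construct distributes among threads. The names `all_swap_lll` and `is_lll_reduced` are hypothetical, indices are 0-based, and exact rational arithmetic replaces the floating point arithmetic of a real implementation.

```python
from fractions import Fraction

HALF = Fraction(1, 2)


def dot(u, v):
    return sum(Fraction(a) * Fraction(b) for a, b in zip(u, v))


def gram_schmidt(basis):
    n = len(basis)
    b_star, mu = [], [[Fraction(0)] * n for _ in range(n)]
    for i in range(n):
        v = [Fraction(x) for x in basis[i]]
        for j in range(i):
            mu[i][j] = dot(basis[i], b_star[j]) / dot(b_star[j], b_star[j])
            v = [a - mu[i][j] * b for a, b in zip(v, b_star[j])]
        b_star.append(v)
    return b_star, mu


def all_swap_lll(basis, delta=Fraction(3, 4)):
    """Alternate the two parity phases until neither of them performs a swap."""
    basis = [list(v) for v in basis]
    n = len(basis)
    b_star, _ = gram_schmidt(basis)
    swapped = [True, True]            # plays the role of the odd/even flags
    off = 0
    while swapped[0] or swapped[1]:
        swapped[off] = False
        for k in range(1 + off, n, 2):    # independent pairs (k - 1, k)
            prev = dot(b_star[k - 1], b_star[k - 1])
            m = dot(basis[k], b_star[k - 1]) / prev      # update mu_{k,k-1}
            if abs(m) > HALF:                            # SimpleSizeReduce
                r = round(m)
                m -= r
                basis[k] = [a - r * b for a, b in zip(basis[k], basis[k - 1])]
            if dot(b_star[k], b_star[k]) < (delta - m * m) * prev:
                # SimpleSwap: exchange the columns and patch the GS vectors.
                basis[k - 1], basis[k] = basis[k], basis[k - 1]
                new_prev = [a + m * b for a, b in zip(b_star[k], b_star[k - 1])]
                mu_p = dot(b_star[k - 1], new_prev) / dot(new_prev, new_prev)
                b_star[k] = [a - mu_p * b for a, b in zip(b_star[k - 1], new_prev)]
                b_star[k - 1] = new_prev
                swapped[off] = True
        off ^= 1
    # Delayed step: size-reduce against all earlier columns.
    for k in range(1, n):
        for j in range(k - 1, -1, -1):
            m = dot(basis[k], b_star[j]) / dot(b_star[j], b_star[j])
            if abs(m) > HALF:
                r = round(m)
                basis[k] = [a - r * b for a, b in zip(basis[k], basis[j])]
    return basis


def is_lll_reduced(basis, delta=Fraction(3, 4)):
    b_star, mu = gram_schmidt(basis)
    n = len(basis)
    ok1 = all(abs(mu[i][j]) <= HALF for i in range(n) for j in range(i))
    ok2 = all(dot(b_star[i], b_star[i])
              >= (delta - mu[i][i - 1] ** 2) * dot(b_star[i - 1], b_star[i - 1])
              for i in range(1, n))
    return ok1 and ok2
```

Note how the full size reduction is postponed until the swap phases have converged, which mirrors the delayed update of the GS coefficients in Alg. 1.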
When mapped to the GPU, the performance of the CR-AS-LLL algorithm
depends on the efficiency of the work distribution among the available GPU threads
and the implementation of the most frequently used operations, such as dot
products, size reductions and column swaps. In Alg. 2, the CUDA pseudo-code is
presented with a two dimensional thread block TB(Tx, Ty) configuration, where Tx
and Ty denote the number of threads in the x and y dimension. The kernel is
launched with a one dimensional grid whose size is determined by the number
of bases processed simultaneously. The number of threads Ty is defined based on
the size of the original basis, i.e. Ty = max (n/2, 32). By enabling the usage of
Tx = max (n, 32) threads in the x dimension, the threads that share the same
y index form a warp, and consequently the global memory loads and stores
issued by the threads of the warp are coalesced. The y dimension also defines
the extent of parallelism. The iteration variable of the for loop is increased in every
iteration by Ty · 2. In other words, in every phase the threads with the same idy
have to reduce and swap at most n/(Ty · 2) vectors.
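The strided assignment of columns to thread rows can be checked with a few lines of Python. The sketch below only reproduces the loop bounds of Alg. 2 (start at idy · 2 + 1 + off, step Ty · 2); it assumes 0-based column indices with an exclusive upper bound n, and the function names are hypothetical.

```python
import math


def columns_for_row(idy, ty, n, off):
    """Column indices k handled in one phase by the threads whose y index is idy."""
    return list(range(idy * 2 + 1 + off, n, ty * 2))


def phase_assignment(ty, n, off):
    """Map every y index of the thread block to its list of columns."""
    return {idy: columns_for_row(idy, ty, n, off) for idy in range(ty)}


if __name__ == "__main__":
    n, ty = 16, 2
    for off in (0, 1):
        assignment = phase_assignment(ty, n, off)
        # No row handles more than ceil(n / (2 * Ty)) columns in a phase.
        assert max(len(cols) for cols in assignment.values()) <= math.ceil(n / (2 * ty))
        print(off, assignment)
```

Within one phase the rows receive disjoint columns of the same parity, so the pairs (k − 1, k) they work on never overlap.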
Algorithm 2 CR-AS-LLL CUDA kernel pseudocode - processing of one lattice
basis $B_i$ with 2 dimensional thread block configuration TB(Tx, Ty)
Input: $B_i$, $B_i^*$, $U$, $\delta$ and thread identifiers idx, idy
Output: $B_i$ as an LLL reduced basis
Definition of shared arrays buf1[Ty][Tx], buf2[Ty][Tx], µ[Ty]
Definition of shared variables odd = true, even = true and private variable off
Copy the elements above the diagonal from U to shared array U\[n − 1]
while odd or even do ▷ Tx · Ty threads are working on the while loop
    off = (off + 1) mod 2
    for k = idy ∗ 2 + 1 + off to n step k += Ty ∗ 2 do
        Call DotProduct($b_{k-1}^*$, $b_{k-1}^*$, buf1[idy][]), DotProduct($b_k$, $b_{k-1}^*$, buf2[idy][])
        Threads with (idx == 0) set U\[k − 1] = buf2[idy][0]/buf1[idy][0]
        if |U\[k − 1]| > 0.5 then ▷ Check reduction criteria
            Threads with (idx == 0) set µ[idy] = ⌈U\[k − 1]⌋
            Call SimpleSizeReduce($b_k$, $b_k$, $b_{k-1}$, µ[idy])
        end if
        Call DotProduct($b_k^*$, $b_k^*$, buf2[idy][])
        if buf2[idy][0] < ($\delta$ − U\[k − 1]$^2$) · buf1[idy][0] then
            Call SimpleSwap($b_k$, $b_{k-1}$, buf1[idy][])
            Call SimpleSizeReduce($b^{*p}$, $b_k^*$, $b_{k-1}^*$, U\[k − 1])
            Call DotProduct($b^{*p}$, $b^{*p}$, buf1[idy][]), DotProduct($b_{k-1}^*$, $b^{*p}$, buf2[idy][])
            Threads with (idx == 0) set U\[k − 1] = buf2[idy][0]/buf1[idy][0] and set odd
                or even to true depending on the off variable
            Call SimpleSizeReduce($b_k^*$, $b_{k-1}^*$, $b^{*p}$, U\[k − 1]) and update $b_{k-1}^* = b^{*p}$
        end if
    end for
    Synchronize threads
end while
Copy the U\ to the elements of U right above the diagonal
Update the rest of GS coefficients based on the procedures and methods presented above

procedure DotProduct(v1, v2, buf[Tx]) ▷ The result is stored in buf at index 0
    buf[idx] = 0
    for i = idx to n step i += Tx do buf[idx] += v1$_i$ · v2$_i$ end for
    for stride = Tx/2 to stride > 0 step stride >>= 1 do
        if idx < stride then buf[idx] += buf[stride + idx] end if
    end for
end procedure

procedure SimpleSizeReduce(v1, v2, v3, µ)
    for i = idx to n step i += Tx do v1$_i$ = v2$_i$ − µ · v3$_i$ end for
end procedure

procedure SimpleSwap(v1, v2, buf[Tx])
    for i = idx to n step i += Tx do
        (i.) buf[idx] = v1$_i$, (ii.) v1$_i$ = v2$_i$, (iii.) v2$_i$ = buf[idx]
    end for
end procedure
The elements of matrices B,B∗,U are stored in the global memory of the
GPU. Since the size of shared memory is limited, it is not possible to load the
entire matrices in this low latency memory. Furthermore, excessive use of
shared memory decreases the occupancy, resulting in performance degradation.
Shared buffers buf1[Ty][Tx] and buf2[Ty][Tx] are allocated in order to efficiently
compute the dot products and the vector norms. In order to avoid unnecessary
access to the global memory, the GS coefficients right above the diagonal are also
stored in the shared buffer U\ due to their frequent access.
A major difference between the CUDA and OpenMP mapping lies in the
implementation of the size reductions, dot products and swaps. In the CUDA
implementation, because of the two dimensional TB configuration, Tx
threads are working in every procedure. For example, in the dot product
calculation every thread has to do n/Tx multiplications and the result of
each multiplication is added to the shared memory buffer. When all the
threads have finished, a parallel reduction is applied on the buffer in order to
conclude the dot product computation. In case of the OpenMP implementation,
only one thread is working on the computation of a dot product since the number
of threads is limited.
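The two stages of the dot product described above can be emulated sequentially in Python. This sketch is not CUDA code and makes two assumptions that are labeled here: Tx is a power of two, and the inner loops over `idx` stand in for threads that run concurrently on the GPU (with a synchronization between the rounds of the second stage). The function name `dot_product_tree` is hypothetical.

```python
def dot_product_tree(v1, v2, tx):
    """Emulate tx cooperating threads computing a dot product; tx is a power of two."""
    n = len(v1)
    buf = [0] * tx                      # stands in for one row of the shared buffer
    # Stage 1: thread idx accumulates elements idx, idx + tx, idx + 2 * tx, ...
    for idx in range(tx):
        for i in range(idx, n, tx):
            buf[idx] += v1[i] * v2[i]
    # Stage 2: halve the stride in every round; the result ends up in buf[0].
    stride = tx // 2
    while stride > 0:
        for idx in range(stride):       # only threads with idx < stride are active
            buf[idx] += buf[idx + stride]
        stride >>= 1
    return buf[0]


if __name__ == "__main__":
    print(dot_product_tree([1, 2, 3, 4, 5], [6, 7, 8, 9, 10], 4))
```

In each round the active threads read from the upper half of the remaining buffer and write to the lower half, so the additions of one round do not interfere with each other.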
3.2 Mapping details of the CR-MB-LLL algorithm
As stated in the previous sections, the MB-LLL algorithm allows a large
matrix to be split into several smaller submatrices, on which LR is performed in a
block-wise manner with the parallel LR algorithm CR-AS-LLL. Once the LR of the
submatrices is finished, the boundaries between adjacent submatrices are checked
and finally the GS coefficients outside the initial groups are updated. The main
condition is to keep every submatrix as an LLL-reduced matrix throughout the
processing. In [10], a detailed description of the MB-LLL algorithm is shown.
The proposed CR-MB-LLL algorithm further reduces the computational com-
plexity of the MB-LLL algorithm. In the MB-LLL algorithm, the submatrices
affected by a boundary swap have to be LLL-reduced and the GS coefficients have
to be updated. Moreover, in order to maintain the LLL conditions in the subma-
trices affected by a boundary swap, the Swap procedure has to be performed.
The complexity reduction is achieved by eliminating the GS coefficients update
in the submatrices after the execution of the CR-AS-LLL and the usage of the
SimpleSwap procedure instead of Swap in case of a boundary swap. Since the
GS coefficients are updated only when the ordering condition (2) is met for every
column vector, the processing time can be considerably reduced.
In the following, algorithm mappings are proposed for (i.) a GPU, (ii.) a
multi-core CPU architecture and (iii.) a heterogeneous system.
The GPU mapping of the CR-MB-LLL algorithm is similar to the one presented
in Section 3.1, since the procedures used are performed with a two dimensional
TB configuration even in the case of a boundary check. The main difference is
that dynamic parallelism (DP) enables the launch of new kernels from the GPU
without returning the program flow control to the CPU. DP is a feature that was
introduced in CUDA 5.0 and device compute capability 3.5 is required.
The schematic of the kernel scheduling implemented with DP is shown in
Fig. 1. The CPU launches the Block kernel. The size of the grid is equal to the
number of matrices that are simultaneously processed and the number of threads
in one TB is equal to the number of submatrices. In this case, every thread has
to prepare the data for the corresponding submatrices and launch the CR-AS-LLL
kernel. The kernel has to be relaunched if the LLL conditions were broken
by a boundary swap, which is detected by tracking state variables placed in
the global memory. When all the submatrices are reduced, the Boundaries Check
(BC) kernel is launched. Since the operations performed in this step are dot
products and column swaps, the thread configuration of the TB is the same as
in case of the CR-AS-LLL kernel. The CR-AS-LLL and BC kernels are repeated
until there are no swaps on the boundaries. Because one matrix is assigned to
one TB in the parent Block kernel, the processing of the different matrices can
be done simultaneously despite the variable number of iterations. Finally, the GS
coefficients outside the blocks are updated with the GSC-Update kernel and
size-reduction is performed wherever it is needed.
Fig. 1: Kernels scheduling on dynamic parallelism enabled GPU for the CR-MB-LLL algorithm.
Algorithm 3 The mapping of the CR-MB-LLL on the heterogeneous platform
Input: $[B_1, B_2, \dots, B_m]$, $\delta$, block-size l, T number of OpenMP threads
Output: $[B_1, B_2, \dots, B_m]$ as LLL reduced basis
#pragma omp parallel {
    mpt = m/T ▷ The number of matrices that have to be processed by one thread
    bpm = n/l ▷ The number of blocks per matrix
    Assign a CUDA stream$_{id}$ to the current CPU thread with identifier id
    Define arrays matIndD[mpt] on the GPU and matIndH[mpt] on the host ▷ The indexes
        of the unprocessed matrices are stored in these arrays
    for i = 0 to mpt step i++ do matIndH[i] = id · mpt + i end for
    Define arrays boundaryExchD[mpt · bpm] and boundaryExchH[mpt · bpm]
    while mpt > 0 do
        Asynchronously copy matIndH to matIndD on stream$_{id}$
        Launch the CR-AS-LLL kernel on stream$_{id}$ with grid size grid$_{lll}$ = mpt · bpm and
            TB(Tx, Ty) ▷ The CR-AS-LLL is performed on the submatrices, without updating the
            GS coefficients
        Launch the BoundaryCheck kernel on stream$_{id}$ with grid size grid$_{bc}$ = mpt · (bpm − 1)
            and TB(Bx, By) ▷ The LLL conditions (1) and (2) are checked on the boundary of two
            adjacent submatrices. If the conditions are not met, the SimpleSwap is executed
            instead of the Swap procedure.
        Asynchronously copy boundaryExchD to boundaryExchH on stream$_{id}$
        Synchronize CPU thread with stream$_{id}$
        if there was no boundary exchange for one matrix then
            Remove the matrix index from matIndH and mpt ← mpt − 1 ▷ The CPU threads
                have to process the result of the boundary exchange
        end if
    end while
    Launch the GSC-Update kernel on stream$_{id}$ ▷ In this kernel all the GS coefficients are
        updated and size reduction is performed where necessary.
}
Fig. 2: Kernels scheduling on the heterogeneous platform for the CR-MB-LLL algorithm.
The schematic of the heterogeneous platform is shown in Fig. 2. The CPU
threads are responsible for launching the CR-AS-LLL, BC and GSC-Update kernels,
updating the state variables and implementing the control logic of the dynamic
scheduling. The mapping of the CR-MB-LLL algorithm on the heterogeneous platform is
presented in Alg. 3.
A different CUDA stream is assigned for every CPU thread, making possible
the concurrent kernel execution and reducing the idle time of the CUDA cores.
Before launching the CR-AS-LLL and BC kernels, the CPU thread updates the
matIndD array placed in the GPU’s global memory to specify which matrices
need further processing. The size of the grid is dynamically adjusted according
to the number of non-processed matrices in every iteration. After the Boundary
Check kernel is executed and the boundaryExchH is updated on the host, the
CPU thread checks if the LR of any matrix is finished. If LLL reduced matrices
are found, the matIndH is updated and consequently the size of the grids assigned
to the CR-AS-LLL and BC kernels is decreased. The GSC-Update kernel starts
after all the matrices assigned to one CPU thread have been completely processed.
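The host-side control logic of one CPU thread can be summarized with a small Python simulation. It is only a sketch of the scheduling idea: no GPU is involved, `run_pass` is a hypothetical stand-in for the sequence "copy matIndH, launch the CR-AS-LLL and BC kernels, copy boundaryExchD back, synchronize", and `make_fake_device` is an invented helper that pretends each matrix needs a fixed number of passes before its boundaries stop exchanging columns.

```python
def make_fake_device(passes_needed):
    """Return a run_pass function that simulates the kernels of one iteration.

    passes_needed maps a matrix index to the (assumed) number of passes after
    which no boundary exchange happens for that matrix any more.
    """
    remaining = dict(passes_needed)

    def run_pass(active):
        exchanged = set()
        for i in active:
            remaining[i] -= 1
            if remaining[i] > 0:
                exchanged.add(i)        # this matrix still had a boundary swap
        return exchanged

    return run_pass


def host_thread_loop(mat_indices, bpm, run_pass):
    """Shrink the list of active matrices until every matrix is LLL reduced."""
    active = list(mat_indices)          # plays the role of matIndH
    grid_sizes = []
    while active:
        # Grid sizes of the CR-AS-LLL and BC kernels for this iteration.
        grid_sizes.append((len(active) * bpm, len(active) * (bpm - 1)))
        exchanged = run_pass(active)
        active = [i for i in active if i in exchanged]
    return grid_sizes


if __name__ == "__main__":
    run_pass = make_fake_device({0: 1, 1: 3, 2: 2})
    print(host_thread_loop([0, 1, 2], bpm=4, run_pass=run_pass))
```

The returned list makes the dynamic adjustment visible: the grids shrink each time a matrix finishes, and the loop ends when no matrix of this CPU thread needs further processing, which is the point where the GSC-Update kernel would be launched.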
The control structure required by the multi-core architecture is similar to the
one presented in the heterogeneous platform. The difference is that instead of
launching GPU kernels, the master threads fork a specified number of slave threads
that are processing the submatrices in parallel. The parallel LR of the submatrices
is performed according to Alg. 1. In this case the very limited number of CPU
threads restricts the exploitation of several levels of parallelism.
4 Evaluation results
In this section we present the performance results of the proposed algorithms. The
computations were done in single-precision floating point arithmetic and parameter
δ = 0.75 was used for the LLL condition (2). Block-Toeplitz matrices have been
considered to evaluate the performance of the different implementations. This
type of matrix is commonly used in wireless communications [17].
Figure 3 shows the computational times of the MB-LLL algorithm based on
the architectures discussed in Section 3.2 for different matrix dimensions, where
l denotes the size of the processed blocks. The performance measurements were
evaluated with all the possible block sizes and the best configuration is shown. The
architectures used for the computational time measurements are the Nvidia Tesla
K20 (DP capability) and an Intel Core i7-3820 processor. The processing times
show similar performance for large matrices when the GPU is involved. However,
the heterogeneous platform clearly outperforms the solution based on DP in the
case of small matrices. This gap is caused by the overhead required when launching
kernels from kernels with DP and the limited overlapping execution of kernels on
different streams. The conclusion is that the data transfer between CPU and GPU
required by the heterogeneous system is less time consuming than the overhead
of the kernel launch with DP and the limitation of the concurrent execution of
kernels on different streams.
Fig. 3: Computational time of the MB-LLL algorithm on different architectures, where l
denotes the size of the processed blocks. [Plot: time (s), logarithmic scale, versus matrix
dimension from $2^3$ to $2^{10}$; curves: MB-LLL GPU with DP, MB-LLL GPU+CPU, MB-LLL CPU;
each point is annotated with its best block size l.]
Fig. 4: Computational times of algorithms CR-AS-LLL, MB-LLL and CR-MB-LLL for
matrix dimensions $2^3$ to $2^{10}$. [Plot: time (s), logarithmic scale, versus matrix dimension;
curves: CR-AS-LLL CPU, MB-LLL CPU, CR-MB-LLL CPU, CR-AS-LLL GPU, MB-LLL GPU+CPU,
CR-MB-LLL GPU+CPU.]
Figure 4 compares the average computational time of the CR-AS-LLL, MB-LLL
and CR-MB-LLL algorithms for different matrix dimensions. The algorithms
were evaluated on the Intel Core i7-3820 CPU, the Nvidia GeForce GTX 690
GPU and the heterogeneous system containing the previously mentioned CPU
and GPU. In [7], it was shown that the GTX 690 has a better performance than
the K20 when performing LR, thus the DP is not required. Regarding the GPU and
combined CPU+GPU implementations the following conclusions can be drawn: (i.)
the computational time of the CR-MB-LLL is 25 to 40% lower in case of small and
medium-sized matrices compared to the MB-LLL algorithm and the performance
is similar in case of larger matrices, (ii.) the CR-AS-LLL performs better than
the CR-MB-LLL in case of small matrices, however for large matrices the block
concept implemented in the CR-MB-LLL achieves 30% speed-up compared to the
CR-AS-LLL and (iii.) the systems using the GPU outperform the CPU for every
matrix dimension with speed-ups ranging from 6 to 15.
Regarding the CPU implementations the following conclusions can be drawn:
(i.) the CR-MB-LLL always outperforms the MB-LLL algorithm with speed-ups
ranging from 2 to 7, (ii.) the CR-AS-LLL algorithm performs better than the
MB-LLL and CR-MB-LLL for matrices with low dimensions ($2^3$ to $2^6$), (iii.) the
computational time of the CR-MB-LLL is 10 to 20% lower in case of larger matrices
compared to the CR-AS-LLL.
Ref        Algorithm             Architecture                     4×4           8×8           64×64         1024×1024
[3]        Clarkson's algorithm  Virtex-II-Pro FPGA               4.2 × 10^-6   x             x             x
[9]        Complex LLL           Virtex-5 FPGA                    0.79 × 10^-6  x             x             x
[6]        Brun's algorithm      ASIC 250 nm                      0.07 × 10^-6  x             x             x
[5]        Reverse Siegel LLL    Virtex-4 FPGA                    0.18 × 10^-6  x             x             x
[5]        Reverse Siegel LLL    ASIC 130 nm                      0.04 × 10^-6  x             x             x
[1]        SB-LLL                ADRES                            0.17 × 10^-6  x             x             x
This work  CR-AS-LLL             GTX690 GPU                       x             0.33 × 10^-6  1.37 × 10^-5  1.67 × 10^-2
This work  CR-MB-LLL             GTX690 GPU + Intel i7-3820 CPU   x             0.77 × 10^-6  1.30 × 10^-5  1.28 × 10^-2
Table 1: Performance comparison of different lattice reduction implementations.
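As a worked check based only on the two "This work" rows of Table 1, the ratio of the CR-AS-LLL time to the CR-MB-LLL time, denoted here by the ad hoc symbol $S$, can be evaluated for each reported dimension:

\[
S_{8 \times 8} = \frac{0.33 \times 10^{-6}}{0.77 \times 10^{-6}} \approx 0.43, \qquad
S_{64 \times 64} = \frac{1.37 \times 10^{-5}}{1.30 \times 10^{-5}} \approx 1.05, \qquad
S_{1024 \times 1024} = \frac{1.67 \times 10^{-2}}{1.28 \times 10^{-2}} \approx 1.30 .
\]

A value below 1 means that the CR-AS-LLL is faster, which is the case for the $8 \times 8$ matrices, while for the $1024 \times 1024$ matrices the ratio of about 1.30 is in line with the 30% speed-up reported for large matrices. Expressed as a reduction of the execution time, the same entries give $1 - 1.28/1.67 \approx 23\%$.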
A surprising result is that, while the CR-MB-LLL achieves a significant speed-
up compared to the MB-LLL for the CPU architecture, the same is not true for
GPU architecture. This fact is due to several reasons. The computational complex-
ity reductions for the CR-MB-LLL affect only the CR-AS-LLL and BC kernels.
However, in case of large matrices, the GSC-Update kernel is taking the major part
of the processing time. This kernel has to access the global memory frequently and
these accesses have a high latency. In case of the CPU, this problem is alleviated
by the high speed memory access and the large amount of available cache for CPU.
The performance of LR mostly depends on the precision of the computation,
the size and type of the basis matrix and the architecture used. In Table 1 the
performance of existing implementations is presented. Previous research mostly focused
on small matrices. In [2] performance measures for higher dimension matrices are
presented as well, however the total runtime of the algorithm is not specified.
5 Conclusions
In this paper, we proposed the CR-MB-LLL algorithm and compared it
with the CR-AS-LLL and MB-LLL algorithms presented in [10]. The idea behind
the CR-MB-LLL algorithm is the relaxation of LLL condition (1) for the
submatrices, resulting in the delay of the GS coefficients update when executing the
LR, and the replacement of the Swap procedure by the less costly SimpleSwap
procedure when performing the boundary checks.
The CR-MB-LLL algorithm has been evaluated on several architectures: a
multi-core architecture, a GPU with DP capability and a heterogeneous platform
based on a CPU and GPU. Results show that mapping the CR-MB-LLL algorithm
on the heterogeneous architecture reduces the computational time by 30%
compared to the CR-AS-LLL in case of large matrices, whereas implementations
involving GPUs achieve speed-up factors from 6 to 15 compared to the multi-core
CPU architecture. The MB-LLL algorithm achieves speed-up factors ranging from
5 to 25 when launched on the proposed heterogeneous platform compared to the
DP-based GPU implementation for matrix dimensions ranging from $2^3$ to $2^6$.
It was shown that the efficiency of the CR-MB-LLL is significantly affected
by the architecture used. The CR-MB-LLL is 1.5 to 7 times faster than the
MB-LLL algorithm when launched on the multi-core CPU architecture, but at most
1.4 times faster when launched on the GPU architecture. This is mainly because
the computational complexity reductions introduced in the CR-MB-LLL algorithm
affect only the CR-AS-LLL and BC kernels, whereas for large matrices the
GSC-Update kernel takes up most of the processing time, with frequent accesses
to the global memory of the GPU. On the CPU, memory access has a lower latency
and the available cache is significantly larger, which explains the different
speed-ups of the same algorithm on the two architectures.
Acknowledgements Financial support for this study was provided by grants
TÁMOP-4.2.1./B-11/2/KMR-2011-0002 and TÁMOP-4.2.2/B-10/1-2010-0014 from the
Pázmány Péter Catholic University, the European Union ERDF, the Spanish
Government through the TEC2012-38142-C04-01 project and the Generalitat
Valenciana through the PROMETEO/2009/013 project.
References
1. Ahmad, U., Amin, A., Li, M., Pollin, S., Van der Perre, L., Catthoor, F.: Scalable
block-based parallel lattice reduction algorithm for an SDR baseband processor. In:
Communications (ICC), 2011 IEEE International Conference on (2011)
2. Backes, W., Wetzel, S.: Parallel lattice basis reduction - the road to many-core. In: High
Performance Computing and Communications (HPCC), 2011 IEEE 13th International
Conference on (2011)
3. Barbero, L.G., Milliner, D.L., Ratnarajah, T., Barry, J.R., Cowan, C.: Rapid prototyping
of Clarkson’s lattice reduction for MIMO detection. In: Communications, 2009. ICC’09.
IEEE International Conference on, pp. 1–5 (2009)
4. Bremner, M.R.: Lattice basis reduction: An introduction to the LLL algorithm and its
applications. CRC Press (2012)
5. Bruderer, L., Studer, C., Wenk, M., Seethaler, D., Burg, A.: VLSI implementation of a
low-complexity LLL lattice reduction algorithm for MIMO detection. In: Circuits and
Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on (2010)
6. Burg, A., Seethaler, D., Matz, G.: VLSI implementation of a lattice-reduction algorithm
for multi-antenna broadcast precoding. In: Circuits and Systems, 2007. ISCAS 2007. IEEE
International Symposium on, pp. 673–676 (2007)
7. Domene, F., Józsa, C.M., Vidal, A.M., Piñero, G., González, A.: Performance analysis of
a parallel lattice reduction algorithm on many-core architectures. In: Proceedings of the
13th International Conference on Computational and Mathematical Methods in Science
and Engineering (2013)
8. Gestner, B., Zhang, W., Ma, X., Anderson, D.: Lattice reduction for MIMO detection:
From theoretical analysis to hardware realization 58(4), 813–826 (2011)
9. Gestner, B., Zhang, W., Ma, X., Anderson, D.V.: VLSI implementation of a lattice
reduction algorithm for low-complexity equalization. In: Circuits and Systems for
Communications, 2008. ICCSC 2008. 4th IEEE International Conference on, pp. 643–647 (2008)
10. Józsa, C.M., Domene, F., Piñero, G., González, A., Vidal, A.M.: Efficient GPU
implementation of lattice-reduction-aided multiuser precoding. In: Wireless Communication
Systems (ISWCS 2013), Proceedings of the Tenth International Symposium on (2013)
11. Lenstra, A.K., Lenstra, H.W., Lovász, L.: Factoring polynomials with rational coefficients.
Mathematische Annalen 261(4), 515–534 (1982)
12. Luo, Y., Qiao, S.: A parallel LLL algorithm. In: Proceedings of The Fourth International
C* Conference on Computer Science and Software Engineering, pp. 93–101 (2011)
13. Shabany, M., Youssef, A., Gulak, G.: High-throughput 0.13-µm CMOS lattice reduction
core supporting 880 Mb/s detection. Very Large Scale Integration (VLSI) Systems, IEEE
Transactions on 21(5) (2013)
14. Villard, G.: Parallel lattice basis reduction. In: Papers from the international symposium
on Symbolic and algebraic computation, ISSAC ’92. ACM, New York, NY, USA (1992)
15. Wetzel, S.: An efficient parallel block-reduction algorithm. In: Algorithmic Number
Theory, pp. 323–337. Springer (1998)
16. Wu, D., Eilert, J., Liu, D.: A programmable lattice-reduction aided detector for MIMO-
OFDMA. In: Circuits and Systems for Communications, 2008. ICCSC 2008. 4th IEEE
International Conference on, pp. 293–297 (2008)
17. Wubben, D., Seethaler, D., Jaldén, J., Matz, G.: Lattice reduction 28(3), 70–91 (2011)
