The lattice reduction (LR) technique has become very important in many engineering fields. However, its high complexity makes difficult its use in real-time applications, especially in applications that deal with large matrices. As a solution, the modified block LLL (MB-LLL) algorithm was introduced, where several levels of parallelism were exploited: (a) fine-grained parallelism was achieved through the cost-reduced all-swap LLL (CR-AS-LLL) algorithm introduced together with the MB-LLL by Józsa et al. (Proceedings of the tenth international symposium on wireless communication systems, 2013) and (b) coarse-grained parallelism was achieved by applying the block-reduction concept presented by Wetzel (Algorithmic number theory. Springer, New York, pp [323][324][325][326][327][328][329][330][331][332][333][334][335][336][337] 1998). In this paper, we present the cost-reduced MB-LLL (CR-MB-LLL) algorithm, which allows to significantly reduce the computational complexity of the MB-LLL by allowing the relaxation of the first LLL condition while executing the LR of submatrices, resulting in the delay of the Gram-Schmidt coefficients update and by using less costly procedures during the boundary checks. The effects of complexity reduction and implementation details are analyzed and discussed for several architectures. A mapping of the CR-MB-LLL on a heterogeneous platform is proposed and it is compared with implementations running on a dynamic parallelism enabled GPU and a multi-core CPU. The mapping on the architecture proposed allows a dynamic scheduling of kernels where the overhead introduced is hidden by the use of several CUDA streams. Results show that the execution time of the CR-MB-LLL algorithm on the heterogeneous platform outperforms the multi-core CPU and it is more efficient than the CR-AS-LLL algorithm in case of large matrices.
the parallel block-reduction concept presented in [2] , a higher level, coarse-grained parallelism can be applied as an extra level of parallelism. The idea is to subdivide the original lattice basis matrix in several smaller submatrices and perform an independent LR on them followed by a boundary check between adjacent submatrices. The modified block LLL (MB-LLL) algorithm proposed in [1] implements a parallel processing of the submatrices by using the parallel CR-AS-LLL algorithm for the LR of every submatrix.
The implementations of the previous references make use of only one architecture to calculate the LR of a basis. However, a better performance can be obtained by combining different architectures, what is known as heterogeneous computing. Among the different combinations, the use of CPU and GPU is probably the most popular since it can be found in most of the computers.
In this paper, we present the cost-reduced MB-LLL (CR-MB-LLL) algorithm in order to further reduce the computational complexity of the MB-LLL algorithm [1] . The main idea behind the CR-MB-LLL algorithm is the relaxation of the first LLL condition while executing the LR for the submatrices, resulting in the delay of the Gram-Schmidt (GS) coefficients update and by using less costly procedures when performing the boundary checks. The effects of this complexity reduction are evaluated on different architectures.
A mapping of the CR-MB-LLL algorithm on a heterogeneous platform consisting of a CPU and a GPU is proposed and it is compared with implementations running on a GPU with dynamic parallelism (DP) capability and a multi-core CPU architecture. The proposed architecture allows a dynamic scheduling of kernels where the overhead introduced by host-device communication is hidden by the use of CUDA streams. Results show that the CR-MB-LLL algorithm executed on the heterogeneous platform outperforms the DP-based GPU and multi-core CPU implementations.
The algorithm mappings on different parallel architectures is very challenging, since the number of processing cores, latency and size of different memories available and cache size significantly differ. The mapping details of the CR-AS-LLL and CR-MB-LLL algorithms to different parallel architectures is also presented with a special emphasis on the work distribution among the threads and the efficient memory utilization.
The paper is organized as follows. In Sect. 2 a brief introduction to LR is given. The architecture dependent mapping details of the CR-AS-LLL and CR-MB-LLL are presented in Sect. 3, followed by their performance evaluations on different platforms in Sect. 4. Finally, conclusions are stated in Sect. 5.
Problem description
. . , n is a discrete additive subgroup of R n , where b 1 , b 2 , . . . , b n ∈ R n are linearly independent vectors and Z denotes the set of integers. Let matrix B = (b 1 , . . . , b n ) ∈ R n×n denote the full (column) ranked basis of the lattice. Let B * = (b * 1 , . . . , b * n ) ∈ R n×n denote the associated orthogonal basis of B, calculated by the GS orthogonalization process as
for 1 ≤ j < i ≤ n, also called the GS coefficients and (·, ·) denotes the ordinary dot product on R n . Thus, matrix B can be expressed as B = B * · U, where matrix U is upper triangular with unit diagonal, and whose element (i, j) above the diagonal is given by the GS coefficient μ j,i .
Definition 1
Given a lattice L with basis B ∈ R n×n , associated orthogonal basis B * ∈ R n×n , and GS coefficients μ i, j , B is called LLL-reduced if the following conditions are satisfied:
Parallel lattice-reduction algorithms and their mapping to parallel architectures
Since, the LLL algorithm shows a highly sequential behavior, multiple levels of parallelism have to be identified and exploited in order to efficiently parallelize this algorithm. Dividing the problem in several sub-problems that can be executed concurrently, can be regarded as one level of parallelism. In addition, if a sub-problem could benefit from a multi-threaded environment it can be regarded as a second level of parallelism. Previous parallel LR implementations, such as the ones presented in [10] [11] [12] , have focused only on multi-core architectures. The main drawback of the low number of threads offered by modern CPUs (compared to GPUs) is that low level parallelism can not be efficiently exploited. During an algorithm design, low level parallelism is usually omitted and the levels of parallelism are also restricted. In case of GPUs, the high number of CUDA cores makes possible the parallel execution of a high number of threads leading to significant performance improvements.
Mapping details of the CR-AS-LLL algorithm
Basically, LR consists of a succession of swaps between vectors of the basis and some operations to decrease their norms. The order in which the swaps are applied in the LLL algorithm is limiting in a parallel framework. Villard [13] introduced the any swap reduction concept, that enables simultaneous basis swaps and served as a basis for future parallel implementations. In [10] , the concept of delaying the size reductions was introduced. In the CR-AS-LLL algorithm, further computational cost is saved by rearranging and delaying the frequently used size reduction procedure. Procedures SimpleSizeReduce, SimpleSwap and Swap are defined in order to give an accurate description of the CR-AS-LLL algorithm. 
2 , the following updates have to be applied: 
2 , the following updates have to be applied:
In Algorithm 1, the OpenMP implementation of the CR-AS-LLL is presented. Twolevel parallelism is implemented based on a nested parallelism construct. The outer level parallelism starts the concurrent processing of sim Mat number of lattice basis and the inner parallel construct is responsible for the parallel LR of a basis with T P M number of threads. As the outer level parallelism is expanded, namely sim Mat is increased, the threads available for the parallel LR are decreased.
When mapped to the GPU, the performance of the CR-AS-LLL algorithm depends on the efficiency of the work distribution among the available GPU threads and the implementation of the most frequently used operations, such as dot products, size reductions and column swaps. In Algorithm 2, the CUDA pseudo-code is presented with a two dimensional thread block TB(T x , T y ) configuration, where T x and T y denotes the number of threads in the x and y dimension. The kernel is launched with a one dimensional grid whose size is determined by the number of basis processed simultaneously. The number of threads T y is defined based on the size of the original basis, i.e. T y = max (n/2, 32). By enabling the usage of T x = max (n, 32) threads in the x dimension, the threads that belong to the same y dimension will form a warp, consequently the global memory loads and stores issued by the threads of the warp will be coalesced. The y dimension also defines the extent of parallelism. The iteration variable of the for loop is increased in every iteration by T y · 2. In other words, in every phase the threads with the same id y have to reduce and swap at most n/(T y ·2) vectors.
The elements of matrices B, B * , U are stored in the global memory of the GPU. Since the size of shared memory is limited, it is not possible to load the entire matrices in this low latency memory. Furthermore, the excessive use of shared memory is decreasing the occupancy, resulting in performance degradation. Shared buffers buf 1 
are allocated in order to efficiently compute the dot products and the vector norms. In order to avoid unnecessary access to the global memory, the GS coefficients right above the diagonal are also stored in the shared buffer U \ due to their frequent access. 
Algorithm 1 CR-AS-LLL
The number of threads for parallel processing one matrix 6: #pragma omp parallel numthreads(sim Mat) { 7: gr p ← set current thread id 8: odd = true, even = true, of f = 0, i = gr p · (m/sim Mat) 9: #pragma omp parallel numthreads(T P M) shared(odd, even, of f ) firstprivate(gr p) { 10:
while (odd or even) do 12:
#pragma omp single { 13:
#pragma omp for reduction( :odd,even) 16:
for k = 2 + of f to n step 2 Embarrassingly parallel for all k 17: A major difference between the CUDA and OpenMP mapping lies in the implementation of the size reductions, dot products and swaps. In the CUDA implementation, because of the two dimensional TB configuration, T x number of threads are working in every procedure. For example, in the dot product calculation every thread has to do n/T x number of multiplications and the result of the multiplication is added to the shared memory buffer. When the execution of all the thread finishes, a parallel prefix sum is applied on the buffer in order to conclude the dot product computation. In case of the OpenMP implementation, only one thread is working in the computation of a dot product since the number of threads are limited.
Mapping details of the CR-MB-LLL algorithm
As stated in the previous sections the MB-LLL algorithm allows to split a large matrix in several smaller submatrices where parallel LR is performed in a block-wise manner with the parallel LR algorithm CR-AS-LLL. Once the LR of the submatrices is finished, the boundaries between adjacent submatrices are checked and finally the GS coefficients outside the initial groups are updated. The main condition is to keep every of f = (of f + 1) mod 2 8:
for k = id y * 2 + 1 + of f to n step k += T y * 2 do 9:
Call DotProduct(b
Threads with (id
and set odd or even to true depending on the of f variable 21:
Call
end if 23:
end for 24:
Synchronize threads 25: end while 26: Copy the U \ to the diagonal elements of U 27: Update the rest of GS coefficients based on the procedures and methods presented above 28: procedure DOTPRODUCT(v 1 
The result is stored in bu f at index 0 29: 
end for 42: end procedure submatrix as an LLL-reduced matrix throughout the processing. In [1] , a detailed description of the MB-LLL algorithm is shown.
The proposed CR-MB-LLL algorithm further reduces the computational complexity of the MB-LLL algorithm. In the MB-LLL algorithm, the submatrices affected by a boundary swap have to be LLL-reduced and the GS coefficients have to be updated. Moreover, in order to maintain the LLL conditions in the submatrices affected by a boundary swap, the Swap procedure has to be performed. The complexity reduction is achieved by eliminating the GS coefficients update in the submatrices after the execution of the CR-AS-LLL and the usage of the SimpleSwap procedure instead of Swap in case of a boundary swap. Since the GS coefficients are updated only when the ordering condition (2) is met for every column vector, the processing time can be considerably reduced.
In the following, algorithm mappings are proposed for a (a) GPU, (b) multi-core CPU architecture and (c) a heterogeneous system.
The GPU mapping of the CR-MB-LLL algorithm is similar to the one presented in Sect. 3.1, since the procedures used are performed with a two dimensional TB configuration even in the case of a boundary check. The main difference is that dynamic parallelism enables the launch of new kernels from the GPU without returning the program flow control to the CPU. DP is a feature that was introduced in CUDA 5.0 and device compute capability 3.5 is required.
The schematic of the kernels scheduling implementing the DP is shown in Fig. 1 . The CPU launches the Block kernel. The size of the grid is equal to the number of matrices that are simultaneously processed and the number of threads in 1 TB is equal to the number of submatrices. In this case, every thread has to prepare the data for the corresponding submatrices and launch the CR-AS-LLL kernel. The kernel has to be relaunched if the LLL conditions were broken by a boundary swap, which can be solved by tracking state variables placed in the global memory. When all the submatrices are reduced, the Boundaries Check (BC) kernel is launched. Since the operations performed in this section are dot products and column swaps, the thread configuration of the TB is the same as in case of the CR-AS-LLL kernel. The CR-AS-LLL and BC kernels are repeated until there are no swaps on the boundaries. Because one matrix is assigned to one TB in the parent Block kernel, the processing of the different matrices can be done simultaneously despite the variable number of iterations. Finally, the GS coefficients outside the blocks are updated with the GSCUpdate kernel and the size-reduction is performed wherever is needed. The LLL conditions (1) and (2) are checked on the boundary of two adjacent submatrices. In case if the conditions are not met the SimpleSwap is executed instead of the Swap procedure. 14:
Asynchronously copy boundar y E xch D to boundar y E xch H on stream id 15:
Synchronize CPU thread with stream id 16: if There was no boundary exchange for one matrix then 17:
Remove the matrix index from mat I nd H and mpt ← mpt − 1 The CPU threads have to process the result of the boundary exchange 18:
end if 19: end while 20: Launch the G SC − U pdate kernel on stream id In this kernel all the GS coefficients are updated and size reduction is performed where necessary. 21: } The schematic of the heterogeneous platform is shown in Fig. 2 . The CPU threads are responsible for launching CR-AS-LLL, BC and GSC-Update kernels, update the state variables and implement the control logic of the dynamic scheduling. The map-ping of the CR-MB-LLL algorithm on the heterogeneous platform is presented in Algorithm 3.
A different CUDA stream is assigned for every CPU thread, making possible the concurrent kernel execution and reducing the idle time of the CUDA cores. Before launching the CR-AS-LLL and BC kernels, the CPU thread updates the mat I nd D array placed in the GPU's global memory to specify which matrices need further processing. The size of the grid is dynamically adjusted according to the number of non-processed matrices in every iteration. After the Boundary Check kernel is executed and the boundar y E xch H is updated on the host, the CPU thread checks if the LR of any matrix is finished. If LLL reduced matrices are found, the mat I nd H is updated and consequently the size of the grids assigned to the CR-AS-LLL and BC kernels is decreased. The GSC-Update kernel starts after all the matrices assigned to one CPU thread have been completely processed.
The control structure required by the multi-core architecture is similar to the one presented in the heterogeneous platform. The difference is that instead of launching GPU kernels, the master threads fork a specified number of slave threads that are processing the submatrices in parallel. The parallel LR of the submatrices is performed according to Algorithm 1. In this case the very limited number of CPU threads restricts the exploitation of several levels of parallelism.
Evaluation results
In this section we present the performance results of the proposed algorithms. The computations were done in single-precision floating point arithmetic and parameter δ = 0.75 was used for the LLL condition (2). Block-Toeplitz matrices have been considered to evaluate the performance of the different implementations. These type of matrices are usually used in wireless communications [3] . Figure 3 shows the computational times of the MB-LLL algorithm based on the architectures discussed in Sect. 3.2 for different matrix dimensions, where l denotes the size of the processed blocks. The performance measurements were evaluated with all the possible block sizes and the best configuration is shown. The architectures used for the computational time measurements are the Nvidia Tesla K20 (DP capability) and an Intel Core i7-3820 processor. The processing times show similar performance for large matrices when the GPU is involved. However, the heterogeneous platform clearly outperforms the solution based on DP in the case of small matrices. This gap is caused by the overhead required when launching kernels from kernels with DP and the limited overlapping execution of kernels on different streams. The conclusion is that the data transfer between CPU and GPU required by the heterogeneous system is less time consuming than the overhead of the kernel launch with DP and the limitation of the concurrent execution of kernels on different streams. Figure 4 compares the average computational time of the CR-AS-LLL, MB-LLL and CR-MB-LLL algorithms for different matrix dimensions. The algorithms were evaluated on the Intel Core i7-3820 CPU, the Nvidia GeForce GTX 690 GPU and the heterogeneous system containing the previously mentioned CPU and GPU. In [14] , it was shown that the GTX 690 has a better performance than the K20 when perform- ing LR, thus the DP is not required. Regarding the GPU and combined CPU+GPU implementations the following conclusions can be drawn: (a) the computational time of the CR-MB-LLL is 25-40 % lower in case of small and medium-sized matrices compared to the MB-LLL algorithm and the performance is similar in case of larger matrices, (b) the CR-AS-LLL performs better than the CR-MB-LLL in case of small matrices, however for large matrices the block concept implemented in the CR-MB-LLL achieves 30 % speed-up compared to the CR-AS-LLL and (c) the systems using the GPU outperform the CPU for every matrix dimension with speed-ups ranging 6-15.
Regarding the CPU implementations the following conclusions can be drawn: (a) the CR-MB-LLL always outperforms the MB-LLL algorithm with speed-ups ranging 2-7, (b) the CR-AS-LLL algorithm performs better than the MB-LLL and CR-MB-LLL for matrices with low dimensions (2 3 − 2 6 ), (c) the computational time of the CR-MB-LLL is 10-20 % lower in case of larger matrices compared to the CR-AS-LLL.
A surprising result is that, while the CR-MB-LLL achieves a significant speed-up compared to the MB-LLL for the CPU architecture, the same is not true for GPU architecture. This fact is due to several reasons. The computational complexity reductions for the CR-MB-LLL affect only the CR-AS-LLL and BC kernels. However, in case of large matrices, the GSC-Update kernel is taking the major part of the processing time. This kernel has to access the global memory frequently and these accesses have a high latency. In case of the CPU, this problem is alleviated by the high speed memory access and the large amount of available cache for CPU.
The performance of LR mostly depends on the precision of the computation, the size and type of the basis matrix and the architecture used. In Table 1 performance of existing implementations are presented. Previous research mostly focused on small matrices. In [11] performance measures for higher dimension matrices are presented as well, however the total runtime of the algorithm is not specified.
Conclusions
In this paper, we proposed the CR-MB-LLL algorithm and we have compared it with the CR-AS-LLL and MB-LLL algorithms presented in [1] . The idea behind the CR-MB-LLL algorithm is the relaxation of LLL condition (1) for the submatrices, resulting in the delay of the GS coefficients update when executing the LR and the replacement of the Swap procedure by the less costly SimpleSwap procedure when performing the boundary checks.
The CR-MB-LLL algorithm has been evaluated on several architectures: a multicore architecture, a GPU with DP capability and a heterogeneous platform based on a CPU and GPU. Results show that mapping the CR-MB-LLL algorithm on the heterogeneous architecture reduces the computational time by 30 % compared to the CR-AS-LLL in case of large matrices, whereas implementations involving GPUs achieve speed-up factors from 6 to 15 compared to the multi-core CPU architecture. The MB-LLL algorithm achieves speed-up factors ranging 5-25 when launched on the proposed heterogeneous platform compared to the DP-based GPU implementation for matrix dimensions ranging 2 3 -2 6 .
It was shown that the efficiency of the CR-MB-LLL is significantly affected by the architectures used. The CR-MB-LLL is 1.5-7 times faster compared to the MB-LLL algorithm when launched on multi-core CPU architecture, however the CR-MB-LLL is only at most 1.4 times faster compared to the MB-LLL when launched on the GPU architecture. This is mainly because the computational complexity reductions introduced in the CR-MB-LLL algorithm affect the CR-AS-LLL and BC kernels. However, in case of large matrices the GSC-Update kernel is taking the major part of the processing time with frequent accesses to the global memory of the GPU. In case of Barbero et al. [7] Clarkson's algorithm Virtex-II-Pro the CPU, the memory access has a lower latency and the available cache for CPU is significantly bigger, causing different speed-ups of the same algorithm on the different architectures.
