In this paper, we present a GPU-based parallel algorithm for the Learning With Errors (LWE) problem using a lattice-based Bounded Distance Decoding (BDD) approach. To the best of our knowledge, this is the first GPU-based implementation for the LWE problem. Compared to the sequential BDD implementation of Lindner-Peikert and pruned-enumeration strategies by Kirshanova [1], our GPU-based implementation is almost faster by a factor 6 and 9 respectively. The used GPU is NVIDIA GeForce GTX 1060 6G. We also provided a parallel implementation using two GPUs. The results showed that our algorithm is scalable and faster than the sequential version (Lindner-Peikert and pruned-enumeration) by a factor of almost 13 and 16 respectively. Moreover, the results showed that our parallel implementation using two GPUs is more efficient than Kirshanova et al.'s parallel implementation using 20 CPU-cores.
,
Inner product |S| Number of elements of a set S Euclidean norm Z p Set of nonnegative integers less than p O(f ) Soft O notation v Row vector (usually in Z m ) a mod q Remainder after division of a by q poly (n) Polynomial function of degree n span{b 1 , . . . , b n } Span of a set of vectors over Z
I. INTRODUCTION
For the last few decades, the public-key cryptosystems based on the integer factorization problem, the discrete logarithm problem and the elliptic curve problem have been considered among the most secure public key cryptosystems. Recently, the emergence of the quantum computing has posed a major threat to these types of cryptosystems. In fact, there is no physical implementation of the quantum computing due to no existence of quantum computers until now. The wide evolution of the technology increases the feeling that the emerging of the quantum computers is not far off. As precautionary measures, the researchers started seeking alternatives The associate editor coordinating the review of this manuscript and approving it for publication was Chien-Ming Chen .
for unsecure cryptosystems in quantum computing. The most promising cryptosystems are those based on lattice problems such as GGH [2] , NTRU [3] and recently LWE-based cryptosystems [4] , [5] . The lattice problems are known to be post-quantum problems [6] . Up to now, there is no quantum algorithm that can solve the lattice problems in polynomial time.
There are many hard problems relating to lattices. The most common hard problems in lattices are SVP (Shortest Vector Problem) and CVP (Closest Vector Problem). In SVP, we aim to find a shortest vector for a given lattice L that is built up of an integral combination of m linearly independent vectors in Z n . While, in CVP, for a given target vector in R n and m linearly independent vectors (called basis) in Z n for a lattice L ⊂ Z n , we aim to retrieve a closest vector to the target vector in the lattice space.
The Learning With Errors (LWE) problem [4] is one of the modern problems based on CVP. Recently, it has been widely used for constructing many efficient cryptosystems such as public key encryptions [4] , [7] , [8] , digital signatures [9] , [10] , identity-based encryptions [11] - [13] and fully homomorphic encryptions [14] - [16] . In fact, LWE-based cryptosystems have attracted the attention of researchers as an anticipated candidate to cryptosystems that are vulnerable against quantum computing attacks.
The LWE problem was firstly introduced by Regev [4] . It is defined as follows: Let n and m be integers and q be a prime modulus such that q = poly(n). For a random matrix A ∈ Z m×n q , a secret vector s ∈ Z n q and a noise vector e ∈ Z m q , define a vector c = (As + e) mod q. Typically, the noise e is sampled according to Gaussian distribution with a standard deviation αq, where O(α) = 1.
The LWE problem is presented as the problem of finding the secret vector s, given n, q, the matrix A and the vector c. Regev [4] showed that the LWE problem on the average is at least as hard as approximating lattice problems on n-dimensional lattices, in the worst case to withinÕ(n/α) factors using a quantum algorithm.
There are three approaches for solving the LWE problem: Embedding approach [17] , BDD (Bounded Distance Decoding) approach [5] , [18] , [19] and the combinatorial BKW (Blum-Kalai-Wasserman) [20] - [22] . Embedding and BDD approaches are considered as the famous approaches, which are based on a strong lattice reduction. In Embedding approach, LWE is reduced to a SVP instance and solved by a SVP solver such as LLL (Lenstra-Lenstra-Lovász) [23] or BKZ (Blockwise Korkine-Zolotarev) [24] . Whereas, in BBD, the LWE problem is reduced to CVP. The last approach for solving LWE is the combinatorial BKW approach, which is unfeasible due to its requirement of exponentially samples [21] . In this paper, we concentrate on the BDD approach. The BDD approach consists of two stages. The first stage is a basis reduction in which the reduction is carried out using either LLL or BKZ. The main objective of this stage is to provide an improvement on the basis. The second stage is a search stage for a closer vector using the Babai's Nearest Plane algorithm; see Section IV. Lindner and Peikert [5] proposed an improvement of the Babai's Nearest Plane algorithm guaranteeing that the output vector is closer to the target vector. Liu and Nguyen [18] proposed a pruned-enumeration strategy to improve the Lindner-Peikert algorithm.
The BDD approach is applied by an enumeration search represented in a weighted tree. The branches of a weighted tree are independents, so the parallelism can be widely investigated through running the code on the branches of the tree in parallel. There are few papers introduced implementations of parallelism on the LWE problem using multi-cores [1] , [25] , [26] . But to the best of my knowledge, up to now there is no investigation of GPU (Graphics Processing Unit) computation on the LWE problem.
The high computation power of GPUs comparing to their cost, attracts the attention of the most researchers to investigate their parallelism algorithms using GPU. In recent years, GPU companies have made a big evolution in the GPU hardware architecture. Hundreds of cores with high computation power can be installed on the same chip, accompanied by a gigabyte memory.
In this paper, we aim to exploit the high computation power of GPUs in enhancing the execution time of the BDD approach. To the best of our knowledge, this is the first use of GPUs for solving the LWE problem. To show our significant improvement to the BDD approach using GPU, we have developed a parallel implementation of the BDD approach for solving LWE. Our experimental results of the BDD approach were conducted using the Lindner-Peikert strategy [5] and the pruned-enumeration strategy [18] . The retrieved solution in Lindner-Peikert's implementation is an approximation to a closest vector, while in prunedenumeration implementation is a closest vector. In order to show a wide rate of improvement and scalability of our implementation, a comparison has been conducted between the sequential implementation provided by Kirshanova et al. [1] and our implementation using a single GPU and two GPUs. In Lindner-Peikert's implementation, the speedups achieved by a single GPU and two GPUs are in average 6 and 13 times respectively. While in pruned-enumeration implementation, the speedups achieved by a single GPU and two GPUs are in average 9 and 16 times respectively. For comparing our parallel implementation to Kirshanova's parallel implantation, we run Algorithm 5 basing on the pruned-enumeration strategy and using the same parameters as in [1] , in Table 1 . The results showed that our implementation using two GPUs is more efficient than Kirshanova's implementation using 20 CPU cores.
The rest of this paper is organized as follows. In Section II, we present the related works. In Section III, we present a background on Lattice, LWE and GPU. In Section IV, we describe BDD approach for solving LWE. In Section V, we describe in details the proposed parallel algorithm for solving LWE. Section VI is dedicated to show the experimental results and comparisons to the previous works. Finally, conclusions and future work are presented in Section VII.
II. RELATED WORKS
Over the three last years, few experimental studies have been published to solve the LWE problem by means of parallel computing devices. In fact, the previous works have showed significant speedups using multi-cores compared with the sequential implementation. Bischof et al. [25] presented a parallel implementation of the BDD approach for solving the LWE problem. They introduced the first parallel implementation of the Lindner-Peikert's Nearest Planes algorithm; see Section IV, Algorithm 2. Using a machine with four CPU-chips of 16 cores, they have achieved speedups factors of more than 11 times than that in a single core. Afterwards, Kirshanova et al. [1] presented a parallel implementation for the Lindner-Peikert and pruned-enumeration strategies showing that the pruned-enumeration strategy is more efficient than the Lindner-Peikert strategy. The execution time was speeded up even further than that in the traditional non-parallel algorithm, i.e., with n processors the achieved speedup is roughly n. Recently, Rui Xu et al. [26] presented an improvement on Kirshanova's implementation achieving a fully utilizing of multi-cores by applying the extreme pruning rather than the linear pruning used in [1] . They showed that their implementation is more efficient than that of kirshanova et al.
III. PRELIMINARIES
This section presents the necessary background on lattice, LWE and GPU.
A. LATTICE AND LEARNING WITH ERRORS
Let R n be the n-dimensional Euclidean space. A lattice L can be described as a regular distribution of points in the space R n such that each point can be shifted to another point in the lattice space by some symmetric arrangement. Mathematically, lattice is an integral combination of the basis b 1 , b 2 , . . . , b m ∈ R n . More formally, the lattice L can be defined by
where n is the dimension of the lattice space and m is the rank of the lattice space L. If n = m we say that L is a full rank lattice. The famous problems in lattice fields are SVP (Shortest Vector Problem) and CVP (Closest Vector Problem). Given a lattice L, SVP is the problem of finding a shortest vector of non-zero length. Whereas CVP is the problem of finding a closest vector to a given vector which is usually called target vector.
The search process of a shortest vector is usually launched using an upper bound that depends on the lattice determinant det(L). The lattice determinant is invariant and it is equal to the lattice fundamental parallelepiped. If B is the matrix constructed of linearly independents raw vectors b 1 , b 2 , . . . , b m , then the fundamental parallelepiped centered around zero is defined as follows:
The common approach for solving the lattice problems is the lattice reduction approach. The lattice reduction is the process of finding a lattice basis that is more nearly to the orthogonal basis (orthogonal basis is a basis whose vectors are pairwise orthogonal). Actually, a basis for a lattice that is ''as close to orthogonal as it can be'' is the convenient one [27] . Reducing a basis provides an improvement on the basis. Additionally, the shortest vector of the reduced basis is an approximate solution of SVP. Moreover, the reduced basis leads more quickly to the exact solution. The famous algorithms for applying the lattice reduction are the LLL algorithm devised by Lenstra et al. [23] and the BKZ algorithm developed by Schnorr and Euchner [24] . BKZ is mainly based on applying algorithms of exact solution on a block of small size β. As the block size β increases, a shorter vector is retrieved, but on the other hand, the runtime increases significantly. Usually β is of big size, for example, β > 40 gives a shorter vector, but BKZ will not terminate in a reasonable time. Although, BKZ runs in a polynomial time for the most challenging problems of lattices, we cannot prove that it runs generally in a polynomial time [24] . LLL is a special case of BKZ with β = 2 and it runs in a polynomial time.
The initial step for reducing a basis of the lattice L is to generate an orthogonal basis. Then based on the obtained orthogonal basis, a reduction algorithm LLL or BKZ is applied. A famous method for producing an orthogonal basis uses Gram-Schmidt orthogonalization process.
Actually, the generated orthogonal basis spans the same space. However, it cannot be basis of the lattice, since it built-up the lattice of non-integral combination.
Gram-Schmidt orthogonalization process is applied by transforming any set of vectors to another set of vectors that are pairwise orthogonal. Gram-Schmidt orthogonalization of a basis (b * 1 , b * 2 , . . . , b * m ) can be computed by the following iterative formula [24] :
where
The learning with errors (LWE) problem, introduced by Regev [4] , is considered one of the most useful assumption in Cryptography. The main reasons of using LWE as a cryptographic primitive are its simplicity as well as its security against the quantum computation attack. Mathematically, LWE is defined as follows:
Definition 1: (LWE) Let n, q be positive integers, χ be a probability distribution on Z q and s be a secret vector following the uniform distribution on Z n q . We denote by L (n) s,χ the probability distribution on Z n q × Z q obtained by choosing a from the uniform distribution on Z n q , choosing e ∈ Z q according to χ and returning (a, c) = (a, a, s + e) ∈ Z n q × Z q . LWE is the problem of finding the secrete vector s for a given pair (a, c).
Naturally, LWE can be represented in matrix forms, which is more appropriate to convert an LWE instance to a CVP instance. In Matrix-LWE, the problem is constructed by gathering m samples of pair (a i , c i ) for 1 ≤ i ≤ m such that c i = a i , s + e i , where e i ∈ Z q is chosen according to the discrete Gaussian distribution. Define a matrix A of n rows such that a i is the i th column in A. Let c = (c 0 , c 1 , . . . , c m ) and e = (e 1 , e 2 , . . . , e m ). From the definition of LWE, we can get that c = As + e mod q.
The challenging task in the LWE problem is to reveal the secret vector s from the equation c = As+e mod q. If the error vector e is sufficiently small, then the vector As in the lattice L can be retrieved by reducing the problem to CVP. In other words, the vector As is a closest vector in the lattice L to a target vector t. This approach of solving the LWE problem is known as the BDD approach.
B. GPU ARCHITECTURE
GPU was mainly dedicated for images processing. It is used to work on thousands of pixels at the same time. In the recent years, the GPUs have widely connected to CPUs for the purpose of speeding up a huge parallel computations in socalled heterogeneous computing. Today, GPUs play a main role in parallel computing in different domains. It takes part in the computer cluster alongside the multi-cores processors, and even in the supercomputers. It is able to provide them by a high performance computing with a reasonable consumption of power.
In fact, the concept of using GPUs for a traditional application that was previously handled using CPU, helped in emerging a new concepts called General-purpose computing on GPU (GPGPU). GPUPU frameworks such as OpenCL (Open Computing Language) and CUDA (Compute Unified Device Architecture) enable the users to execute their own code on GPU using simple languages based on C.
GPU, from the hardware point of view, consists of many independent streaming multiprocessor (SMs). Each SM contains a collection of Streaming Processors (SPs) which are known as cores. In fact, GPU is designed to run in Single Instruction Multiple Data (SIMD) model. In this model, a single instruction is fetched out to many SPs such that each SP runs the same instruction but on different data. The implementation of the instructions is done through threads; 32 threads are grouped together to form a so-called warp. Each thread in a warp is mapped to a single SP and all threads in the same warp are guaranteed to start and finish their works at the same time. Additionally, the threads that run on GPU are grouped into blocks. Any block can run only on a single SM at time. Grid is the big category in GPU, it contains all blocks on GPU.
In order to decrease the latency and to increase the occupancy, thousands of threads have to be scheduled to run concurrently. Even if the number of SPs is not sufficient to run all threads at the same time, allocating their data on the memories will make them ready to run as soon as an SP becomes free.
The memories of GPU can be classified in three types according to the accessibility and the latency: 1) Memory of high accessibility and high latency: GPU is provided by a global memory ranges from 700 MB to 24 GB. All threads in GPU have the right to access the global memory. Actually, it is recommended to avoid the using of global memory for the reusable variables as possible due to its high-latency which reaches 400 cycles. 2) Memory of medium accessibility and low latency: Each SM has its own shared memory which is accessible only by the threads reside on the same SM. The threads belong to the same thread block can cooperate interchanging their data in shared memory. The shared memory is limited in terms of size which ranges usually from 64 KB and 128 KB, but fastest in terms of latency which ranges usually from 1 cycle to 32 cycles.
3) Memory of low accessibility and low latency: Each SM
is provided by a fastest and smallest memory called registers. The data allocated in registers for a thread are private and cannot be accessed by other threads. The size of this type of memory is very small; it ranges from 8KB to 64KB per SM. However, it operates at high speed of zero clock cycle for any access to the memory.
The major factors that impact directly on the performance of GPU depend on how the accessed data are arranged and where the access starts. Assume that the requested data for a warp are stored in contiguous units of memory and the first unit lies in position whose index is multiple of 128-bytes. In this case, the requested data for a warp can be fetched by a single transaction. Otherwise, many transactions used wasting a lot of time and increasing the latency. For boosting our performance regarding the memory accesses, the requested data have to be aligned and coalesced.
As we mentioned before the warp is executed in SIMD model. In SIMD model, all threads in a warp have to execute the same instruction on the same cycle. In other words, if the instruction contains a branch and only some of these threads meet the condition for this branch, then these threads would execute the branch of instruction and the others would be disabled. This event of enabling a part of the warp and disabling another part is called warp divergence. When we are dealing with divergences in GPU we must take into account that warp divergence occurs only within a warp. Different conditional values in different warps do not cause warp divergence. For the best performance, we must avoid warp divergence as much as possible.
From software point of view, the code that runs on GPU is launched in CPU code through a function called kernel. The kernel has two main arguments, the threads block size and the grid size which specify the number of threads per block and the number of threads blocks per grid respectively. Actually, we have two types of threads blocks; active threads blocks and inactive threads blocks. The threads blocks which are currently attached to the available resources of streaming multiprocessors such as registers and shared memory, are called active threads blocks, and the warps that contain active threads are called active warps. For a better performance we have to accommodate a large number of active warps as possible which helps in mitigating the impact of the latency generated by the inactive warps. The aspect of performance that related to active warps is called occupancy. Occupancy is the ratio of active warps to maximum number of warps per SM. To achieve better performance, we should increase the occupancy by accommodating large number of active warps.
IV. BOUNDED DISTANCE DECODING APPROACH FOR LWE
As we mentioned in Section I, there are three approaches for solving LWE: combinatorial, embedding and BDD approaches. In this section, we concentrate on the BDD approach. The concept of the BDD approach is based on solving LWE by reducing it to CVP. Formally, let B ∈ Z m×m be the matrix formed of m basis b 1 , b 2 , . . . , b m ∈ Z m of the lattice L, and t be a target vector ∈ R m . The BDD approach tries to find a vector v = m i=1 x i b i in the lattice L such that the Euclidean norm v − t is as shorter as possible.
One trivial implementation of the BDD approach is to enumerate all possible values of x i 's. Practically, this approach is unfeasible for high dimensions since it runs in time exponential in the dimension. However, it guarantees an exact solution (closest vector) of the LWE problem.
Babai [28] proposed an inductively algorithm called the Babai's Nearest Plane algorithm, Algorithm 1, for finding an approximate solution of a BDD instance in a polynomial time. The Euclidean norm of the output vector is within an exponential factor of the Euclidean norm of the closest vector.
Algorithm 1 Babai's Nearest Plane
. . , b m and the target vector t 2 Output: A closer vector v to t and the error e 1 12 return v such that v = m k=1 x k b k is a closer vector to t, and the error e 1
If v is the output vector of the Babai's Nearest Plane algorithm and e is the error, i.e., e = v − t , then the Babai's Nearest Plane
In the Babai's Nearest Plane algorithm, Algorithm 1, the output is found by applying the orthogonal projection of the target vector t onto span(b 1 , b 2 , . . . , b m ). The orthogonal projection is conducted in an inductive manner. More precisely, define the plane U = span(b 1 , b 2 , . . . , b m−1 ) and the hyperplane H c = U + cb m : c ∈ Z . The idea of the Babai's Nearest Plane algorithm is to construct a vector w ∈ {cb m : c ∈ Z} such that the distance from the target vector t to the plane U + w is minimal. In other words, determining an integer c such that the distance from t to the plane U + cb m is minimal. The integer c can be found by projecting the vector t on the axis {cb m : c ∈ Z} getting its coordinate, denoted by c , on that axis. Then the integer c is the nearest integer to that coordinate; c = c . Let t = t − w, now find out the closest vector v of t in the U plane. Inductively, the closest vector to the target t in the
On the other hand, Lindner and Peikert [5] introduced an algorithm for solving LWE. The algorithm is mainly based on adding some improvements on the Babai's Nearest Plane algorithm. The idea of Lindner and Peikert is to project the target vector onto the hyperplane several times rather than one time. Lindner The BDD enumeration can be represented as a weighted tree whose nodes are weighted by the value of x k . Each node 27 return v such that v = m k=1 x k min b k is a closer vector to t and A is connected to the next node by an edge, which goes either up or down. A node on the level k + 1 may have many descender nodes on the level k, depending on the value of x k ; the node with x k = c k − d k 2 is the first node and x k = c k + d k 2 is the last node. The search starts at the root of the tree at level m and moves down in depth-first search order toward the leaves that are exist at the bottom level 1. Each leaf reveals the values of all coefficients x 1 , x 2 , . . . , x m , so that the corresponding closer vector v = m k=1 x k b k is a potential solution. At level k between the root and the leaves level, the revealed coefficients are x k , x k+1 , . . . , x m . If x k ≤ c k max and e k < A, then the edge moves down and a new value x k−1 is revealed. Otherwise, the edge moves up, which means that the search tree is cut off at level k, i.e., this branch of the tree at level k will not lead to a potential solution.
In the Lindner-Peikert approach, we note that the number of children for a node at level k is based on setting a bound on the coefficient x k by using the predefined integer d k and the variable c k . In an attempt to improve the efficiency of the Lindner-Peikert algorithm, Liu and Nguyen [18] suggested to set a bound on the error length using the pruned-enumeration strategy. The idea of pruned-enumeration is based on cutting off sub-trees of the search tree in which the probability of finding a closest vector is very small. Formally, prunedenumeration replaces the inequality x k ≤ c k max , in Step 11 of Algorithm 2, by e k < R k where R 1 ≤ R 2 ≤ . . . ≤ R m = R are m real numbers predefined by the pruned-enumeration strategy.
For generalizing Algorithm 2, to be used for the pruned-enumeration strategy, we have to define d k as follows:
In case of the pruned-enumeration strategy, the computation of d k has to be set just after Step 15 of Algorithm 2.
In order to mitigate the search overload, we have to use the LLL reduction or the BKZ reduction as the first procedure applied on the input basis, the obtained reduced basis speeds up the enumeration search.
V. THE PROPOSED GPU-BASED ENUMERATION ALGORITHM
In this section, we present a new parallelization strategy to solve the LWE problem. In Subsection V-A, we show the main idea of the parallelization strategy using GPU. In Subsection V-B, we show some improvements on Algorithm 2 by using GPU features. In Subsection V-C, we describe a method for generating GPU-points such that each GPU-point is corresponding to a subtree search on GPU. In Subsection V-D, we present Algorithm 5 as a result of applying the proposed parallelization strategy on Algorithm 2 using GPU.
A. MAIN IDEA
In fact, the enumeration tree of Algorithm 2 comprises of m levels, starting at the root level and finishing at leaves level. Each level contains a large number of adjacent nodes that are independent, i.e., each node traverses its path independently. We can exploit this independency between the adjacent nodes, so that each node executes its code in parallel. Since GPU is capable to run concurrently thousands of threads, it can be fully exploited for implementing a parallel version of Algorithm 2.
The proposed parallel enumeration algorithm consists of two parts. The first part is the top-levels, which runs on CPU and the second part is the bottom-levels, which runs on GPU. A level α separates the two parts; see Figure. V-A. It is used as a barrier between them. On this level, the data are exchanged between the CPU and GPU. In the rest of this paper, we will call the nodes on the level α the GPU-points.
In fact, we have two famous approaches to take advantage of the parallelism in GPU: the fine-grain approach and coarse-grained approach. The fine-grain approach is appropriate in case the task can be broken down into small sub-tasks in terms of code size and execution time, and the data are transferred frequently between the sub-tasks. In contrast, the coarse-grain approach provides much performance improvements in case the sub-tasks need large amount of computations, and the data communications are infre- quently. Although the fine-grain approach seems to be most appropriate to be adopted in our case, the main obstacle is the rapid growth of the GPU-points with respect to the level. The GPU-points are exponentially increased as the level selected even further down in the tree. In addition, the GPU resources are limited, so GPU could not accommodate the massive quantity of data generated after each level. Therefore, we used corse-grain parallelism strategy, where each GPU-point on level α corresponds to a sub-tree in GPU. We expect that the corse-grain approach is better due to the following reasons:
1) The quantity of the GPU-points (which is corresponding to the sub-trees on GPU) increases exponentially with respect to the levels. 2) Each sub-tree search can be implemented individually on a single GPU core.
3) The data exchange between different sub-trees is limited, and no global synchronization is required.
B. IMPROVED BDD ENUMERATION ALGORITHM
The implementation of Algorithm 2 is based on CVP's implementation of fplll [29] . For efficiency, rather than using the vectors and the inner product of the vectors as in Steps 13 and 15 of Algorithm 2, we use the coordination (v 1 , v 2 , . . . , v m ) of the target t with respect to the Gram-Schmidt basis of the lattice such that:
Now by substituting the value of t as composition of coefficients v i 's and the vectors b * i 's, we can replace the equality in Step 15 of Algorithm 2 by the following:
Eq. (6) is more convenient for the GPU implementation since the inner product is avoided and the amount of computation and data are reduced. There is no need to store the vectors b * i and t i+1 in the memory, so Step 13 of Algorithm 2 can be removed. See Algorithm 3, Steps 15, 16, 17 and 18.
Additionally, for enhancing the performance of Algorithm 2, we have to make use of the registers and the shared memory.
Actually, making use of registers in Algorithm 2 is restricted due to the fact that arrays are stored in registers only if: the indices used to reference the array are constant and can be determined at compile time [30] . Since the most used variables in our implementation are arrays, the only way to store them in registers is to convert them to scalar variables. For example, rather than defining an array e k , where 1 ≤ k ≤ m, we can use a scalar variable e and trace its values for any k as follows: Replace Steps 8 and For the shared memory, when we fix α, for each thread in GPU, the coefficients (x 1 , x 2 , . . . , x m ) are used frequently in two operations; reading (see Steps 11, 17, 23, 26 and 28 of Algorithm 3) and writing (see steps 21 and 28 of Algorithm 3). Therefore, it will be better to store them in the shared memory in order to speed up data transfer rates. Since the shared memory is very limited, and storing arrays especially for big dimensions may affect the number of active threads in the kernel and subsequently the occupancy is affected. Since the GPU algorithm works mostly on the levels below α, we can mitigate the use of the shred memory by storing only the values (x 1 , x 2 , . . . , x α ) in the shared memory, while (x α+1 , x α+2 , . . . , x m ) are stored in the global memory.
We will show experimentally the effects of these proposed improvements in Subsection VI-C.
C. GENERATING GPU-POINTS
For generating the GPU-points, we use Algorithm 4 which is implemented in depth-first search order. The algorithm moves down from the root to the barrier level α. There are two cases for x α ; see Step 12 of Algorithm 4.
1) If x α ≤ c α max and e α < A. Then the corresponding GPU-point p is accepted and added to the batch of the GPU-points. The next GPU-point can be generated without needing to traverse the tree since x α for the next GPU-point is calculated just by adding 1 to the current x α ; see Algorithm 4, Step 28. As long as the inequalities x α ≤ c α max and e α < A are held, the next GPU-point is generated by the same way; see Algorithm 4, Steps 23, 24 and 28. 
D. FINAL ALGORITHM
Practically, the amount of the GPU-points is extremely large, so they cannot be sent at the same time to be executed on GPU. The possible way is to send them in batches until they all are covered. When a batch achieves its work on GPU, it is conveyed back to CPU. Then on CPU, the GPU-points whose tasks are completed in GPU are replaced by other GPU-points. Then the new batch of GPU-points are sent to GPU to be executed. Thus, the execution of our parallel enumeration algorithm is carried out synchronously switching between CPU and GPU parts until all GPU-points on level α are generated, sent to GPU and their corresponding searches on GPU are completed. In each switch to CPU, a batch of GPU-points is generated. The number of GPU-points in a batch is fixed based on GPU resources. In GPU, each subtree which is corresponding to a GPU-point is executed in parallel; see Algorithm 5.
Although the GPU-points are substantially increased as level α is selected even further down in the tree, the GPU-points produced on low levels guarantees a low divergence during the execution on GPU, for more details, see Subsection VI-C. The most challenging task is how to choose 11 Copy S to GPU memories and launch the GPU's kernel 12 while number of GPU-points that complete their searches > η do 13 Each thread is assigned to a GPU-point and runs Algorithm 3 at level α. When a cut off occurs or the thread comes back to the level α, the thread terminates the search. 14 end while 15 Leave GPU's kernel 16 Receive the output from the GPU and copy it to CPU, precisely, in set S 17 S = S − {GPU -points that complete their searches on GPU } 18 end while level α that strikes the balance between the overhead of communications between CPU and GPU due to the rapid growth and the occurrence of divergence in GPU stage. The better choice of level α can be determined experimentally.
Actually, in CUDA, the number of threads that can run concurrently is determined according to the data size and the block size. In other words, after allocating the data on GPU and setting the block size; CUDA determines automatically the number of blocks that can be executed on GPU at the same time. Although the global memory in GPU is not small, the shared memory and registers are limited. So the number of threads that can be executed on GPU at the same time depends on the available resources (the shared memory and registers).
For detailed description, assume that GPU can hold only data of X GPU-points in the global memory, but only data of h of them can be held in the shared memory and registers, where h < X . In this case, the proposed strategy in [33] is to allocate the data of X GPU-points in the global memory and set the number of executed threads of the kernel on GPU to h. Thus, the data of h GPU-points will be copied from the global memory to the shared memory and registers once the execution on GPU is launched. When any element of h GPU-points completes its execution on GPU, its data are replaced by the data of an element of X −h GPU-points in the global memory. The data of the old GPU-points are released and the data of the new GPU-points are copied from the global memory to the shared memory and registers, then the thread is executed on the copied data. By the same way the data of all X − h GPU-points in the global memory will be copied to the shared memory and registers and used for the execution on GPU.
Algorithms 5 and 4 are mainly built-up for a single GPU. However, they can be executed on more than a single GPU. The batches of GPU-points that are generated on level α can be distributed on multi-GPUs rather than a single GPU. On CPU we need to use multi-threads such that each thread is used for a specific GPU. In addition, the shared variables of CPU between the threads are needed to show the position of the last generated GPU-point on level α. The shared variables of CPU must be locked before any writing process. Since the GPU-points used for the multi-GPUs are generated on the same level, the shared variables of CPU will be increasingly required at the same time by many threads. Consequently, a big intersection between the threads for accessing the shared variables of CPU will happen, and that will affect the performance of the algorithm. An alternative solution for mitigating those intersections is to generate seeds for the threads on a top-level called β. Each seed is used for a specific thread; the seed enables us to generate many batches of GPU-points without need to access the shared variables of CPU. Once the GPU-points that can be generated by a seed is over, another seed will be generated and only in this case, the threads need to access the shared variables of CPU.
VI. EXPERIMENTAL RESULTS
This section presents the results obtained by using Algorithm 5 for the parallel implementation and Algorithm 2 for the sequential implementation. In Subsection VI-A, we present briefly the platform specification and data set used in our implementation. In Subsection VI-B, we show the execution times of Algorithms 5 on a single GPU and two GPUs as well as the execution time of Algorithm 2 using the sequential implementation provided by Kirshanova et al. on a our CPU. In Subsection VI-C, we show experimentally the effects of the proposed improvements on Algorithm 5 using GPU features, such as registers, shared memory, coalescing access, and so on. We go through the different stages of the improvements showing how each stage contributes in improving Algorithm 5 to obtain the final results as in Tables 1 and 2. A. PLATFORM SPECIFICATION AND DATA SET The implementations of Algorithms 5 and 2 were conducted on a PC with Intel core i5-3750 processor, 3.40GHz, 8GB RAM, Ubuntu Linux 16.10. The PC is coupled with two NVIDIA GeForce GTX 1060 6G GPUs each contains 1280 CUDA cores (10 multiprocessors with 128 cores each), a clock speed of 1.5GHz, and a warp size of 32 threads.
The CPU code was written in C++ and compiled using g++ version 5.4, while the parallel implementation is examined on GPU using CUDA 9.1 and compiled using nvcc compiler. For handling the multi-threading in CPU we use OpenMP [31] .
The data set used in our implementation was generated using the package provided by Kirshanova et al. [1] . The dimension n ranges from 40 to 100 with number of samples m changes between 70 and 210; see Tables 1 and 2 . The parameter q = 4093 and the standard deviation s has been set in the Lindner-Peikert algorithm to 6 and 5 for n ∈ {40, 50, 60, 70} and n ∈ {80, 90, 100} respectively, while in the pruned-enumeration strategy is set to 6, 5 and 4 for n ∈ {40, 50, 60, 70}, n ∈ {80} and n ∈ {90, 100} respectively.
B. IMPLEMENTATION
In this subsection, we describe in details the implementation of Algorithms 2 and 5. We start this section by mention some important parameters and considerations that have effects on the evaluations:
1) The level α on which we create the GPU-points. The level α is selected according to experimental results, i.e., for one or two samples, we run our proposal implementation with many levels. Based on the results, we complete the execution of other samples on the best level. For example, for dimensions m = 140 and n = 70 the level 125 seems much better, see Table 7 . Generally, based on experimental results, we found that the levels between 20 and 33 are the better choices.
2) The level β on which the seeds for multi-threads are generated. The level β is set to 2. 3) Let h be the size of transferred GPU-points to GPU for each calling of kernel; see Algorithms 4 and 5, Steps 25 and 5 respectively. We choose h equals to 400000. 4) Algorithm 5 using the pruned-enumeration strategy was implemented using the same parameters (n, m, q and s) as in [1] except for n = 100; see Table 1 in [1] . 5) For improving the quality of the lattice basis, it is recommended to use a reduction on the LWE's sam-pler before launching the search. The BKZ algorithm is considered as one of the most practical reduction algorithm. We found that the best implementation of BKZ is provided by Shoup in the NTL library [32] .
We made three types of implementations:
1) Using a single CPU (one core).
2) Using a single CPU (one core) and a single GPU.
3) Using a single CPU (two cores) and two GPUs.
For the first type of implementation, which is on CPU only, we used the sequential implementation of the Lindner-Peikert and pruned-enumeration strategies provided by Kirshanova et al. in [1] . For the second and third types of implementations, which are on CPU and GPU, we have used our algorithm, Algorithm 5.
The correctness of the parallel implementation of our proposal is verified by comparing our outputs to the outputs of the sequential implementation provided by Kirshanova et al. on our CPU.
The data in Tables 1 and 2 show the following results:
1) The single GPU speeds up the execution time using the Lindner-Peikert and pruned-enumeration strategies by a factor of almost 6 and 9 respectively. 2) Two GPUs speeds up the execution time using the Lindner-Peikert and pruned-enumeration strategies by a factor of almost 13 and 16 respectively. 3) Two GPUs achieved speedup around 2 times than the speedup achieved by a single GPU. Sometimes the speedup is higher than 2 when using two GPUs instead of one GPU since using two GPUs allows to reveal more quickly a better value of the bound A. 4) Implementing Algorithm 5 using the pruned-enumeration strategy and GPUs is more efficient than Kirshanova's implementation using many cores. For examples, a) If n = 70 and m = 140, Kirshanova's implementation of pruned-enumeration using 10 threads (cores) takes 300s, while our implementation of pruned-enumeration using a single GPU takes only 97s; see Table 2 in this paper and Table 1 in [1] . b) If n = 80 and m = 150, Kirshanova's implementation of pruned-enumeration using 20 threads (cores) takes 3000s, while our implementation of pruned-enumeration using two GPUs takes only 2323s; see Table 2 in this paper and Table 1 
C. PERFORMANCE ANALYSIS
In this subsection, we show the performance analysis in term of the achieved improvements. Actually, the improvements in Tables 1 and 2 are achieved by using the different features of GPU such as registers, the shared memory, coalescing access, and so on. In order to measure the improvements, we use the execution time as well as other GPU factors, e.g., branches efficiency, occupancy, gld_efficiency and gst_efficiency. In fact, these GPU factors are used to evaluate the efficiency of our implementation with respect to divergence, occupancy, global memory access. The following formulas [30] can be used to measure these GPU factors. The experimental analysis performed on a single GPU with the same data set that used in the previous subsection. We improve the implementations of Algorithm 5 and Algorithm 2 through the following stages: F1 Straight: Using a straight implementation of Algorithm 2. F2 Registers: Using registers to improve the implementation of Algorithm 2; see Subsection V-B.
F3 Coalescing access of global memory: Using the strategy of coalescing access of the global memory to improve the implementation of Algorithm 2. F4 Shared memory: Using the shared memory to improve the implementation of Algorithm 2; see Subsection V-B. F5 Reducing the size of arrays stored in the shared memory:
this improvement is explained in the last paragraph of Section IV; see Subsection V-B. F6 Reducing CPU-GPU communications: CPU-GPU communications can be substantially reduced as follows, for each run of GPU's kernel, the kernel continues running until 95% of the sub-trees terminates their search. In other words, after each run of GPU's kernel, only 5% of the sub-trees does not finish their search; let us call this type of sub-tree unterminated sub-trees. Formally, for any execution of GPU's kernel, the execution is terminated, once the number of unterminated subtrees becomes less than or equal to Z , where Z is a small positive integer. For example, if the total number of the sub-trees sent to GPU is 400000, we can set Z = 20000. F7 Generating some GPU-points (that need one step to be generated) in GPU rather than CPU: This improvement is used to reduce the CPU-GPU communications by generating some GPU-points in GPU. Actually, some GPU-points can be anticipated from their adjacent points without need to traverse the tree. The x k for any GPU-point of this type is calculated just by incrementing the x k by one; see Algorithm 4, Step 28.
The data in Tables 3, 4 , 5 and 6 show the following results:
1) Using registers speeds up the execution time by factor of almost 1.6. For example, in Table 6 , the execution times with registers and without registers are 6483s and 10532s respectively. 2) In coalescing access of the global memory, we can distinguish between two cases: a) Coalescing access without reducing CPU-GPU communications. In this case, the improvement of coalescing access is not ensured, especially in high dimensions. For example, in Table 6 , in the second and third rows; the execution times with coalescing and without coalescing are 6958s and 6483s respectively. b) Coalescing access with reducing CPU-GPU communications. In this case, the improvement of coalescing access speeds up the execution time by factor of almost 1.1. For example, in the last two rows of Table 3 , the execution times with coalescing and without coalescing are 2.65s and 2.85s respectively.
In fact, the lack of an improvement in the first case is due to the following reason: for accessing the global memory of GPU in coalescing model, the data need to be arranged in CPU each time before being sent to GPU. This operation of arranging data takes time in CPU due to the irregular memory access on CPU. Thus, if we reduce CPU-GPU communications, the operation of arranging data in CPU is reduced, so the improvement of coalescing access of the global memory appears in the results, as in the last row of Tables 3, 4 , 5 and 6. 3) Using shared memory speeds up the execution time by factor of almost 1.2. For example, in Table 3 , the execution times with shared memory and without shared memory are 5.08s and 6.73s respectively. The decreasing of occupancy after using the shared memory is due to that the shared memory is scarce resource in the SM. Thereby the number of the active warps per stream processor is decreased. Although there is no improvement in global load efficiency and global store efficiency, the execution time is better. That because of, even with the shared memory, the global memory is accessed to set the initial values of data that are stored in the shared memory, and also to transfer the output data from the shared memory to the global memory.
With the shared memory, the difference is that the inner data manipulation is done through L1 cache, while without the shared memory, the data manipulation is done through L2 cache which is slower. 4) Reducing the size of arrays stored in the shared memory speeds up the execution time by factor of almost 1.5. For example, in Table 6 , the execution times with reducing the size of arrays and without reducing the size of arrays are 4426s and 6087s respectively. 5) Reducing CPU-GPU communications speeds up the execution time by factor of almost 1.1. For example, in Table 6 , the execution times with reducing CPU-GPU communications and without reducing CPU-GPU communications are 4187s and 4426s respectively. 6) Generating some GPU-points in GPU speeds up the execution time by factor of almost 1.2. For example, in Table 6 , the execution times with generating some GPU-points in GPU and without generating some GPU-points in GPU are 3717s and 4187s respectively. Finally, Tables 7 and 8 show the effect of parameter α on the performance of the parallel implementation. According to the results of Tables 7 and 8 , global load efficiency, global store efficiency, divergence and occupancy increase as the level α decreases, but the cost of CPU-GPU communications increases, so the execution time increases as well.
VII. CONCLUSION AND FUTURE WORK
In this paper, we have proposed a new parallelization strategy of the BDD enumeration algorithm using the high computation power of GPU. We have shown that significant speedups are achieved using GPU. Our experimental results showed that a single GPU speeds up a sequential solution of LWE by a factor of almost 6 in the Lindner-Peikert strategy and 9 in the pruned-enumeration strategy. Additionally, the scalability was significantly reached; an additional GPU has improved the scalability further. Our experimental results showed that two GPUs provide speedups around 2 times than the speedups achieved by a single GPU. Moreover, experimental results showed that our implantation using two GPUs is more efficient than Kirshanova's implementation using 20 CPU cores. For the future work, it remains to combine both multi-GPUs and multi-cores for implementing our parallel algorithm, especially since the most computers are equipped with many cores.
