Abstract-Should quantum computing become viable, current public-key cryptographic schemes will no longer be secure. Since cryptosystems take many years to mature, research on post-quantum cryptography is now more important than ever. Herein, we focus on improving the efficiency of lattice-based cryptography, an alternative post-quantum cryptosystem. We put together several theoretical developments so as to produce an efficient implementation that solves the Closest Vector Problem (CVP) on Goldreich-Goldwasser-Halevi (GGH)-like cryptosystems based on the Residue Number System (RNS). We were able to produce speed-ups of up to 5.9 and 11.2 on the GTX 780 Ti and i7 4770K devices, respectively, when compared to a single-core optimized implementation. Finally, we show that the proposed implementation is a competitive alternative to Rivest-Shamir-Adleman (RSA).
I. INTRODUCTION
Most public-key cryptosystems currently in use, such as RSA, rely on the intractability of factoring and computing discrete logarithms. However, in 1994, Shor proposed efficient quantum algorithms to solve these problems [1]. Hence, should quantum computing become viable, the cryptosystems currently in use will be broken. As such, research on efficient post-quantum public-key cryptosystems is most valuable.
Lattice-Based Cryptosystems (LBCs) hold great promise for post-quantum computing since lattice-based problems are thought to be hard even in a quantum computing setting. This type of cryptography appeared during the 90s, with proposals such as GGH [2] and NTRU [3]. In GGH-like approaches, a plaintext is a small "error" added to a vector of a lattice. The public key corresponds to a "bad basis" of the lattice, with which solving the CVP is hard, and the private key is a "good basis", which makes it possible to compute the closest vector for small errors. Decryption simply consists of solving the CVP.
The security of GGH and NTRU is assumed to rely on the hardness of the CVP, but no reduction proof exists yet. In practice, the size of the parameters depends on the complexity of the best known attacks, such as lattice reductions. In their first form, the GGH signature and encryption schemes were severely broken [4], [5], and thwarting these attacks is still a current concern [6]-[8]. Recently, GGH has received many improvements [9]-[11] that make it competitive under secure parameters. This article deals with the efficient implementation of the GGH decryption function, via an arithmetical approach as suggested in [12].
The RNS allows large integer arithmetic to be split over many small finite fields, enabling the exploitation of the integer-arithmetic architectural optimizations of most programmable platforms, while removing the overhead associated with multi-precision integers. Further, since each small finite field operates independently, the RNS lends itself to data parallelism. Since both multi-core Central Processing Units (CPUs) and Graphics Processing Units (GPUs) have become widely available during the past decades [13], herein, the aforementioned RNS lattice-based cryptographic approaches are analyzed and further developed, and their performance is evaluated on multi-core CPUs and GPUs.
The rest of this paper is organized as follows. In Section II, lattice theory and the RNS are reviewed. In Section III, GGH decryption is analyzed, with a focus on how the RNS may be exploited. This procedure is parallelized and implemented in Section IV, and evaluated and compared to related work in Section V. In Section VI, its performance is compared to that of RSA. Finally, conclusions are drawn in Section VII.
II. BACKGROUND
A full-rank lattice $L$ is defined as the set of all integral combinations of $n$ linearly independent vectors $r_0, \ldots, r_{n-1} \in \mathbb{R}^n$. If the basis is represented as a matrix $R$, having the basis vectors as rows, the lattice generated by $R$ can be defined as $L(R) = \{xR : x \in \mathbb{Z}^n\}$, where $x$ is represented as a row vector, and $xR$ denotes the usual vector-matrix multiplication. Herein, the vectors $r_0, \ldots, r_{n-1}$ are restricted to $\mathbb{Z}^n$. If $U$ is a unimodular matrix (i.e., an integer square matrix with determinant $\pm 1$), the bases $R$ and $UR$ generate the same lattice. In fact, any lattice admits an infinite number of bases as soon as $n \geq 2$. Additionally, due to its periodicity, every lattice $L$ induces an equivalence relation over $\mathbb{Z}^n$ defined as follows: $v \equiv_L w$ if and only if $v - w \in L$. Reducing a vector 
A. Lattice-Based Cryptography
A lattice-based encryption scheme will now be described, using Rose's approach [10]. The private basis, $R$, is produced as a rotated nearly-orthogonal basis, such that Babai's Round-Off procedure [14] may be used to compute the closest vector. Moreover, the public basis is in Hermite Normal Form (HNF). The HNF is a basis of $L(R)$, $B \in \mathbb{Z}^{n \times n}$, such that
Since the public key is of an OHNF, it is representable by a single column, denoted as $B_0$. A plain-text is then represented as a vector $p \in \mathbb{Z}^n$. To encrypt it, $p$ is reduced modulo the public basis, by applying the algorithm of Figure 1. Due to the structure of the public basis, the cryptogram corresponds to a vector in which all entries but the first are zero. Thus, a scalar $c$ suffices to store its value. In order to decipher $c$, Babai's Round-Off algorithm is applied. This procedure, represented in Figure 2, gives an approximation to the CVP. If
then the algorithm produces the correct closest vector [10] .
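As an illustration of the Round-Off step, the following is a minimal sketch that assumes exact rational arithmetic (Python's `fractions` module) instead of multi-precision floating point. The function names `solve_row_system` and `babai_round_off`, and the toy 2-dimensional basis used in the note after the block, are hypothetical and chosen only for illustration; such a basis is far too small to be secure.

```python
from fractions import Fraction
from math import floor


def solve_row_system(R, c):
    """Return the rational row vector y with y R = c (exact Gauss-Jordan elimination)."""
    n = len(R)
    # Augmented matrix of the transposed system R^T y^T = c^T.
    A = [[Fraction(R[j][i]) for j in range(n)] + [Fraction(c[i])] for i in range(n)]
    for col in range(n):
        # Pick a row with a non-zero pivot (R is assumed to be non-singular).
        piv = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[piv] = A[piv], A[col]
        inv = 1 / A[col][col]
        A[col] = [v * inv for v in A[col]]
        for r in range(n):
            if r != col and A[r][col] != 0:
                f = A[r][col]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[col])]
    return [A[i][n] for i in range(n)]


def babai_round_off(R, c):
    """Approximate the lattice vector of L(R) closest to c; return (w, c - w)."""
    n = len(R)
    y = solve_row_system(R, c)                      # y = c R^{-1}
    x = [floor(v + Fraction(1, 2)) for v in y]      # round each coordinate
    w = [sum(x[i] * R[i][j] for i in range(n)) for j in range(n)]
    p = [ci - wi for ci, wi in zip(c, w)]           # recovered small error
    return w, p
```

For the assumed basis `R = [[7, 1], [-1, 8]]`, the lattice vector `(23, -13)` perturbed by the error `(1, -1)` gives `c = (24, -14)`, and `babai_round_off(R, [24, -14])` recovers both the lattice vector and the error.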
B. Residue Number System
The computation of multi-precision arithmetic may be split among several channels by exploiting the RNS. Under this system, numbers are represented as their remainders when divided by the set of co-primes $m_0, m_1, \ldots, m_{s-1}$ that form the RNS basis. Additions, subtractions and multiplications modulo $M = \prod_{i=0}^{s-1} m_i$ can then be performed independently for each modulus of the set. Furthermore, the Chinese Remainder Theorem (CRT) provides a formula to recover the value of $A$ from $a_i = A \bmod m_i$ for $0 \leq i < s$, for $0 \leq A < M$:
$$A = \sum_{i=0}^{s-1} \left| a_i \left( \frac{M}{m_i} \right)^{-1} \right|_{m_i} \frac{M}{m_i} - kM \qquad (2)$$
where $\left| (M/m_i)^{-1} \right|_{m_i}$ is the multiplicative inverse of $M/m_i$ modulo $m_i$, and $k$ is the integer that brings the result into the range $0 \leq A < M$. Modular reduction may be implemented using an adaptation of Montgomery's algorithm [15] to RNS [16]. The RNS Montgomery algorithm replaces the modulus of the reduction, $D$, by another more suitable modulus $M_1$. With that purpose, $Q$ is defined to be the value that satisfies $A + QD \equiv 0 \pmod{M_1}$, for $M_1 \times D > A$ and $M_1$ co-prime to $D$. The value of $Q$ may be computed using an RNS set $M_1$ such that 
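The channel-wise arithmetic and the CRT reconstruction described in this subsection can be sketched as follows. This is only an illustration under assumed names (`to_rns`, `from_rns`, `rns_add`, `rns_mul`) and an assumed toy basis; it does not reproduce the RNS Montgomery reduction of [16].

```python
from math import prod


def to_rns(a, moduli):
    """Represent a by its residues modulo each (pairwise co-prime) modulus."""
    return [a % m for m in moduli]


def rns_add(x, y, moduli):
    """Channel-wise addition: each channel is independent of the others."""
    return [(a + b) % m for a, b, m in zip(x, y, moduli)]


def rns_mul(x, y, moduli):
    """Channel-wise multiplication: each channel is independent of the others."""
    return [(a * b) % m for a, b, m in zip(x, y, moduli)]


def from_rns(residues, moduli):
    """Recover 0 <= A < M from its residues using the CRT."""
    M = prod(moduli)
    total = 0
    for a_i, m_i in zip(residues, moduli):
        M_i = M // m_i
        inv = pow(M_i, -1, m_i)            # multiplicative inverse of M/m_i mod m_i
        total += ((a_i * inv) % m_i) * M_i
    # total equals A + k*M for some 0 <= k < s; the final reduction removes k*M.
    return total % M
```

With the assumed basis `[13, 17, 19]` (so that M = 4199), `from_rns(rns_mul(to_rns(1234, m), to_rns(3, m), m), m)` yields 3702, since the product stays below M. Skipping the final `% M` corresponds to setting k to zero, which leaves an error that is a multiple of M.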
III. RNS BABAI'S ROUND-OFF
Exploiting RNS to accelerate Babai's Round-Off algorithm requires the computation of Figure 2 to be converted to integer arithmetic [12] .
is rewritten as:
where v 1 = (1, . . . , 1). Furthermore, since p is restricted as stated in (1), if β > 2 √ n 2 −1 then the computation of Figure  2 may be performed modulo β. In this work, the value β was selected to be m 2,0 , that is, the value of the first element of the second RNS Montgomery basis.
A description of the resulting algorithm can be found in Figure 3. The value of $v_1$ corresponds to $\det(R)$, and is used to compute $a \leftarrow 2cR_0 + (v_1, \ldots, v_1) \bmod m_{2,0}$. Afterward, the value of $a \bmod D_R$ is determined, where $D_R = 2 \times \det(R)$, using RNS. In order to compute this value,
$M_1$ refers to the first Montgomery basis, and $m_3$ is an extra modulus that is introduced to simplify the reduction operation. The extra $M_1 m_3$ term is eliminated during the Montgomery reduction.
The ReduceModDR function is described in Figure 4 . First,
Afterward, $q_1$ is extended to the second basis $M_2$, by evaluating (2) for each of the moduli of $M_2$. In the figure, $m_{j,i}$ corresponds to the $(i + 1)$-th modulus of basis $j$. It should be noted that the value of $k$ in (2) is set to zero, and therefore there will be an extension error less than $(s - 1)M_1$ [16]. As such,
is bounded by $a_2 < (s + 1)D_R$. Hence, $a_2$ should be reduced a second time, by using an extra modulus $m_3$. If m 3 > (s + 1)
6: p ← (c, 0, . . . , 0) − bR mod m2,0
7: return p
Fig. 3. Improved Decryption Algorithm
R a 3 mod m 3 . If $a_2 < 0$, $D_R$ must be added so that the result is fully reduced. When $a_2 < 0$, its RNS representation corresponds to
, where $\bar{a}_{2,s-1}$ denotes the $s$-th MRS digit of $a_2$. In the algorithm, the RNS digits are overwritten with the MRS digits, which are computed in the innermost loop. Afterward, $D_R$ is added to the result modulo $m_{2,0}$ if $a_2 < 0$. Subsequent to the addition, the value returned from the function in Figure 4 corresponds to $a \bmod D_R \bmod m_{2,0}$. The plain-text $p$ is then evaluated as $p \leftarrow (c, 0, \ldots, 0) - \frac{a - (a \bmod D_R)}{2\det(R)} R \bmod m_{2,0}$.
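The reason a reduction modulo $D_R = 2\det(R)$ suffices to perform the rounding of Babai's Round-Off with integers only can be checked with a small sketch. Here `t` stands for one entry of $cR_0$ and `d` for $\det(R)$; the function name `round_div` is hypothetical, and `d > 0` is an assumption of this sketch.

```python
def round_div(t, d):
    """Nearest integer to t / d (ties rounded up), using integer operations only.

    Mirrors the steps in the text: a = 2t + d, then a mod D with D = 2d,
    then the exact division (a - a mod D) / D. Assumes d > 0.
    """
    a = 2 * t + d
    D = 2 * d
    a_red = a % D          # Python's % is non-negative for a positive modulus
    return (a - a_red) // D
```

For instance, with `d = 57`, `round_div(178, 57)` gives 3 and `round_div(-122, 57)` gives -2, matching the rounding of 178/57 ≈ 3.12 and -122/57 ≈ -2.14.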
IV. PARALLELIZATION AND IMPLEMENTATION DETAILS
In this section, parallelism is exploited to speed up the execution of the presented algorithms. Several Application Programming Interfaces (APIs) were used in this work to exploit different levels of data parallelism, namely OpenCL [17] for GPU programming, OpenMP [18] for exploiting multi-threaded CPU parallelism, and AVX2 [19] for Single Instruction Multiple Data (SIMD) parallelism.
zL = z & (2^l − 1); zH = z >> l; z = zL + (2^l − mi) zH
4: end while
5: return z ← min(z, z − mi)
Fig. 5. GPU Modular Reduction

A. GPU Approach
As a first approach to the implementation of Babai's Round-Off algorithm, the CPU offloads the execution of the ReduceModDR function to the GPU, transferring the required data. While the GPU executes line 4 of the algorithm in Figure 3, the CPU simultaneously executes line 3. After synchronizing the CPU and GPU operations, which ensures that the ReduceModDR results have been fully transferred to the CPU, the computation of lines 5 and 6 takes place on the CPU.
The computation of a ← 2cR 0 + (v 1 , . . . , v 1 ) mod m 2,0 was split among the cores of the CPU. Each core computed a subset of the result.
Modular reductions of $z$ in channel $m_i$ on the GPU were performed using the algorithm in Figure 5. Further, the values of the moduli were selected such that $2^{l-1} < m_i < 2^l$; the operation $\min(z, z - m_i)$ was performed using unsigned arithmetic, so that a negative difference wraps around to a large value and is discarded; and the while loop therein was unrolled.
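The reduction of Figure 5 can be sketched as below. Since the loop condition is not reproduced above, the sketch assumes that the loop runs while the high part of `z` is non-zero; the name `reduce_mod` and the 32-bit mask used to emulate unsigned wrap-around are likewise assumptions.

```python
def reduce_mod(z, m, l):
    """Reduce a non-negative z modulo m, for a modulus with 2^(l-1) < m < 2^l."""
    assert (1 << (l - 1)) < m < (1 << l)
    low_mask = (1 << l) - 1
    r = (1 << l) - m                 # 2^l is congruent to r modulo m
    while z >> l:                    # assumed condition: high part still non-zero
        z_low = z & low_mask
        z_high = z >> l
        z = z_low + r * z_high       # same residue class modulo m, strictly smaller
    # Here z < 2^l < 2m, so at most one subtraction of m remains.
    # Emulate 32-bit unsigned arithmetic: a negative z - m wraps to a huge value.
    wrapped = (z - m) & 0xFFFFFFFF
    return min(z, wrapped)
```

With `l = 16` and the assumed modulus `m = 65521`, `reduce_mod(2**32 - 1, 65521, 16)` returns the same value as `(2**32 - 1) % 65521`.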
The ReduceModDR function was implemented as an OpenCL kernel. Each work-group was associated with a single dimension, and each work-item with a modulus of $M_1$ and another modulus of $M_2$. The resulting kernel only requires 3 barriers: after lines 4, 9 and 14 of Figure 4. Moreover, lines 7 up to 9, and 16 up to 20, are executed on a single thread. Lastly, since the considered GPUs operate on words of at most 32 bits, the value $l$ of Figure 5 was set to $l = 16$, so that multiplications do not overflow the result.
After the reduction result is transferred to the CPU, $b \leftarrow \frac{a - (a \bmod D_R)}{2\det(R)} \bmod m_{2,0}$ is jointly computed by multiple threads on the available CPU cores. Then the vector-matrix multiplication $bR \bmod m_{2,0}$ takes place: each core multiplies a set of entries of $b$ by the corresponding rows of $R$, and afterward the partial results of each core are added to produce the multiplication result. Finally, the value of $p \leftarrow (c, 0, \ldots, 0) - bR \bmod m_{2,0}$ is determined, with each core computing a subset of the final result.
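The decomposition of the vector-matrix product into per-core partial results can be sketched as follows. The sketch uses Python's `concurrent.futures` thread pool purely to illustrate the partitioning; it is not the OpenMP implementation used in this work, and the names `partial_product` and `vec_mat_mod` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_product(b, R, rows, modulus):
    """Accumulate b[i] * R[i] over the given row indices, modulo `modulus`."""
    n = len(R[0])
    acc = [0] * n
    for i in rows:
        for j in range(n):
            acc[j] = (acc[j] + b[i] * R[i][j]) % modulus
    return acc


def vec_mat_mod(b, R, modulus, workers=4):
    """Compute b R mod `modulus` by summing the partial results of each worker."""
    n = len(b)
    # Worker k handles rows k, k + workers, k + 2*workers, ...
    chunks = [range(k, n, workers) for k in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(
            pool.map(lambda rows: partial_product(b, R, rows, modulus), chunks)
        )
    # Final reduction step: add the partial vectors column by column.
    return [sum(col) % modulus for col in zip(*partials)]
```

The final summation plays the role of the step in which the partial results of the cores are added; in a shared-memory implementation this accumulation is the part that needs to be protected against concurrent updates.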
B. CPU Approach
The second approach herein presented is similar to the one for the GPU, except that all computation takes place on the CPU. The steps that were executed on the CPU in Section IV-A take place in a similar way. Additionally, the ReduceModDR function, which still made use of the RNS, was enhanced with multi-threading, with each core computing part of the loop iterations in line 3 of Figure 4 .
The OpenMP #pragma omp for directive was used to split the multi-dimensional computation of $a \leftarrow 2cR_0 + (v_1, \ldots, v_1) \bmod m_{2,0}$. The vector-matrix multiplication $bR$ required not only the use of #pragma omp for but also of #pragma omp critical for the sum of the threads' partial results. For executing the ReduceModDR function on the CPU, a #pragma omp for directive was applied to line 3 of Figure 4. It should be noted that, since the targeted CPUs feature 64-bit datapaths, the bit-width of the moduli was changed to 32 bits.
1) SIMD Parallelism: SIMD parallelism was used to enhance the execution on the CPU. Another method was implemented, similar to the previous one, but with ReduceModDR modified to exploit SIMD extensions. First, it was possible to process multiple channels at a time for the steps in lines 4, 6, 10 and 13 of Figure 4. Second, it was possible to accelerate all the summations by splitting each of them into multiple partial summations and performing those in parallel.
In order to perform multiple operations in parallel, data was loaded into the AVX2 registers using the vmovdqu instruction, which loads 256 bits from memory to a register. Then, words were rearranged using vpshufd so that the 32 most significant bits of each 64-bit word were set to zero. Multiplications and additions may afterward take place without overflowing the result lanes. Modular reductions after multiplications were performed using the algorithm of Figure 5. Finally, when the desired result is obtained, registers are rearranged using vpshufd and vpunpckldq, and stored with vmovdqu.
V. EXPERIMENTAL RESULTS
The proposed methods were implemented and thoroughly tested. Also, the sequential method of Figure 2, which does not make use of RNS, was implemented using the NTL 6.2.1 library [20] for comparison purposes. They were tested on three systems: i) an i7 3930K with 32GB of RAM and 4 cores, operating at 3.2GHz, and a GeForce GTX 680 with 2GB of main memory and 1536 Shader Processing Units (SPUs), operating at 1GHz; ii) an i7 4770K with 32GB of RAM and 4 cores, operating at 3.5GHz, and a Tesla K40c with 12GB of main memory and 2888 SPUs, operating at 0.7GHz; iii) an i7 4770K with 32GB of RAM and 4 cores, operating at 3.5GHz, and a GeForce GTX 780 Ti with 3GB of main memory and 2880 SPUs, operating at 0.9GHz. All code was compiled with gcc 4.7, with the -O3 flag, and times were measured using the rdtsc instruction. 512 random messages were encrypted, and the average decryption time was measured, for n ∈ {400, 600, 800, 1000}. The performance is reported in Tables I and II. The RNS-GPU label is used for the approach that implements ReduceModDR on the GPU, whereas for the 4-core RNS-CPU label this function runs on the CPU.
The results show that it is possible to similarly enhance the performance on all platforms, when SIMD extensions are not used. Further, since when using RNS it is possible to choose channels whose bit-width is smaller than the word-length of the machine, it is expected that the presented techniques work for a wide range of general-purpose platforms.
The graphs show that there is a direct link between the performance of the GPUs and their memory bandwidth, since the GTX 780 Ti has outperformed the remaining GPUs.
This results from the low arithmetic intensity of the kernels. Moreover, the K40c and the GTX 680 were outperformed by the i7 platforms. There are two aspects that contribute to this behavior. One is related to the memory transfers between the CPU and the GPU, which must take place when the GPU is used. Even though it is possible to hide part of this overhead by executing line 3 of Figure 3 in parallel on the CPU, this step has a small arithmetic complexity. The other is concerned with the different moduli that are used. Since it is possible to work with moduli whose bit-width is twice as large when only using the CPU, the number of arithmetic operations to be performed is approximately halved.

TABLE III
DECRYPTION PERFORMANCE FOR THE INTEL CORE 2 DUO PLATFORM.
Execution Times [×10^6 clock cycles]
Method    n = 500    n = 800
[21]      294.42     1323
Finally, AVX2 extensions greatly boosted the performance of the decryption operation. This was only possible due to the use of the RNS which, due to its carry-free nature, is very well suited to speed up computation with SIMD extensions.
In [21], a cryptosystem similar to the one presented herein was implemented using the NTL 5.5.2 library on an Intel Core 2 Duo platform, running at 2.1 GHz, with 4 GB of RAM. Even though different platforms were used, the numbers of clock cycles reported in Table III are of the same order of magnitude as those of Table I for the sequential method. As such, one may conclude not only that it is most beneficial to employ RNS for the whole Babai's Round-Off procedure, but also that LBCs are greatly enhanced by data parallelism.
VI. PERFORMANCE COMPARISON WITH THE RSA CRYPTOSYSTEM
Whereas some related art states that safe implementations of GGH-like cryptosystems should be of dimension at least 400 [5], more pessimistic approximations propose dimensions of at least 800 [22]. We compared the performance of the decryption operation for dimensions of this order of magnitude with the performance of the equivalent RSA operation, for typical security parameters. The RSA cryptosystem was tested using OpenSSL 1.1.0-dev [23] on the i7 4770K platform, and its performance, as well as the performance of the AVX2 implementation of the GGH-like decryption, is reported in Figure 6. Notably, OpenSSL makes use of the 128-bit SIMD technology SSE2 in order to accelerate multi-precision integer arithmetic. One concludes that the GGH decryption operation takes approximately the same time to execute for dimensions of 400 and 1000 as the equivalent RSA operations for 3072 and 7680 bits, respectively. Taking into account that the proposed implementation has the advantage of post-quantum security, it presents itself as a competitive alternative to RSA.
VII. CONCLUSIONS AND FUTURE WORK
In this work, the proposals [11], [12] for using RNS to enhance the decryption procedure of GGH-like cryptosystems were considered and implemented. They were applied not only to GPU accelerators, but also to CPU devices. Maximum speed-ups of 5.9 and 11.2 were obtained for the GTX 780 Ti and i7 4770K devices, respectively, in comparison with a sequential multi-precision floating point approach.
[Table: Execution Times [×10^6 clock cycles] (Speed-up). Columns: Method, n = 400, n = 600, n = 800, n = 1000. First row: Sequential (i7 3930K).]
One concludes that due to the burdensome memory transfers between the CPU and the GPU, it is often best to execute the whole decryption procedure on the CPU. Furthermore, since the RNS lends itself to SIMD parallelism, it was possible to greatly boost the performance using the AVX2 extensions. Moreover, it was concluded that LBCs present a competitive post-quantum alternative to RSA.
Future work should focus on arithmetically optimized implementations of alternative cryptographic primitives relying on ideal lattices and Learning With Errors problems, which are core components of current homomorphic schemes [24], [25].
