Abstract: With Lattice-based cryptography (LBC), ciphertexts are represented as points near a lattice, and Babai's round-off algorithm allows them to be decrypted when the secret key is known. Recently, an accelerated variant of the round-off, based on Residue Number Systems (RNSs), has been proposed. Herein, we combine this technique with the use of lattices of Optimal Hermite Normal Form (OHNF) and propose further refinements, so as to reduce the decryption complexity. This approach lends itself largely to data-level parallelism, allowing for low-latency decryption operations on multi-core CPUs with Single Instruction Multiple Data (SIMD) extensions, and achieves high throughput on GPUs. Finally, we are able to perform decryptions up to 20 times faster than the most efficient implementation in related art, which exploits the Mixed-Radix System (MRS), on an Intel i7 6700K CPU, and we are able to decrypt up to 11,832 messages/s on a Titan X GPU.

INTRODUCTION
LATTICE-BASED cryptography (LBC) is one of the main branches of modern cryptography. Since its proposal in 1996 by Ajtai [1], research on LBC has been increasing at a fast pace, and it is believed to form a true alternative to all cryptosystems based on the factorization and discrete logarithm problems [2], such as the widely used Diffie-Hellman [3] key-exchange protocol and RSA [4]. Unlike factorization and discrete logarithms, the problems LBC relies upon are thought to be hard even in a post-quantum setting. Beyond post-quantum security, many other properties contribute to the increasing importance of LBC. Primitives based on q-ary ideal lattices enjoy special properties, such as security proofs connecting the hardness of the average-case Learning-with-Errors (LWE) [5] to that of the worst-case Shortest Vector Problem (SVP), or fully homomorphic encryption [6].
Two schemes building on Ajtai's ideas [1] were proposed: Goldreich-Goldwasser-Halevi (GGH) [7] and NTRU [8]. While GGH is not supported by any security proof, it does not require specific structural conditions, unlike NTRU. This additional structure might become a security hazard in the future, should new theoretical developments make attacks on the underlying security assumption feasible for practical parameters [9]. GGH also benefits from high asymptotic efficiency, since its time complexity is in $O(\log_2(\mathrm{key})^2)$, compared with $O(\log_2(\mathrm{key})^3)$ for RSA. Although originally proposed with parameter sizes making them viable in practice, the GGH encryption and signature schemes contained major flaws [10], [11], forcing the lattice dimensions to be very large, essentially to prevent attacks based on lattice reductions. Thus, many subsequent works focused on thwarting these attacks [12], [13] as well as on bringing practical improvements [14], [15], [16] to make the scheme viable under safe parameters. More recently, [17, Section 5.4.2] provided a framework for instantiating trapdoor functions in a GGH style, but with security features related to hard lattice problems, such as the LWE.
A lattice is the subgroup of all linear combinations, with integer coefficients, of the vectors of a basis. For dimensions greater than or equal to two, each lattice has infinitely many bases. In particular, almost orthogonal bases are best suited to solve the Closest Vector Problem (CVP). The CVP corresponds to finding the closest lattice point to a vector in space that lies at a distance of at most $\lambda_1(L)/2$ from the lattice, where $\lambda_1(L)$ is the norm of the shortest nonzero vector of the lattice. In a GGH-like scheme, the private key corresponds to a nearly orthogonal basis of a random lattice, with which it is possible to solve the CVP. In contrast, the public key is noticeably skewed, making it unsuitable for solving the CVP. In order to encrypt a message, a user encodes it as a small "error" vector which is added to a point of the lattice. The recipient of the ciphertext computes the closest lattice point, and outputs the difference between the cryptogram and the lattice point as the plaintext.
Whereas GGH encryption is in general a very efficient operation, solving the CVP is burdensome. Babai proposed in [18] the Round-Off algorithm, which solves the CVP when a good basis of the lattice is known, and is parallel in nature.
However, this algorithm requires handling very large integers, which might become a performance bottleneck. With the Residue Number System (RNS), the processing of large numbers can be split into multiple channels, which operate independently from one another. When using this system, the overhead associated with adding and multiplying multiprecision integers is removed, and one can take advantage of the architectures' optimizations provided by most general purpose computing platforms for integer arithmetic. Furthermore, since the bit-width of the RNS channels can be tailored for the targeted system, optimizations supported on this system can be easily extended to multiple platforms. Finally, since each channel operates independently, its usage benefits the exploitation of data parallelism.
The use of RNS in the context of GGH decryption was first suggested in [15] , but its usage remained quite inceptive. Then, another work [19] endeavored to adapt the formulation of the round-off algorithm so that its shape fitted an RNS representation better. However, it was mandatory to use an alternative number system, which limits the advantages of RNS. This approach was implemented and tested in [20] . Finally, a recent work [21] provided a technique in order to obtain a full RNS version of the round-off algorithm, but it was only tested in an FPGA with limited results.
Herein, we propose the usage of lattices with an Optimal Hermite Normal Form (OHNF), within the context of [21], as well as a new refinement, which reduces the complexity of the decryption algorithm. Furthermore, we describe for the first time the implementation of a full RNS version of the round-off algorithm both on CPUs with Single Instruction Multiple Data (SIMD) extensions and on GPUs. The implementation is tested across multiple platforms, so as to analyze its scalability and applicability. In particular, we are able to perform decryptions up to 20 times faster than the most efficient round-off implementation in related art, which exploits the MRS, on an Intel i7 6700K CPU, and we are able to decrypt up to 11,832 messages/s on a Titan X GPU.
BACKGROUND
In this section, an overview of LBC, and in particular of GGH will be given, as well as of the RNS.
Conventions and Notations
Integer or real scalars are denoted by lower-case non-bold Latin or Greek characters ($a$, $x$, $\varepsilon$, etc.), vectors by lower-case bold Latin characters ($\mathbf{c}$, $\mathbf{v}$, etc.), and matrices by upper-case bold Latin characters ($\mathbf{R}$, $\mathbf{H}$, etc.). Vectors are represented as rows, so a vector-matrix product is denoted $\mathbf{v}\mathbf{R}$. For $a \in \mathbb{Q}$, $b = \lfloor a \rceil$ denotes the closest integer $b \in \mathbb{Z}$ to $a$, with ties broken upward. This notation is naturally extended to vectors. A tilde ($\sim$), when used as a relational operator, indicates that two values are of the same order of magnitude.
Lattice-Based Cryptography
A (full-rank) integer lattice $L$ in $\mathbb{R}^n$ is a discrete subgroup $\mathbf{r}_1\mathbb{Z} \oplus \ldots \oplus \mathbf{r}_n\mathbb{Z}$ of $(\mathbb{R}^n, +)$, where $\{\mathbf{r}_1, \ldots, \mathbf{r}_n\}$ are independent integer vectors. If $\mathbf{R}$ is the matrix whose rows are the $\mathbf{r}_i$'s, then $L$ is the set $\{\mathbf{v}\mathbf{R} \mid \mathbf{v} \in \mathbb{Z}^n\}$. $\mathbf{R}$ is a basis of the lattice $L$, and as soon as $n \ge 2$, any (full-rank) lattice has infinitely many bases. Two different integer matrices $\mathbf{R}$ and $\mathbf{H}$ are bases of the same lattice if, and only if, the product $\mathbf{H}\mathbf{R}^{-1}$ is a unimodular matrix, $\mathbf{H}\mathbf{R}^{-1} \in SL_n(\mathbb{Z})$ (the special linear group [22]). Given the previous basis $\mathbf{R}$, the associated fundamental domain (cf. Fig. 1) is denoted $D_L(\mathbf{R})$.
A fundamental domain is a set of representatives of the torus $\mathbb{R}^n/L$; the choice of a basis determines a non-overlapping tiling of the whole space.
Computing a Close Vector with Babai's Round-Off Algorithm
For solving the CVP, Babai's Round-off can be used whenever a "good" basis is known. Given such a basis $\mathbf{R}$ of a lattice $L$, and a target vector $\mathbf{c} \in \mathbb{Z}^n$, Babai's algorithm returns (1) as a vector of $L$ close to $\mathbf{c}$:
$$\lfloor \mathbf{c}\mathbf{R}^{-1} \rceil \mathbf{R} \qquad (1)$$
The principle is to change the coordinates by expressing $\mathbf{c}$ in basis $\mathbf{R}$. This maps $L$ onto the lattice $\mathbb{Z}^n$. Then, by rounding each component of the image of $\mathbf{c}$ towards its nearest integer, a vector of $\mathbb{Z}^n$ close to $\mathbf{c}\mathbf{R}^{-1}$ is provided. Finally, the result is sent back to the original set of coordinates, giving an approximation of the closest vector of $L$ to $\mathbf{c}$.
The quality of the result depends on the shape of $D_L(\mathbf{R})$ as well as on $\mathrm{dist}(\mathbf{c}, L)$. A simple way to describe a set of targets $\mathbf{c}$ for which the Round-off algorithm returns the closest vector relies on the use of the infinity norm. After the first change of coordinates, the algorithm returns the closest point in the grid $\mathbb{Z}^n$ for the distance induced by the infinity norm. Within the original set of coordinates, it means that the Round-off returns the lattice point $\mathbf{v}$ such that $\mathbf{c}$ belongs to the tile $\mathbf{v} + D_L(\mathbf{R})$. In other words, it returns the unique lattice point $\mathbf{v}$ which belongs to $\mathbf{c} + D_L(\mathbf{R})$. Thus, the more orthogonal the fundamental parallelepiped, the closer to the target the result will be.
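As an illustration of the round-off and of its dependence on the basis, the following minimal Python sketch applies the three steps (change of coordinates, rounding, change back) in dimension 2 with exact rational arithmetic. The bases `R_good` and `R_bad`, the helper names and the numerical values are illustrative assumptions, not taken from the referenced works; both matrices generate the same lattice.

```python
from fractions import Fraction
from math import floor

def round_half_up(x):
    """Nearest integer to x, ties broken upward."""
    return floor(x + Fraction(1, 2))

def inverse_2x2(R):
    """Exact inverse of a 2x2 integer matrix, with Fraction entries."""
    (a, b), (c, d) = R
    det = a * d - b * c
    if det == 0:
        raise ValueError("basis vectors must be independent")
    return [[Fraction(d, det), Fraction(-b, det)],
            [Fraction(-c, det), Fraction(a, det)]]

def vec_mat(v, M):
    """Row vector times matrix."""
    return [sum(v[i] * M[i][j] for i in range(len(v)))
            for j in range(len(M[0]))]

def babai_round_off(c, R):
    """Return the lattice vector round(c R^-1) R."""
    coords = vec_mat(c, inverse_2x2(R))           # express c in basis R
    rounded = [round_half_up(x) for x in coords]  # nearest point of Z^n
    return vec_mat(rounded, R)                    # back to original coordinates

R_good = [[7, 1], [-1, 8]]    # nearly orthogonal basis (illustrative)
R_bad = [[7, 1], [34, 13]]    # same lattice: second row is 5*r1 + r2

lattice_point = vec_mat([3, -2], R_good)
target = [lattice_point[0] + 2, lattice_point[1] - 1]  # small error (2, -1)

closest = babai_round_off(target, R_good)  # recovers lattice_point
skewed = babai_round_off(target, R_bad)    # a lattice point, but farther away
```

With the nearly orthogonal basis, the error (2, -1) has coordinates of absolute value below 1/2 in basis `R_good`, so the original lattice point is recovered; in the skewed basis one coordinate of the same error exceeds 1/2 and the rounding lands on a different lattice point.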
Classical Lattice-Based Cryptosystems Relying on the CVP
Computing close vectors in a lattice is computationally hard. This has led to conceptually simple cryptosystems. In the GGH scheme [7], a full-rank lattice $L$ is represented by two bases. One of them, $\mathbf{R}$, is an exceptionally "good" basis (in terms of orthogonality) and is the secret key, whereas the other one, $\mathbf{H}$, is a "bad" basis and is made public. The scheme works as soon as the shape of $D_L(\mathbf{R})$ is orthogonal enough so that it contains a set $P_{s,\mathbf{R}} = [-s, s]^n \cap \mathbb{Z}^n$. This is used as the plaintext space. A ciphertext is the sum of a lattice point and a plaintext. The decryption is realized by using the Round-off algorithm. Therefore, the decryption is guaranteed to be correct for any ciphertext in the following space:
A decryption consists in the following computation:
Eq. (2) leads to a sufficient condition (4) on s which guarantees a correct decryption for any plaintext in P. It is usually called Babai's condition
Fig. 2 depicts the volume (in gray), contained in each tile centered on the lattice points and shaped by $D_L(\mathbf{R})$, inside which the target is assumed to lie (black dots represent $P_{s,\mathbf{R}}$ with $s = 3$, i.e., integer vectors). The Round-off algorithm will return the closest lattice point whenever the target lies in such a volume. Meanwhile, $D_L(\mathbf{H})$, and more generally any fundamental parallelepiped associated with a (polynomial-time) reduction of $\mathbf{H}$, should not allow recovering a plaintext by the simple use of (3). The principle of such a scheme was extended to q-ary ideal lattices through the NTRU scheme [8].
Due to the dimension, which is constrained to be quite high in view of all previous attacks, the scheme was enhanced several times by acting on the public key size, which was the major problem. Micciancio [14] proposed using the Hermite Normal Form (HNF) of the secret key $\mathbf{R}$ as the public key $\mathbf{H}$, that is, $\mathbf{H}_{i,j} = 0$ if $i < j$, $\mathbf{H}_{i,j} \ge 1$ if $i = j$ and $\mathbf{H}_{i,j} < \mathbf{H}_{j,j}$ if $i > j$. This decreases the public key size to $O(n^2 \log n)$ bits, instead of $O(n^3 \log n)$ initially. Since the HNF is an invariant of the lattice, and since it can be computed in polynomial time, giving the HNF of the secret key does not weaken the original scheme. In this new variant, a plaintext $\mathbf{p}$ is encrypted by being reduced modulo $\mathbf{H}$.
Due to previous successful attacks [10] , [11] and constantly evolving lattice reduction algorithms, secure dimensions for a GGH-like scheme are large, making the computation of the Round-off burdensome. By following the general guidelines of Verheul and Lenstra [23] to determine minimal key sizes which would be computationally secure regarding today's unfeasible number of MIPS-Years, it appears that safe dimensions should be around 1,000 [24] .
Residue Number Systems
A significant overhead is introduced when adding and multiplying numbers whose bit-width is larger than the one supported by a single instruction of general-purpose processors. We define this type of arithmetic to be multi-precision. Exploiting the Residue Number System within this context significantly enhances the level of parallelism, by splitting the large numbers into multiple channels, which fit within the processor word and can be processed independently.
Definitions
The existence of the RNS is due to the Chinese Remainder Theorem (CRT). Given any set of pairwise coprime moduli $B = \{m_1, \ldots, m_k\}$ (an "RNS base"), the following morphism is bijective
This induces a non-positional number system, in which any integer belonging to the "dynamic range" $[0, M_B)$, where $M_B = \prod_{m \in B} m$, is completely determined by its residues modulo the $m_i$'s.
The residues of any scalar $s$ (resp. vector $\mathbf{v}$ or matrix $\mathbf{R}$) will be denoted as $(s_m)_{m \in B}$ (resp. $(\mathbf{v}_m)_{m \in B}$, $(\mathbf{R}_m)_{m \in B}$) in the RNS defined by the base $B$. A single residue is usually denoted $|s|_m$. Also, in an RNS base, the term channel refers to a single modulus and its associated arithmetic, i.e., a ring $\mathbb{Z}/(m)$.
RNS Arithmetic
The main advantage of RNS is that one can implement the arithmetic independently for each channel. The absence of carry propagation between residues within addition, multiplication and exact division makes it possible to reduce the complexity of such basic operations through concurrency. Indeed, for any integers $x, y$, the quantity $|x \odot y|_{M_B}$, where $\odot$ stands for any of these operations, is completely determined by the residues $(|x \odot y|_m)_{m \in B}$. Thus, some computation schemes such as the sum-of-products can be performed very efficiently in RNS. Nevertheless, not all standard and common arithmetical operations can be performed so efficiently in RNS. This is in particular the case of modular reduction and comparison. Whereas the comparison may be achieved by a conversion towards the positional MRS system, the RNS modular reduction involves the use of RNS base conversions.
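The channel-wise nature of RNS addition and multiplication can be sketched as follows. This is a minimal Python illustration under assumed parameters: the base `BASE` and the function names are hypothetical, and the CRT reconstruction `from_rns` is included only to check the result, since an actual RNS implementation would stay in residue form as long as possible.

```python
from math import prod

BASE = (13, 17, 19, 23)   # pairwise coprime moduli (illustrative values)
M_B = prod(BASE)          # the dynamic range is [0, M_B)

def to_rns(x, base=BASE):
    """Residues of x, one per channel."""
    return tuple(x % m for m in base)

def rns_add(xs, ys, base=BASE):
    """Channel-wise addition: no carry crosses from one channel to another."""
    return tuple((x + y) % m for x, y, m in zip(xs, ys, base))

def rns_mul(xs, ys, base=BASE):
    """Channel-wise multiplication."""
    return tuple((x * y) % m for x, y, m in zip(xs, ys, base))

def from_rns(residues, base=BASE):
    """CRT reconstruction, used here only to verify the channel arithmetic."""
    M = prod(base)
    total = 0
    for r, m in zip(residues, base):
        M_m = M // m
        total += (r * pow(M_m, -1, m) % m) * M_m
    return total % M

x, y = 123, 456
z = rns_add(rns_mul(to_rns(x), to_rns(y)), to_rns(x))  # x*y + x in RNS
```

Each tuple entry is computed without reference to the others, which is what allows the channels to be mapped onto SIMD lanes or GPU threads. The result is only meaningful as long as the true value stays inside the dynamic range.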
RNS Modular Reduction. The state-of-the-art algorithms [25] are adapted from Montgomery's reduction [26]. Given an integer $x$ with $|x| < p^2$ to be reduced modulo $p$ and expressed in a radix-$b$ representation, Montgomery's idea is to avoid costly divisions by computing an integer $q$ such that $x + qp$ is a multiple of $b^n$. Thus, if $q$ is carefully chosen, the exact division $(x + qp)/b^n$ is a simple right shift of $b$-words. To adapt this approach to the RNS representation, an extra RNS base $B'$ is used along with the main base $B$. In this case, $q$ must be such that $x + qp$ is a multiple of $M_B$. For that purpose, $q = |-x/p|_{M_B}$ is computed in RNS in the base $B$. Then, the RNS representation of $q$ has to be somehow "extended" to the base $B'$. Once this is achieved, the exact division $(x + qp)/M_B$
can be performed in $B'$. Finally, the representation of the result can be extended to $B$ if necessary. Denoting such a base extension procedure by Bex, the main framework of an RNS modular reduction performed in a base $B'$ is the following one:
A first issue of Montgomery's approach is that the result is congruent to $|x M_B^{-1}|_p$ modulo $p$. In order to address this problem, one can either pre-multiply $x$ by $|M_B|_p$ and, consequently, increase the size of $B$ to compensate for the bigger inputs, or integrate the Montgomery representation in a precomputed factor.
A second issue is that the final result may not be completely reduced modulo $p$. This is in particular due to the base extension applied to $q$. As described in the next paragraph, such a procedure can provide the residues of $q + aM_B$ for a small integer $a \ge 0$. For instance, by assuming that $p \times M_B > |x|$, the computation of $x \times |yM_B|_p \bmod p$ through RNS Montgomery reduction would provide in $B'$ the result $|xy|_p + dp$ with $d \in [-1, a + 1]$. If the context requires a full modular reduction, it is possible to use an RNS-to-MRS conversion, which allows comparisons and then recovery and correction of the overflow $dp$.
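The three steps of the RNS Montgomery reduction (computation of $q$ in $B$, extension to $B'$, exact division in $B'$) can be sketched in Python as follows. This is a simplified illustration, not the procedure of [25]: the base extension is replaced by an exact CRT reconstruction (the helper `crt`), the input is passed as a plain integer rather than as residues, and the bases and the modulus are arbitrary illustrative values. It assumes $0 \le x < pM_B$, $p$ coprime to $M_B$, $B'$ coprime to $B$ and $M_{B'} > 2p$.

```python
from math import prod

def crt(residues, base):
    """Exact CRT reconstruction; stands in for a base extension in this sketch."""
    M = prod(base)
    total = 0
    for r, m in zip(residues, base):
        M_m = M // m
        total += (r * pow(M_m, -1, m) % m) * M_m
    return total % M

def rns_montgomery(x, p, base_b, base_b2):
    """Residues in base_b2 of r, where r = x * M_B^-1 (mod p) and 0 <= r < 2p."""
    M_B = prod(base_b)
    # Step 1: q = |-x / p|_{M_B}, computed independently in each channel of B.
    q_b = [(-x * pow(p, -1, m)) % m for m in base_b]
    # Step 2: extend q from B to B' (exactly here; a fast extension
    # may instead deliver q + a*M_B for a small integer a).
    q = crt(q_b, base_b)
    q_b2 = [q % m for m in base_b2]
    # Step 3: exact division (x + q*p) / M_B, channel by channel in B'.
    return [((x + q_m * p) * pow(M_B, -1, m)) % m
            for q_m, m in zip(q_b2, base_b2)]

B = (13, 17, 19, 23)     # main base (illustrative)
B2 = (29, 31, 37)        # extra base, coprime to B, with M_B2 > 2p
p, x = 10007, 12345678   # x < p * M_B
r = crt(rns_montgomery(x, p, B, B2), B2)
```

The output satisfies $r \equiv x M_B^{-1} \pmod p$ but only $r < 2p$, which illustrates both issues discussed above: the extra factor $M_B^{-1}$ and the incomplete reduction.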
Base extensions. A base extension is an algorithm which, given the residues in a base $B$ of an integer $s \in [0, M_B)$, computes the residues of $s$ in a second base $B'$ which is coprime to $B$. Such an operation mainly consists in running, in the channels of $B'$, an algorithm which more generally converts $(s_m)_{m \in B}$ into a classical positional number system.
In practice, two main techniques are used to recover an integer from its residues. They rely on either a Lagrange-style or a Newton-style interpolation. The first one comes from the classical constructive proof of the CRT. An integer $s$ lying in the range $[0, M_B)$ can be recovered from its residues in $B$ by
with the following notations
Computing the final reduction modulo $M_B$ (i.e., recovering the integer $k_B(s)$) in (6) is the most delicate part. There are ways to do it quite efficiently [27], [28]. However, a fast conversion without computing it [25] remains the most efficient type of extension. This is especially true when it is performed on generic concurrency-oriented architectures such as GPUs, multi-core CPUs, etc. [29], [30], because one can fully exploit the intrinsic concurrency properties of RNS arithmetic. In some cases, this type of incomplete extension can be sufficient. In its fastest form, a base extension boils down to a $k \times \ell$ matrix multiplication, where $k$ (resp. $\ell$) is the number of residues in the start base $B$ (resp. final base $B'$). Such a base extension algorithm is denoted by
Eq. (7) provides $s + aM_B$ for an integer $a$ in $[0, k)$. In the following discussions, this type of extension is said to be possibly "incomplete". The second type of conversion relies on another number system, called the Mixed Radix System (MRS). Contrary to RNS, an MRS is a positional system. The computation of MRS coefficients from RNS residues follows the same mechanism as divided differences in Newton's interpolation. The choice of an order among the moduli of an RNS base $B = \{m_1, \ldots, m_n\}$ determines an MRS base $\{1, m_1, m_1 m_2, \ldots, m_1 \cdots m_{n-1}\}$. Then, the MRS coefficients in this base, and consequently $s$, are
Once the MRS coefficients have been obtained, s can be reconstructed by
The sum in (9) is computed in the arrival RNS base. Contrary to (7), the advantage of this kind of base extension is that it gives access to a positional system. Moreover, the way that s is reconstructed in (9) implies that the extension is always "complete". However, there is a huge drawback. The computation of MRS coefficients (8) is intrinsically sequential. Thus, such base extension procedure is hardly parallelizable, and may noticeably slow down an RNS algorithm.
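The contrast between the two families of base extension can be sketched in Python as follows. Since the exact formulas are those of the referenced works, the code below should be read as one standard way of realizing them, with hypothetical function names: `fast_extend` follows the Lagrange-style reconstruction without the final reduction modulo $M_B$, and `mrs_extend` goes through mixed-radix digits.

```python
from math import prod

def fast_extend(residues, base, new_base):
    """Lagrange-style extension without the final reduction modulo M_B.

    Returns the residues in new_base of s + a*M_B, for some 0 <= a < len(base).
    """
    M = prod(base)
    # xi_m = |s_m * (M/m)^-1|_m: one independent computation per channel.
    xis = [(r * pow(M // m, -1, m)) % m for r, m in zip(residues, base)]
    # Matrix-like product: each target channel sums xi_m * |M/m|_{m'}.
    return [sum(xi * ((M // m) % mp) for xi, m in zip(xis, base)) % mp
            for mp in new_base]

def mrs_digits(residues, base):
    """Mixed-radix digits a_i with s = a_1 + a_2*m_1 + a_3*m_1*m_2 + ..."""
    digits = []
    work = list(residues)
    for i, m_i in enumerate(base):
        a_i = work[i] % m_i
        digits.append(a_i)
        # Later channels need the digit just produced: inherently sequential.
        for j in range(i + 1, len(base)):
            m_j = base[j]
            work[j] = ((work[j] - a_i) * pow(m_i, -1, m_j)) % m_j
    return digits

def mrs_extend(residues, base, new_base):
    """Complete extension: evaluate the mixed-radix expansion in each new channel."""
    digits = mrs_digits(residues, base)
    out = []
    for mp in new_base:
        acc, weight = 0, 1
        for a_i, m_i in zip(digits, base):
            acc = (acc + a_i * weight) % mp
            weight = (weight * m_i) % mp
        out.append(acc)
    return out

base, new_base = (13, 17, 19, 23), (29, 31)   # illustrative bases
s = 54321
residues = [s % m for m in base]
exact = mrs_extend(residues, base, new_base)   # residues of s
fast = fast_extend(residues, base, new_base)   # residues of s + a*M_B
```

In `fast_extend` every channel works independently, whereas the outer loop of `mrs_digits` cannot start iteration $i+1$ before digit $i$ is known, which is the sequential dependency mentioned above.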
COMPARISON BETWEEN THE STATE-OF-THE-ART ROUND-OFF APPROACHES
In this section, the algorithms proposed in [19] , [20] and [21] for the computation of Babai's round-off algorithm will be examined. With this analysis, the best features of each approach will be extracted, so that they can be combined and further enhanced in the next section.
In the present context of the CVP, where the distance induced by the infinity norm is bounded above by $s$, it is enough to compute (1) modulo a small $m_s$ such that $m_s \ge 2s + 1$. In other words, it suffices to recover the small "noise" $\mathbf{p}$ by computing $(\mathbf{c} - \lfloor \mathbf{c}\mathbf{R}^{-1} \rceil \mathbf{R}) \bmod m_s$. However, the round-off is a bottleneck because it is necessary to compute $\mathbf{c}\mathbf{R}^{-1}$ with sufficient precision in order to recover the result.
As previously noticed, RNS is of a non positional nature. So the rounding function must be adapted in order to be done as efficiently as possible in RNS. In [19] , the following reformulation, more compliant with RNS, has been proposed
where $\mathbf{R}'$ is the adjoint matrix of $\mathbf{R}$, and $\mathbf{v}_1$ is the all-one vector. Consequently, the bottleneck of the computation is transferred to an RNS modular reduction. Thus, an RNS base $B$ needs to be introduced to perform this modular reduction.
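The idea of trading a rounding for a modular reduction can be checked on scalars. The sketch below assumes one identity of this kind, $\lfloor a/d \rceil = (2a + d - |2a + d|_{2d})/(2d)$ for integers $a$ and $d > 0$, which is consistent with the roles of the adjoint matrix, the all-one vector and the modulus $2\det(\mathbf{R})$ in the surrounding text; it is an illustrative assumption rather than a restatement of the formula of [19], and the function names are hypothetical.

```python
from fractions import Fraction
from math import floor

def round_via_mod(a, d):
    """Nearest integer to a/d (ties upward) using one reduction modulo 2d."""
    if d <= 0:
        raise ValueError("d must be positive")
    t = 2 * a + d
    return (t - t % (2 * d)) // (2 * d)   # the division is exact

def round_half_up(a, d):
    """Reference implementation with exact rationals."""
    return floor(Fraction(a, d) + Fraction(1, 2))

# Each coefficient of c R^-1 has the form a/d, where a is a coefficient
# of c R' and d = det(R), so the identity applies component by component.
mismatches = [(a, d) for d in (1, 3, 7, 10, 57) for a in range(-200, 201)
              if round_via_mod(a, d) != round_half_up(a, d)]
```

The only non-trivial operation left is the reduction of $2a + d$ modulo $2d$, which is why the cost of the round-off moves to an RNS modular reduction.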
Since $\mathbf{R}'$ and $\det \mathbf{R}$ are constant inputs in the context of a GGH-like scheme, the Montgomery representation can be managed through these inputs. To do so, the precomputed matrix $\tilde{\mathbf{R}} = |2M_B\mathbf{R}'|_{2\det(\mathbf{R})}$ and vector $\tilde{\mathbf{d}} = |\det(\mathbf{R}) M_B \mathbf{v}_1|_{2\det(\mathbf{R})}$, which corresponds simply to $\det(\mathbf{R})\mathbf{v}_1$, are used. It follows that the modular reduction is given by the following formula (where $d := \det(\mathbf{R})$ and
However, obtaining a complete reduction is not that easy. The overflow $\mathbf{v}_o$ can be due both to the intrinsic structure of Montgomery's reduction itself and to an incomplete base extension from $B$. Two solutions have been proposed to address this problem [19], [21], the former of which has been improved and tested in [20]. The first one goes through the MRS positional system in order to have access to final comparisons, which make it possible to complete the modular reduction. The second one exhibits a trick which allows staying entirely in RNS (i.e., no comparison is needed anymore).
MRS-Based Approach [19]
When computing (11) modulo $m_s$, it is required to extend the quantity $|-(\mathbf{c}\tilde{\mathbf{R}} + \tilde{\mathbf{d}})/2d|_{M_B}$ from $B$. Even if this extension is complete, the result of the Montgomery reduction is not guaranteed to be completely reduced yet. Given this, it becomes more interesting to do it with an efficient but inaccurate extension like (6). The final correction of the modular reduction will then be made by comparing it with all possible multiples of $2d$. This requires introducing another large RNS base $B'$, and proceeding with an RNS-to-MRS conversion in this new base. However, it is possible to reduce the number of comparisons to perform, i.e., to bound $\|\mathbf{v}_o\|_\infty$ above by 1. By adding a small modulus $\tilde{m}$ besides $B'$, which verifies $\tilde{m} > \|\mathbf{v}_o\|_\infty$, one can perform a second Montgomery reduction inside $B'$ via a base extension from $\tilde{m}$ towards $B'$ (which is actually a simple copy-paste).
The sizes of $B$ and $B'$ determine the complexity of this approach. They are determined by the conditions $M_B > \|\mathbf{c}\|_\infty$ and $M_{B'} > 2 \times 2d$. Moreover, in the HNF variant of GGH, $\|\mathbf{c}\|_\infty = |\mathbf{c}_1| < d$. Consequently, if any modulus is smaller than $b = 2^v$ (i.e., $v$ is the bit-size of the RNS moduli) and $k = \lceil \log_b(d) \rceil$, the cost of the whole computation is dominated by the first base conversion, which is essentially $k^2 n$, and also by the RNS-to-MRS conversion, which is carried out through $\frac{k(k-1)}{2} n$ elementary RNS multiplications (9) (that is, multiplications in a small ring $\mathbb{Z}/m_i\mathbb{Z}$), and involves multiple synchronization points.
Algorithm 1 details this strategy. Its cost for each component is clearly quadratic in $k$, because of the base extension between $B$ and $B'$ and because of the RNS-to-MRS conversion in $B'$. Drawing from [16], the use of lattices of Optimal Hermite Normal Form within the context of Algorithm 1 was proposed in [20]. The HNF of a lattice is said to be optimal if, except for its first column, it coincides with the identity matrix. This is for instance almost always the case when the determinant of the lattice is a prime number. This property significantly reduces the size of the public key, as well as of the ciphertext. In particular, cryptograms have a single non-zero entry, which means that the vector-matrix multiplications of lines 1, 4 and 16 reduce to scalar-vector multiplications. Despite this reduction of complexity, the efficiency of the implementation of [20] on multi-core CPUs and GPUs was limited because of the costly conversion between an RNS and an MRS representation.
Algorithm 1. RNS Round-Off with RNS-to-MRS Conversion [19]
Require: $\mathbf{c}$,
RNS-Based Approach [21]
In the previous approach, the overall cost is mainly due to the use of the large base $B'$. This dramatically increases the cost of the first base extension. Moreover, the RNS-to-MRS conversion remains the main reason limiting the efficiency of this technique. Indeed, besides the fact that such a base extension has a quadratic cost in the number of moduli, it is also hardly parallelizable.
In [21], a new solution has been proposed so that a large base $B'$ is not required anymore. When Babai's condition (2) is written as
the value $\varepsilon$ reveals a thin area located between the fundamental parallelepiped described by $\mathbf{R}$ and the ball $B_\infty(\mathbf{0}, \frac{1}{2}) \times \mathbf{R}$ which contains the image of the plaintext space in basis $\mathbf{R}$. The strategy relies on the existence of this gap. In particular, it relies on the fact that the introduction of a small extra error to ciphertexts is possible without changing the correctness of decryption. Such an error could come from an incomplete modular reduction in (11), for instance. The strategy proposed therein is then quite different from the previous one. Since only integer vectors are considered, any extra error coming from the use of RNS is an integer. So, instead of exploiting the gap $\varepsilon$ as is, the first step is to scale the target vector $\mathbf{c}$ by an integer $g$. By doing that, the new gap $g\varepsilon$ in the lattice $gL$ is expected to become large enough to contain the error due to the RNS approach of rounding-off (10).
Fig. 3 illustrates the principle. The lattice $L$ is scaled by $g$. Thus, $g\mathbf{R}$ is a basis of $gL$, and $gD_L(\mathbf{R}) = D_{gL}(g\mathbf{R})$. A target vector $\mathbf{c} \in P + L$ is turned into $g\mathbf{c} \in gP + gL$. Whereas, initially, the round-off sends any real vector in the ball $B_\infty(\mathbf{0}, \frac{1}{2})$ to (the lattice point) $\mathbf{0}$, the RNS technique relies on the use of $\mathrm{modc}\ g$ (centered remainder), which is expected to send integer vectors in $B_\infty(\mathbf{0}, \frac{g}{2})$ to $\mathbf{0}$. When $g\mathbf{c}$ is assumed to lie in the light gray volume of the right part of Fig. 3 (meaning $gP_s$, i.e., $gP_{s,\mathbf{R}}$ for the considered basis $\mathbf{R}$), the dark gray volume, due to Babai's condition (12), acts as a safety net which can contain some integer error due to an approximated RNS base conversion/RNS round-off computation. The main new feature of this technique is that only a single-modulus base $g$ is required instead of $B'$ (under the condition that $g\varepsilon$ is, roughly, the size of the extra error).
The vector $\lfloor g\mathbf{c}\mathbf{R}^{-1} \rceil$ can be viewed as the sum $g\lfloor \mathbf{c}\mathbf{R}^{-1} \rceil + \lfloor g\mathbf{p}\mathbf{R}^{-1} \rceil$. Thus, $\lfloor g\mathbf{p}\mathbf{R}^{-1} \rceil$ could be obtained by computing $\lfloor g\mathbf{c}\mathbf{R}^{-1} \rceil$ modulo $g$. In that case, Babai's condition simply becomes an interval constraint on the coefficients of $g\mathbf{p}\mathbf{R}^{-1}$.
However, the use of the RNS Montgomery reduction introduces an extra error $\mathbf{v}_e$, and one is only able to obtain
So the actual quantity that one needs to know is $\lfloor g\mathbf{p}\mathbf{R}^{-1} \rceil + \mathbf{v}_e$, in order to correct it. If $g$ is carefully chosen such that $g\varepsilon > \|\mathbf{v}_e\|_\infty$, then the computation of the centered remainders of $\mathrm{RNS}(\lfloor g\mathbf{c}\mathbf{R}^{-1} \rceil)$ modulo $g$ yields $\lfloor g\mathbf{p}\mathbf{R}^{-1} \rceil + \mathbf{v}_e$, and consequently $\lfloor \mathbf{c}\mathbf{R}^{-1} \rceil$. In [21], the size of $g$ is related to $\varepsilon$ and $|B|$ in the following way: $g \ge \lceil |B| / \varepsilon \rceil$. It follows that this technique is efficiently applicable as soon as $g$ can be written as a single $b$-word. In practice, such a word could be 32 bits wide, for instance. The technique would then be available for bases $\mathbf{R}$ verifying (4) with $\varepsilon \gtrsim |B| / 2^{32}$. The probability that such a condition is not verified as soon as (2) is true seems quite small in practice. Heuristically, one could expect this conditional probability to be asymptotically approximated by the measure of the corresponding interval of values of $\varepsilon$ under the density of a continuous random variable. Hence, this acceleration technique for a random basis verifying the initial Babai's condition should be realistically applicable as is in practice.
This system was implemented on an FPGA with limited results. Unlike the previous approach, lattices were not restricted to be of OHNF, which greatly increased the complexity of the computation. Thus, only lattices of small dimension fitted on the available FPGA hardware resources. Nevertheless, it should be noted that, since the proposal of [21] avoids the computation of mixed-radix digits, it should be more amenable to parallelization, because it features a reduced need for synchronization.
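The role of the scaling factor $g$ can be illustrated on a single coordinate. In the Python sketch below, a coordinate of $\mathbf{c}\mathbf{R}^{-1}$ is written $x = k + e$ with $k$ an integer and $|e| < \frac{1}{2} - \varepsilon$; the values of `k`, `e`, `g` and of the integer error `v_e` are illustrative assumptions, as are the helper names.

```python
from fractions import Fraction
from math import floor

def round_half_up(x):
    """Nearest integer, ties broken upward."""
    return floor(x + Fraction(1, 2))

def centered_mod(t, g):
    """Centered remainder of t modulo g (in [-g/2, g/2))."""
    r = t % g
    return r - g if 2 * r >= g else r

def recover_rounding(noisy_scaled, g):
    """Given t = round(g*x) + v_e, return round(x).

    Valid as long as round(g*e) + v_e is equal to its centered remainder modulo g.
    """
    small = centered_mod(noisy_scaled, g)   # equals round(g*e) + v_e
    return (noisy_scaled - small) // g      # the exact quotient is k

k, e = 41, Fraction(9, 20)   # e = 0.45, so the gap is epsilon = 0.05
x = k + e
g = 64                       # g * epsilon = 3.2
v_e = 2                      # integer error attributed to the RNS computation

t = round_half_up(g * x) + v_e   # what the approximate computation delivers
k_scaled = recover_rounding(t, g)
k_unscaled = round_half_up(x) + v_e   # same error without scaling
```

Without scaling, an integer error of 2 lands directly on the result; after scaling by $g$, the same error is absorbed by the gap $g\varepsilon$ and removed by the centered remainder.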
PROPOSED IMPROVED DECRYPTION ALGORITHM
In this section, we propose a system where lattices of OHNF are used for a cryptographic system with decryption completely performed in the RNS domain, along with further refinements. OHNFs significantly reduce both the time and space complexity of the decryption algorithm. When RNS is applied to this system, a larger amount of parallelism is made available, and the complexity of the algorithm is further reduced. Finally, novel refinements minimize the amount of precomputations and operations required by the procedure.
The first contribution of this paper is the application of lattices of OHNF to the decryption algorithm proposed in [21]. Keys are generated as proposed in [16], wherein private keys are generated as nearly orthogonal and randomly rotated lattice bases. The random process of generating a private key is repeated until the corresponding public key is of OHNF. The corresponding message space is $P_{s,\mathbf{R}}$, and a plaintext $\mathbf{p}$ is simply encrypted by computing $\mathbf{p} \bmod \mathbf{H} = (c, 0, \ldots, 0)$. Moreover, $\mathbf{c} = (c, 0, \ldots, 0)$ can still be written as $\mathbf{p} + \mathbf{k}\mathbf{R}$ for a certain $\mathbf{k} \in \mathbb{Z}^n$. Therefore, the decryption function is still (3).
In the same manner that the usage of lattices of OHNF was beneficial in [20] , in this case, one avoids having to perform a vector-matrix multiplication for each RNS channel, which is replaced by a scalar-vector multiplication. Moreover, this significantly reduces the storage space of the public and private keys as well as of the cryptograms.
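The reduction of a plaintext modulo an OHNF basis can be sketched as follows. The layout of $\mathbf{H}$ assumed here follows the HNF definition given earlier with the row-vector convention: the first row is $(d, 0, \ldots, 0)$ and row $i > 1$ is $(h_i, 0, \ldots, 1, \ldots, 0)$ with the 1 in column $i$. The function name and the numerical values are hypothetical.

```python
def ohnf_encrypt(p, d, h):
    """Reduce the plaintext p modulo an OHNF basis H.

    d is the top-left entry of H (the lattice determinant) and h lists the
    remaining first-column entries h_2, ..., h_n. Returns the single non-zero
    entry c of the ciphertext (c, 0, ..., 0).
    """
    if len(h) != len(p) - 1:
        raise ValueError("h must have one entry per row below the first")
    c = p[0]
    for p_i, h_i in zip(p[1:], h):
        c -= p_i * h_i   # subtracting p_i times row i cancels coordinate i
    return c % d         # subtracting a multiple of row 1 gives 0 <= c < d

d = 101                  # illustrative prime determinant
h = [37, 58, 90]
p = [2, -1, 3, 0]        # small plaintext vector
c = ohnf_encrypt(p, d, h)
```

The ciphertext differs from the plaintext by an integer combination of the rows of $\mathbf{H}$, so it is still of the form $\mathbf{p} + \mathbf{k}\mathbf{R}$, while only one word-sized-to-$d$ scalar needs to be stored and multiplied during decryption.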
As a second contribution, we propose a refinement to the technique presented in [21]. In essence, the computational cost of the rounding operation is reduced at the cost of introducing new errors, but these errors do not change the final result since they are contained within the scaled gap $g\varepsilon$ of Fig. 3. Instead of computing a rounding $\lfloor \mathbf{c}\mathbf{R}^{-1} \rceil$, it is suggested to only compute a flooring $\lfloor \mathbf{c}\mathbf{R}^{-1} \rfloor$, which reduces the amount of precomputed data and simplifies the decryption algorithm. In particular, one avoids having to store the value of $2d$ in RNS, as well as any additions related to it, further reducing the memory accesses required by the decryption algorithm, which were one of the main performance bottlenecks in [20].
Computing the flooring may cause the addition of a vector $\mathbf{v}_e'$ with coefficients in $\{-1, 0\}$ to the extra error $\mathbf{v}_e$, which should not impact the size of $g$. The consequence is that it is now sufficient to compute the following quantity:
When the total error is small enough relative to $g$, it is equal to its centered remainder modulo $g$. This is summarized in the following proposition, which sets a sufficient bound for $g$ for this new refinement.
Let $B$ be an RNS base with $M_B > \|\mathbf{c}\|_\infty$. Let $d$ and
and,
Proof. Because of the hypothesis $M_B > \|\mathbf{c}\|_\infty$, it is clear that $\left\| \frac{\mathbf{c}\mathbf{R}}{M_B} \right\|_\infty < d$, and that any coefficient of $\mathbf{r}$ verifies $0 \le \mathbf{r}_i < d(1 + |B|)$. It follows that $\mathbf{r} = |g\mathbf{c}\mathbf{R} \ldots$, then by definition of modc we have $\tilde{\mathbf{v}}_e \ \mathrm{modc}\ g = \tilde{\mathbf{v}}_e$. Moreover, it is easy to show that, for both even and odd $g$'s
This concludes the proof.
Algorithm 2 summarizes this strategy. The cost is dominated by operations in base $B$, whose size is $k \sim \log_b(d)$, and is linear in $k$.
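The effect of the flooring refinement described in this section can be checked numerically on scalars. The sketch below, with hypothetical helper names, verifies that replacing a rounding by a flooring changes the result by an amount in $\{-1, 0\}$, and shows that the flooring of $a/d$ only needs a reduction modulo $d$, so that no quantity tied to $2d$ has to be stored. It is an illustration on plain integers, not the RNS procedure of Algorithm 2.

```python
from fractions import Fraction
from math import floor

def round_half_up(x):
    """Nearest integer, ties broken upward."""
    return floor(x + Fraction(1, 2))

def floor_via_mod(a, d):
    """floor(a/d) for d > 0 using one reduction modulo d; the division is exact."""
    return (a - a % d) // d

# Difference between flooring and rounding over a range of fractions a/d.
diffs = {floor(Fraction(a, d)) - round_half_up(Fraction(a, d))
         for d in (1, 3, 8, 57) for a in range(-200, 201)}

# Agreement between the modular formulation and the exact floor.
agree = all(floor_via_mod(a, d) == floor(Fraction(a, d))
            for d in (1, 3, 8, 57) for a in range(-200, 201))
```

Since the extra term lies in $\{-1, 0\}$, it behaves like one more small integer error and is handled by the same centered-remainder step as the error coming from the RNS computation.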
THEORETICAL ANALYSIS
A comparative theoretical asymptotic analysis is conducted herein between the MRS and RNS variants of the round-off and a standard multi-precision approach. For fairness, we assume that all approaches use lattices of OHNF, and hence the RNS approach corresponds to the one proposed in Section 4. The goal is to estimate the order of magnitude of the gain which could be expected, in order to obtain references for further interpretation of the implementation results in the next part.
An atomic RNS multiplication is a small multiplication of two residues followed by a reduction by the underlying modulus. Such an operation costs more than a single-word multiplication. But in practice two points make it possible to minimize the extra cost of the modular reduction. First, when one computes sums of products, as is the case in such a lattice context, using a lazy approach decreases the number of such reductions. Second, by using pseudo-Mersenne moduli, i.e., of the form $m_i = 2^r - c_i$ with $c_i < \sqrt{m_i}$, a modular reduction represents roughly $\frac{3}{4}$ of an elementary multiplication plus a few additions. Consequently, in the following analysis the extra cost of modular reductions inside RNS channels does not influence the asymptotic comparisons.
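Both points can be sketched for a single channel as follows. The modulus $2^{31} - 19$ and the function names are illustrative assumptions; Python integers are unbounded, so the accumulator never overflows here, whereas a fixed-width implementation would have to bound the number of products accumulated before reducing.

```python
R_BITS = 31
C = 19                      # illustrative; satisfies C < sqrt(2**R_BITS - C)
M = (1 << R_BITS) - C       # pseudo-Mersenne modulus 2^r - c
MASK = (1 << R_BITS) - 1

def reduce_pm(x):
    """Reduce x >= 0 modulo M by folding: hi*2^r + lo is congruent to hi*c + lo."""
    while x >> R_BITS:
        # Each fold costs one multiplication by the small constant C.
        x = (x >> R_BITS) * C + (x & MASK)
    # Now x < 2^r = M + C < 2M, so one conditional subtraction suffices.
    return x - M if x >= M else x

def lazy_dot(xs, ys):
    """Sum of products in one channel with a single final reduction."""
    acc = 0
    for x, y in zip(xs, ys):
        acc += x * y        # no reduction inside the loop (lazy approach)
    return reduce_pm(acc)

xs = [M - 1, 123456789, 2000000000, 42]
ys = [M - 2, 987654321, 1999999999, 7]
result = lazy_dot(xs, ys)
```

For a product of two residues, the first fold multiplies an $r$-bit high part by $c$ and the second fold multiplies a much shorter high part by $c$, which is the origin of the fractional cost quoted above.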
Time Complexity
Let one use Hadamard's bound on $d = \det(\boldsymbol{R})$, i.e., $\log(d) \in O(n \log n)$. By hypothesis, $M_B, M_{B'} \sim d$. Then the costs of the MRS and RNS approaches are respectively $O(n^3 (\log n)^2)$ and $O(n^2 \log n)$ elementary multiplications. When the round-off is done via a classical multi-precision computation, the precision at which $\boldsymbol{R}^{-1}$ should be precomputed has to be determined. Supported by the discussion in [7], let one assume that
is an approximation of $\boldsymbol{R}^{-1}$ used to compute $\lfloor \boldsymbol{c}\boldsymbol{R}^{-1} \rceil$. The precision $t$ must be such that
Consequently, a sufficient precision is given by 2
$|(\boldsymbol{R}^{-1})_{i,j}|,\ 1 \le j \le n) > 0$. As explained in the previous part, the integer $g$ in the accelerated RNS technique is basically proportional to $\varepsilon$. This leads to the conclusion that the computational cost of $\lfloor \boldsymbol{c}_1 \boldsymbol{R}_{1,*} \rceil$, when considered with a value $\frac{1}{\varepsilon} \sim g$ bounded above by the radix $b$ for a fair comparison with the previous full RNS approach, is $O(n^{2+c} (\log n)^{1+c})$ elementary multiplications, where $c$ depends on the multiplication algorithm ($c = 1$ for schoolbook, $c \approx 0.585$ for Karatsuba, etc.).
Table 1 summarizes this discussion about asymptotic complexities. Theoretically, the MRS approach does not seem competitive compared with a classical multi-precision computation. However, two points could favor the MRS algorithm. First, the theoretical bound of the underlying multi-precision multiplication algorithm (i.e., the coefficient $c$) may not be that influential in practice, and it should be considered cautiously. Second, even if the RNS-to-MRS conversion is intrinsically sequential, the rest of the algorithm, in particular the first base extension, can take advantage of the parallelization features of RNS in order to accelerate the process.
Concerning the full RNS approach, one can expect to reach noticeable speed-up factors, since its asymptotic bound turns out to be an order of magnitude lower than that of the multi-precision approach. This will be tested in practice in Section 7.
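As an informal worked comparison, the bounds stated above can be divided by one another, ignoring constants and treating the $O(\cdot)$ expressions as if they were tight (which is an assumption of this sketch):

```latex
\frac{n^{2+c}(\log n)^{1+c}}{n^{2}\log n} = (n\log n)^{c},
\qquad
\frac{n^{3}(\log n)^{2}}{n^{2}\log n} = n\log n .
```

Under this reading, the expected gain of the full RNS approach over multi-precision grows like $(n \log n)^c$, that is $n \log n$ for schoolbook multiplication and roughly $(n \log n)^{0.585}$ for Karatsuba, while the gain over the MRS approach grows like $n \log n$.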
Space Complexity
The precision $t$ obtained previously in order to get a sufficient approximation of $\boldsymbol{R}^{-1}$ indicates that the size of the precomputations, which are part of the secret key, for a multi-precision approach is $O(n^2 \log n)$ bits. Actually, it has been noticed that one needs approximately $\log(d) - \log(\varepsilon)$ bits, which is also the size of $B \cup g$ in the full RNS approach.
The precomputed data for both RNS techniques are clearly dominated by the (first row of the) matrices $\tilde{\boldsymbol{R}}$ in bases $B$ and $B'$, making their size $O(n^2 \log n)$ asymptotically. In particular, this size for RNS is half that of MRS. Moreover, the size of the precomputations should be almost identical for the multi-precision and full RNS approaches.
PARALLELIZATION AND IMPLEMENTATION DETAILS
In this section, we provide details on the parallelization and efficient implementation of Algorithm 2. First, we detail how such an implementation can take advantage of the different levels of parallelism supported by general purpose computing platforms, such as GPUs and multi-core CPUs with SIMD extensions. In particular, since GPUs offer massive parallelism, allowing for the concurrent execution of a large number of work-groups, they are especially useful for the decryption of multiple messages in parallel. Moreover, the combination of vector operations with RNS number processing makes the system suitable for SIMD parallelism, where the same instruction is used to process multiple data at the same time.
Previous research suggested that the use of GPUs might not be useful for GGH-style algorithms [20]. Herein, we discuss how the algorithm proposed in [21], along with the refinements proposed in this paper, improves data parallelism and reduces the need for synchronization, making GPU implementations a competitive alternative to CPUs.
Algorithm and Precomputations
An optimized version of Algorithm 2 can be found in Algorithm 3. Lattices with OHNF were used, wherein input ciphertexts have their last $n - 1$ components set to 0, and $\boldsymbol{c}_1 = c$ lies in $[0, \det(\boldsymbol{R}))$. This reduces the complexity of the overall algorithm, as shown in Section 5. Furthermore, the proposed refinement reduces the amount of precomputed data and the number of computational steps, by removing the need to precompute $d$ in Algorithm 1, as well as any additions related to it. Thus, the required precomputed data is as follows (we recall the notation $\tilde{\boldsymbol{R}} = |g M_B \boldsymbol{R}$ (17), then Algorithm 3 is correct.
Parallelization
A level of parallelism to be considered for massively parallel architectures, such as GPUs, is that of deciphering multiple messages in parallel, by instantiating the same algorithm in multiple work-groups. This type of implementation is particularly useful in server settings, where multiple connections are handled at the same time, and thus several decryptions must be performed simultaneously. A similar approach could be applied to multi-core CPU architectures with SIMD extensions: each SIMD lane would then process the decryption of a different ciphertext in parallel. Nevertheless, as explained in the remainder of this section, the underlying arithmetic provides a great level of parallelism.
In vector and matrix arithmetic, an inherent level of parallelism relies on the independence of components along the columns. In the case of GGH, the round-off is mainly composed of scalar/vector products. Thus, the set of products $\{\boldsymbol{c}_1 \boldsymbol{R}_{1,i} \mid i \in [1, n]\}$ in multi-precision or $\{\boldsymbol{c}_1 \tilde{\boldsymbol{R}}_{1,i} \mid i \in [1, n]\}$ in RNS can be easily scattered on a set of CPU cores or GPU work-items.
Moreover, the RNS offers a second level of concurrency (cf. Fig. 4). The basic arithmetic operations are performed independently and in a carry-free way on the residues. This can be exploited in different ways, depending on the available resources. For instance, on a multi-core CPU, whereas the first level of parallelization can be supported by multithreading, the second level can be handled by SIMD instruction sets. Similarly, on a GPU, each work-item can process a single residue without any dependency on other residues. The exploitation of data parallelism is not as viable with multi-precision arithmetic, since the carry propagation chains introduce data and control dependencies, which are not efficiently handled by SIMD and GPU technologies.
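The two levels of concurrency can be sketched as two nested, independent loops: one over the components of the row, and one over the RNS channels. The following is a minimal illustration only; the helper names and the small moduli are hypothetical, and the CRT reconstruction is included solely to check the result, since the algorithm discussed here avoids it.

```python
from math import prod


def to_rns(x, moduli):
    """Residues of x for each modulus of the base."""
    return [x % m for m in moduli]


def rns_mul(xs, ys, moduli):
    """Channel-wise product: each channel is independent and carry-free."""
    return [(x * y) % m for x, y, m in zip(xs, ys, moduli)]


def from_rns(residues, moduli):
    """CRT reconstruction, used here only to verify the example."""
    M = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        total += r * Mi * pow(Mi, -1, m)
    return total % M


def scalar_row_product_rns(c, row, moduli):
    """First level: one independent task per component of the row.
    Second level: one independent task per RNS channel (inside rns_mul)."""
    c_res = to_rns(c, moduli)
    return [rns_mul(c_res, to_rns(r, moduli), moduli) for r in row]
```

On a CPU the outer loop would map to threads and the inner one to SIMD lanes; on a GPU both would map to work-items, depending on how the work is partitioned.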
In [20], these two levels of parallelization have been implemented on GPUs and multi-core CPUs to test the efficiency of the first basic RNS/MRS approach. It turned out that the memory bandwidth of GPUs was a critical issue, which led to better results on the CPU. In this paper, with the removal of the computation of mixed-radix digits, memory accesses are significantly reduced and, moreover, data dependencies are kept to a minimum. The proposed refinements also contribute to reducing the amount of data that needs to be accessed during the execution of the algorithm. Hence, we show that whereas CPUs can be used to provide low latency implementations of Algorithm 3, GPUs can attain much higher throughput.
Implementation of the Parallel Algorithm
The proposed algorithm was implemented targeting Intel multi-core CPU architectures with the AVX2 SIMD instruction set, and NVIDIA GPUs.¹ Several Application Programming Interfaces (APIs) were used in this work to exploit different levels of data parallelism, namely OpenCL [31] for GPU programming, OpenMP [32] for exploiting multithreaded CPU parallelism, and AVX2 intrinsics [33] for SIMD parallelism:
OpenCL adopts a C-like programming language to allow developers to express data parallelism. It supports a kernel keyword, which is used to characterize a function that will be spawned multiple times and executed in parallel. The instances of this function are work-items, which are distinguished by their ID, and organised in work-groups. Work-items in the same work-group may not only share data among themselves, but also use barriers for synchronization. The device where the computation takes place is called the compute device. It is controlled by an OpenCL host device using a command queue.
OpenMP provides a set of compiler directives and library routines for thread creation, workload distribution, and synchronization. As an example, #pragma omp parallel creates a set of threads, the number of which is typically equal to the number of cores; #pragma omp for splits the execution of loop iterations among the threads; and #pragma omp barrier creates a synchronization point, where each thread waits until all of the others have reached this point.
AVX2 consists of a set of extensions to the x86 instruction set architecture. It offers sixteen 256-bit registers in 64-bit mode (eight in 32-bit mode), and a host of instructions to operate on them. The registers are considered as vectors of integers, and instructions operate on each of the lanes individually.
These technologies were exploited by applying the RNS approach described in Fig. 4 to Algorithm 3, the result of which is graphically represented in Fig. 5. First, all of the RNS arithmetic has been implemented from scratch, using native instructions of the CPU and GPU. For the CPU, since the targeted SIMD architecture supported 64-bit arithmetic, moduli were chosen to be smaller than $2^{32}$, so that residues fit in uint32_t integers, and any product of residues could be handled through the uint64_t type. In contrast, since the GPU only supported 32-bit arithmetic, moduli smaller than $2^{16}$ were chosen.
The moduli of the large RNS base $B$ are systematically chosen in pseudo-Mersenne form, that is $m = 2^r - c$. For the CPU, these moduli satisfied $c < \sqrt{m}$. The intrinsic modular reduction, inside the rings $\mathbb{Z}/m\mathbb{Z}$, can then be carried out very efficiently through bitwise AND, shifting, and a few multiplications by the constant $c$, as shown in Algorithm 4. This algorithm is based on the following congruence:
$$2^r \equiv c \pmod{m}.$$
For the GPU, due to the smaller bit-width of the moduli, the condition $c < \sqrt{m}$ could not be satisfied, and the equality
$$x = (x \bmod 2^r) + c \left\lfloor x / 2^r \right\rfloor$$
had to be applied more than twice in Algorithm 4. The precise number of times this assignment had to be applied was computed by taking into account the largest number to be reduced. Algorithm 4 is particularly useful for the exploitation of data parallelism within the context of SIMD architectures and GPUs, since it has only one control dependency. Whereas the "if" can be implemented on GPUs with the "if" keyword of C, for AVX2 it was implemented with the following code (where the minimum value between $x[i]$ and $x[i] - m \bmod 2^{32}$ is obtained for each lane $i$ of the SIMD register $x$):
    x = _mm256_min_epu32(x, _mm256_sub_epi32(x, m));    (21)
By taking the moduli as previously described, the cryptogram $c$ is converted into RNS by first splitting its binary representation into $r$-bit words, then multiplying the $i$th word by $2^{(i-1)r} \equiv c^{i-1} \pmod{m}$, and adding the products modulo $m$. This process can be identified in the top-left corner of Fig. 5.
1. The source code of the proposed system will be made publicly available at http://web.tecnico.ulisboa.pt/paulo.sergio/gghTc17.tar.gz
Algorithm 4. Modular Reduction for a Pseudo-Mersenne Modulus
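Since the listing of Algorithm 4 is not reproduced in this text, the following is only a plausible sketch of a pseudo-Mersenne reduction built from the ingredients described above (bitwise AND, shifts, multiplications by $c$, and a single conditional subtraction). The function names and the `folds` parameter are assumptions of this example, not the authors' code. The second function emulates, on one unsigned lane, the branch-free conditional subtraction obtained by taking a minimum.

```python
def pseudo_mersenne_reduce(x, r, c, folds=2):
    """Reduce 0 <= x < 2**(2*r) modulo m = 2**r - c.

    Each fold uses 2**r = c (mod m) to replace x by
    (x mod 2**r) + c * floor(x / 2**r). Two folds suffice when c*c < m;
    larger c (as in the GPU case) needs more folds.
    """
    m = (1 << r) - c
    mask = (1 << r) - 1
    for _ in range(folds):
        x = (x & mask) + c * (x >> r)   # AND, shift, one small multiplication
    if x >= m:                          # the single control dependency
        x -= m
    return x


def cond_sub_min(x, m, lane_bits=32):
    """Branch-free 'if x >= m: x -= m' on one unsigned lane.

    If x < m, x - m wraps around to a large unsigned value, so the minimum
    is x; otherwise the minimum is x - m.
    """
    wrap = (1 << lane_bits) - 1
    return min(x, (x - m) & wrap)
```

For example, with $r = 8$ and $c = 29$ (so that $c^2 > m$), two folds are not enough in general, whereas three folds bring any 16-bit input below $2m$, which mirrors the remark about the GPU moduli.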
The modulus $m_s$, which is noticeably smaller than the other "standard" moduli, is chosen to be a Mersenne number, $m_s = 2^{r_{m_s}} - 1$. This is important for two reasons. First, since it has a small bit-width, one can represent its residues using uint8_t integers, minimizing the amount of data one has to load, and maximizing the number of operations carried out by each SIMD instruction. Second, the reduction algorithm is simplified, and corresponds to additions, since
$$2^{r_{m_s}} \equiv 1 \pmod{m_s}.$$
When $c$ is represented in binary using $r_{m_s}$-bit words, the process of converting it to channel $m_s$ corresponds to adding the words modulo $m_s$, as illustrated in Fig. 5. Since $g$ has a different role than the moduli of $B$, it is also chosen as a Mersenne number, and the simplified reduction method also applies in this case. Both $g$ and $m_s$ are chosen to be coprime with the moduli in $B$.
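The word-splitting conversion described above can be sketched as follows, for both a pseudo-Mersenne channel and a Mersenne channel. The function names are hypothetical, and arbitrary-precision Python integers stand in for the multi-word binary representation of the cryptogram; the example moduli are chosen only for illustration.

```python
def split_words(x, r):
    """Little-endian r-bit words of a non-negative integer."""
    mask = (1 << r) - 1
    words = []
    while x:
        words.append(x & mask)
        x >>= r
    return words or [0]


def to_pseudo_mersenne_channel(x, r, c):
    """x mod (2**r - c): the i-th word is weighted by c**(i-1) mod m."""
    m = (1 << r) - c
    acc, power = 0, 1              # power holds c**(i-1) mod m
    for w in split_words(x, r):
        acc = (acc + w * power) % m
        power = (power * c) % m
    return acc


def to_mersenne_channel(x, r):
    """x mod (2**r - 1): since 2**r = 1 (mod m), the words are simply added."""
    m = (1 << r) - 1
    return sum(split_words(x, r)) % m
```

In an actual implementation the powers of $c$ would presumably be precomputed, and the additions would be reduced lazily rather than after every word.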
The conversion of c to RNS was always conducted in the CPU, even for the considered GPU implementation, since the splitting of c into words of irregular bit-widths is a memory intensive operation best suited to the CPU cache system. This process exploited both multi-threading and SIMD extensions. After this operation, threads on the CPU synchronize, and in the case of the GPU implementation transfer the RNS representation of c, along with all other precomputed data, to the GPU.
The multiplication of $c_B$, $c_{m_s}$ and $c_g$ by $\tilde{\boldsymbol{r}}_B$, $\tilde{\boldsymbol{r}}_{m_s}$ and $\tilde{\boldsymbol{r}}_g$, respectively, ensues. These operations roughly correspond to the multiplication of $\boldsymbol{c}$ by the adjoint matrix of $\boldsymbol{R}$ and $g$. For the computation of $c_m \times \tilde{\boldsymbol{r}}_m \bmod m,\ \forall m \in B$ on the CPU, the workload was uniformly split between the cores, and multiple residues were processed simultaneously using SIMD instructions. In particular, the matrix $\tilde{\boldsymbol{r}}_B$ was stored in dimension-major order but in groups of 8 residues, i.e., the first 8 contiguous memory addresses are filled with the residues of the first component of $\tilde{\boldsymbol{r}}$ with respect to $m_1, \ldots, m_8$, the following eight addresses hold the residues of the second component of $\tilde{\boldsymbol{r}}$ with respect to $m_1, \ldots, m_8$, and so on. SIMD accesses are optimized in this way, since AVX2 allows for the load of 8 uint32_t integers with a single instruction. In contrast, since the GPU had no SIMD extensions, each work-item was responsible for processing all residues of a single dimension (thus one requires a work-group with a number of work-items equal to the lattice dimension to process the decryption of a message). In this case, the matrix $\tilde{\boldsymbol{r}}_B$ was stored in regular dimension-major order (i.e., first, $|\tilde{\boldsymbol{r}}|_{m_1}$ is stored, then the components reduced by the second modulus, and so on), so that memory accesses of work-items within a warp were coalesced. The multiplications $c_m \times \tilde{\boldsymbol{r}}_m \bmod m,\ \forall m \in \{m_s, g\}$ followed a similar approach, but due to the characteristics of the moduli, have simpler reduction algorithms.
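The two storage layouts described above can be made concrete as index functions. This is a sketch under assumptions: the function names are hypothetical, indices are 0-based, the number of moduli is taken to be a multiple of the SIMD width, and the way successive groups of 8 moduli follow one another in memory is a guess that the text does not spell out.

```python
def cpu_index(j, i, n, lanes=8):
    """Address of the residue of component j modulo the i-th modulus,
    with residues interleaved in groups of `lanes` moduli (CPU layout)."""
    group, lane = divmod(i, lanes)
    return group * lanes * n + j * lanes + lane


def gpu_index(j, i, n):
    """Address of the same residue when all n components modulo the first
    modulus come first, then those modulo the second, and so on (GPU layout)."""
    return i * n + j
```

With the first layout, one 256-bit load fetches the residues of a single component for 8 consecutive moduli. With the second, work-items handling consecutive components read consecutive addresses for a given modulus, which is what allows coalesced accesses.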
Afterwards, threads are synchronized, and $\boldsymbol{s}_{m_s}$ and $\boldsymbol{s}_g$ are produced through a process essentially similar to a base extension. This allows one to compute $\boldsymbol{t}_{m_s} = \left(\left(\boldsymbol{a}_g + \boldsymbol{s}_g\right) \operatorname{modc} g + \boldsymbol{s}_{m_s}\right) \boldsymbol{R}_{m_s} \bmod m_s$. It should be noted that this latter vector/matrix multiplication benefits the most from the usage of uint8_t integers, since it is one of the most burdensome operations of the algorithm, and this data type maximizes the throughput of the SIMD extensions. For the CPU implementation, the matrix $\boldsymbol{R}_{m_s}$ was stored in column-major order in groups of 32 entries, since AVX2 allows one to load 32 8-bit integers with a single instruction. In contrast, $\boldsymbol{R}_{m_s}$ was stored in regular row-major order when the message was processed by the GPU.
Finally, the decryption of the cryptogram can be produced by computing $\boldsymbol{p}_{m_s} = \boldsymbol{t}_{m_s} + \boldsymbol{a}_{m_s} \operatorname{modc} m_s$. For the GPU implementation, this computation is followed by a memory transfer, where the result is copied to the CPU.
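The centered remainder used in the last two steps can be sketched as below. The convention chosen for even moduli is an assumption of this example; for the odd (Mersenne) values of $g$ and $m_s$ used here, the usual conventions coincide.

```python
def modc(x, g):
    """Centered remainder of x modulo g, in the range [-(g // 2), g - g // 2)."""
    r = x % g
    if r >= g - g // 2:
        r -= g
    return r
```

The point of the centered remainder is that a small signed value is recovered exactly from its residue: for $m_s = 127$, any integer $v$ with $|v| \le 63$ satisfies `modc(v % 127, 127) == v`.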
In order to take advantage of the increased throughput of GPUs, the simultaneous decryption of several messages was also considered in this case. This allows one to maximize hardware usage, and amortizes the penalty of transferring the precomputed values to the GPU.
EXPERIMENTAL RESULTS
The decryption algorithm was implemented and thoroughly tested on three systems, namely (i) an i7 4770K with a K40c, (ii) an i7 5960X with a GTX 980, and (iii) an i7 6700K with a Titan X, with the characteristics described in Table 2. Both the i7 5960X and the K40c have a larger number of cores, providing more parallelism, than the i7 4770K and the GTX 980, respectively, but they operate at lower frequencies. Analyzing the execution on these sets of platforms enables us to understand which of these features has the most impact on the performance of the decryption operation. Moreover, we analyze how performance scales across different generations of the same devices: whereas the i7 4770K and the i7 5960X feature the Haswell micro-architecture, the i7 6700K features Skylake. Similarly, the K40c features the Kepler micro-architecture, whereas the GTX 980 and the Titan X feature Maxwell.
All CPU code was compiled with gcc 4.9.2 with the -O3 flag, and execution times were measured using the high resolution clock provided by the standard library of C++. Furthermore, GPUs were programmed using OpenCL. The following labels will be used when presenting the experimental results: the label "RNS" corresponds to a CPU sequential implementation of the proposed algorithm; "RNS AVX2" to the parallel version of this algorithm, where both multithreading and AVX2 extensions are exploited; "MRS" to the sequential MRS algorithm proposed in [20] ; "MRS AVX2" to the parallel version of the algorithm proposed in [20] , where both AVX2 and multithreading are exploited; "NTL" to a multi-precision implementation of decryption using the NTL 9.11.0 library [34] compiled with GMP 6.0.0 [35] ; and "RNS GPU" to an RNS implementation on the GPU. Whereas the delay of the "RNS GPU" implementation comprises the decryption of 512 messages, the delay of the remaining implementations encompasses the decryption of a single message.
In Fig. 6 , the delay for the different CPU implementations on the i7 6700K can be found. As predicted in Table 1 , the delay of the MRS implementation proposed in [20] increases rapidly with the dimension of the lattice. Nevertheless, it increases the possibility of exploiting parallelism, which results in a smaller delay for the MRS AVX2 implementation than the NTL algorithm. In contrast, the decryption delay of the RNS and NTL approaches increases at a slow pace. Furthermore, the RNS approach provides greater opportunities to benefit from parallelism, which makes the proposed multithreaded AVX2 algorithm the most efficient with respect to the delay.
Fig. 7 shows how the considered CPU versions of the decryption algorithm on the i7 6700K compare with the NTL implementation. Whereas the MRS implementation is slower than NTL, the increased data parallelism allows the MRS AVX2 algorithm to be faster than NTL. It should be noted that it would not be possible to exploit parallelism to the same extent with a multi-precision approach, since carry chains prevent a direct application of SIMD technologies. While one could expect the speedup of the proposed approach to increase with the lattice dimension, since we have from Table 1 that decreases with n, because GMP selects which multiplication algorithm to use based on the length of the operands [35]. Nevertheless, one can conclude that the high levels of speedup are sustainable, unlike with the MRS AVX2 approach, for which the speedups decrease as the lattice dimension increases. In fact, an average speedup of 68 is achieved for the RNS AVX2 approach.
The delay of the RNS AVX2 decryption algorithm on the three considered CPUs, as well as of the RNS GPU implementations, can be found in Fig. 8. First, whereas the i7 4770K is faster than the i7 5960X for smaller dimensions, the execution time of the former does not scale as well as that of the latter for larger dimensions. One can conclude that the larger number of cores of the i7 5960X, combined with its larger memory bandwidth, enables an improved scalability, which matters more than the higher operating frequency of the i7 4770K for larger dimensions. Furthermore, the implemented approach has scaled well to the new Intel micro-architecture, Skylake. Since Intel has provided very few details about this micro-architecture, it is difficult to interpret this result.
Nevertheless, this improvement could result from either integer arithmetic optimizations, or from a better exploitation of Instruction Level Parallelism (ILP), since a large number of instructions can be executed independently, due to the properties of RNS. Finally, the delay of the GPU implementations is much higher than that of the CPU versions. In fact, GPUs are optimized for large throughput, while sacrificing delay, when compared with CPUs. Moreover, whereas the delay is about the same for the GTX 980 and the Titan X GPUs, which share the same micro-architecture, it is much larger for the K40c GPU. This confirms that the algorithm benefits in general from architectural improvements such as the reduced latency of arithmetic instructions of the NVIDIA Maxwell micro-architecture with respect to Kepler.
The throughput of the different implementations is represented in Fig. 9. It should be noted that the considered GPUs only allow for work-groups with at most 1,024 work-items. Hence, for lattices with dimensions larger than 1,000, we have attributed to each work-item the processing of 2 dimensions, rather than one, as originally described in Section 6.3. We can see that the sustained throughputs of the RNS GPU implementations are much higher than those of the CPU. This is in contrast with the results of [20], and derives from the fact that the algorithm herein proposed requires significantly fewer synchronizations and memory accesses, due to the avoidance of the computation of the MRS digits, and due to the refinement in Section 4. A maximum throughput of 11,832 messages/s is obtained for the Titan X and for dimension 1,000.
From the previous results, one can conclude that while multithreaded CPUs with SIMD extensions can implement lattice-based decryption with low latency, GPUs are preferable when one considers throughput as a figure of merit. Furthermore, the efficiency of these implementations was only possible due to the use of RNS, and due to the proposed approach which reduces the amount of synchronization and memory accesses that are required when compared with previous approaches [20] , making lattice-based decryption even more suitable for platforms exploiting data parallelism.
The typical size of the precomputed data required for the RNS, MRS and NTL versions of the decryption algorithm can be found in Fig. 10. We can see that, as predicted in Section 5.2, the amount of extra data required by the RNS approach when compared with the NTL is negligible, with an average increase in size of 8.7 percent. Moreover, one can see that this size is more than halved for the RNS approach when compared with the MRS, since one no longer needs a second large base, the precomputed data associated with it, or the data associated with computing the mixed-radix digits.
RELATED WORK
In [36], a GGH-style cryptosystem supported on lattices of OHNF was proposed. There, decryption took place as described in (3), except that $\boldsymbol{c}\boldsymbol{R}^{-1}\det(\boldsymbol{R})$ was computed using RNS, by computing each entry of the result for all moduli of the RNS base and reconstructing the values with the CRT. Afterwards, a multi-precision approach is used for the remaining steps of (3). This algorithm was concluded to be less efficient than the MRS AVX2 approach in [20]. Moreover, the method proposed in [20] was replicated herein as the parallel MRS AVX2 approach. One can conclude that by combining the method proposed in [21] with the refinements proposed in this paper, as well as by exploiting lattices of OHNF, the bottlenecks that prevented further efficiency improvements in [20] are removed, namely by reducing the number of synchronization points and memory accesses. This enables an improvement in latency of up to 20 times for dimension 1,800 when considering the RNS AVX2 implementation. It also more than halves the amount of precomputed data.
Fig. 7. Speedup of the CPU implementations on the i7 6700K using the multi-precision approach as a baseline.
Fig. 8. Delay of the RNS AVX2 decryption algorithm on CPU platforms, as well as of the GPU version. The y-axis is in logarithmic scale.
Fig. 9. Throughput of the RNS AVX2 decryption algorithm on CPU platforms, as well as of the GPU version. The y-axis is in logarithmic scale.
Field-Programmable Gate-Arrays (FPGAs) have been previously considered for cryptographic implementations as a more energy-efficient accelerator alternative to GPUs. An implementation of an RNS lattice-based cryptosystem for FPGAs can be found in [21] . Despite avoiding the computation of MRS digits, it does not make use of lattices of OHNFs, nor makes use of the refinements proposed herein. The designed architecture did not fit the target FPGAs (Virtex-5 and Kintex-7) for lattices with dimensions greater than 128. Therefore it is hard to compare the experimental results with the previous implementations.
CONCLUSION
In this work, the proposals of [20], [21] for the decryption operation of GGH cryptosystems exploiting the RNS were considered, improved upon, and concretized. In particular, we applied the technique of using lattices of OHNF to the scheme of [21], and further simplified the decryption algorithm. Since the RNS allows large integer arithmetic to be split into multiple channels, the presented methods are not only amenable to parallelization, but are also generic, since they benefit from the integer arithmetic optimizations present in most computing platforms, while avoiding the overhead of multi-precision arithmetic libraries. In fact, it was shown that asymptotically the proposed approach is an order of magnitude faster than a multi-precision approach. The RNS techniques were implemented and executed not only on GPUs but also on CPUs. Whereas with GPUs a maximum throughput of 11,832 messages/s was obtained, a speedup of up to 20 was obtained when comparing the proposed CPU implementation with the most efficient of the state of the art, which exploits the MRS [20]. Finally, we conclude that even though the increased efficiency comes at the cost of a larger amount of precomputed data, this increase is not significant.
Fig. 10. Size of the required precomputed data for the decryption algorithm.
