Fast implementation of Curve25519 using AVX2 by FAZ-Hernández, Armando & López, Julio
UNIVERSIDADE ESTADUAL DE CAMPINAS
SISTEMA DE BIBLIOTECAS DA UNICAMP
REPOSITÓRIO DA PRODUÇÃO CIENTIFICA E INTELECTUAL DA UNICAMP
Versão do arquivo anexado / Version of attached file:
Versão do Editor / Published Version
Mais informações no site da editora / Further information on publisher's website:
https://link.springer.com/chapter/10.1007/978-3-319-22174-8_18
DOI: 10.1007/978-3-319-22174-8_18
Direitos autorais / Publisher's copyright statement:
©2015 by Springer. All rights reserved.
DIRETORIA DE TRATAMENTO DA INFORMAÇÃO
Cidade Universitária Zeferino Vaz Barão Geraldo
CEP 13083-970 – Campinas SP
Fone: (19) 3521-6493
http://www.repositorio.unicamp.br
Fast Implementation of Curve25519 Using AVX2
Armando Faz-Hernández(B) and Julio López
Institute of Computing, University of Campinas,
1251 Albert Einstein, Cidade Universitaria, Campinas, Brazil
{armfazh,jlopez}@ic.unicamp.br
Abstract. AVX2 is the newest instruction set on the Intel Haswell
processor that provides simultaneous execution of operations over vectors
of 256 bits. This work presents the advances on the applicability of AVX2
on the development of an efficient software implementation of the elliptic
curve Diffie-Hellman protocol using the Curve25519 elliptic curve. Also,
we will discuss some advantages that vector instructions offer as an alter-
native method to accelerate prime field and elliptic curve arithmetic. The
performance of our implementation shows a slight improvement against
the fastest state-of-the-art implementations.
Keywords: AVX2 · SIMD · Vector instructions · Elliptic Curve Cryp-
tography · Prime Field Arithmetic · Curve25519 · Diffie-Hellman Pro-
tocol
1 Introduction
Nowadays, the use of Elliptic Curve Cryptography (ECC) schemes has been
widely spread in secure communication protocols, such as the key agreement
Elliptic Curve Diffie-Hellman (ECDH) protocol. In terms of performance, the
critical operation in elliptic curve protocols is the computation of point multi-
plication. This operation can be accelerated by performing efficient computation
of the underlying prime field arithmetic.
Recently, proposals of new elliptic curves defined over prime fields that accel-
erate the finite field arithmetic operations have appeared in [2,3,10]. Such pro-
posals use pseudo-Mersenne primes (p = 2m − c) which enable fast modular
reduction. One of these proposals is based on the curve named Curve25519,
which has been gained a lot of relevance due to their efficiency and secure imple-
mentation. The Curve25519 is a Montgomery elliptic curve defined over F2255−19.
On this curve, only the x-coordinate of a point P is required to compute the
x-coordinate of the point multiplication kP , for any integer k.
For the past few years, processors have benefited from increasing support
for vector instructions, which operate each instruction over a vector of data.
Armando Faz-Hernández and Julio López were partially supported by the Intel Labs
University Research Office.
Julio López was partially supported by FAPESP, Projeto Temático grant number
2013/25.977-7.
c© Springer International Publishing Switzerland 2015
K. Lauter and F. Rodŕıguez-Henŕıquez (Eds.): LatinCrypt 2015, LNCS 9230, pp. 329–345, 2015.
DOI: 10.1007/978-3-319-22174-8 18
330 A. Faz-Hernández and J. López
The AVX2 instruction set extends the capabilities of the processor with 256-bit
registers and instructions that are able to compute up to four simultaneous 64-bit
operations. Thus, it is relevant to study how to benefit from vector instructions
for the acceleration of ECC protocols. In this work, we exhibit implementation
techniques using AVX2 instructions to compute the ECDH protocol based on
Curve25519.
The rest of the document is presented as follows: in Sect. 2, the features
of the AVX2 instruction set are described; in Sect. 3, we detail the prime field
arithmetic for pseudo-Mersenne primes; in Sect. 4, Curve25519 is described along
with the arithmetic of the Montgomery elliptic curve; in Sect. 5, we present
the implementation techniques for the field F2255−19 using AVX2 instructions;
in Sect. 6, the performance results of our implementation are summarized; and
finally in Sect. 7, we present the conclusions of this work.
2 The AVX2 Instruction Set
An interesting trend of micro-architecture design is the Single Instruction Mul-
tiple Data (SIMD) processing; in this setting, processors contain a special bank
of vector registers and associated vector instructions, which are able to compute
an operation on every element stored in the vector register. Since 1997, SIMD
processing has been present on processors; first, starting with the MMX instruc-
tion set [12] which contains 64-bit vectors; and then followed by a number of
Streaming SIMD Extensions (SSE) instruction sets [13] that extended the size
of vector registers to 128 bits.
In 2011, the Advanced Vector eXtensions (AVX) instruction set was released.
It extended the size of vector registers to 256 bits. However, most of the AVX
instructions were focused on the acceleration of floating point arithmetic target-
ing applications for graphics and scientific computations, postponing the integer
arithmetic instructions to later releases. Therefore, in 2013 Intel released the
Haswell micro-architecture with the AVX2 instruction set [14], which contained
plenty of new instructions not only to support integer arithmetic, but also to
compute other versatile operations. For the purpose of this work, we detail the
most relevant AVX2 instructions used, which will be referred by a mnemonic
described in Table 4 (in Appendix A):
– Logic. The Xor and And instructions were extended to operate over every
bit in a 256-bit register. The Align instruction concatenates simultaneously
the lower (and higher) 128-bit parts from two 256-bit registers into a 256-bit
temporary result, then shifts the result right by a multiple of 8 bits, and stores
the lower (and higher) 128 bits in a destination register.
– Integer addition/subtraction (Add/Sub). AVX2 extended the integer addi-
tion and subtraction instructions from SSE and AVX to 256-bit vectors, thus
enabling the computation of four simultaneous 64-bit operations. On AVX2
both addition and subtraction are unable to handle input carry and borrow.
– Integer multiplication (Mul). AVX2 is able to compute four products of 32×32
bits, storing four 64-bit results on a 256-bit vector register.
Fast Implementation of Curve25519 Using AVX2 331
– Variable shifts. Former instruction sets were able to compute logical shifts
Shl/Shr using the same fixed (resolved at compile time) shift displacement
for every word stored in the vector register. Now, AVX2 added the new
Shlv/Shrv variable shift instructions; thus, the displacement used for each
word can be determined at run time. This feature adds more flexibility for the
implementation of asymmetric operations over vector registers.
– Combination. The Blend instruction fills the content of a vector register with
the words from two different register sources chosen through a binary selection
mask register; such mask can be defined either at compile or run time. The
Unpck instruction sets a register with the interleaved words of two registers.
– Permutation. The Perm, Bcast and Perm128 instructions move the words
stored in a 256-bit vector register using a permutation pattern that can be
specified either at compile or at run time.
In terms of performance, it is worth to say that the 4× speedup factor
expected for 64-bit operations using AVX2 can be attained only for some instruc-
tions; in practice, factors like the execution latency of the instruction, the number
of execution units available and the throughput of every instruction reduce the
acceleration.
3 Prime Field Arithmetic Using Pseudo-Mersenne
Primes
This section describes the techniques used for the efficient computation of the
prime field arithmetic using a pseudo-Mersenne prime modulus. First, the rep-
resentation of elements in a prime field is detailed; we then show how to perform
each prime field operation under such a representation.
3.1 Representation of Prime Field Elements
Given an integer n (e.g. the size of machine registers), a commonly used approach











this representation an element is stored using s words of n bits. Multiprecision
representation has been widely used in several multiprecision and cryptographic
libraries [1,17,23].
However, one of the disadvantages of using a multiprecision representation
on an n-bit architecture is that some arithmetic operations impose a sequential
evaluation of integer operations; e.g. in the modular addition, the carry bits must
be propagated from the least to the most significant coefficient, and this behavior
limits the parallelism level of the computations. Since AVX2 has no support for
332 A. Faz-Hernández and J. López
additions with carry, then a representation that minimizes the propagation of
carry bits is required.
The redundant representation meets the criteria, because it relies on the selec-
tion of a real number n′ < n; thus each word will have enough bits to store the
carry bits produced by several modular additions. Thus, a field element a ∈ Fp
in this representation is denoted by the tuple of coefficients A = {as′−1, . . . , a0}






where a ≡ A(n′) mod p for n′ ∈ R and s′ = ⌈ mn′
⌉
. The fact that n′ is a non-
integer number produces that every coefficient ai has an asymmetric amount of
bits βi = n′(i + 1) − n′i for i ∈ [0, s′).
The redundant representation introduces a significant improvement in the
parallel execution of operations, a proof of such an acceleration was reported
in [3], where the author proposed to use n′ = 25.5 for speed up the elliptic curve
arithmetic over F2255−19.
3.2 Prime Field Operations
In order to compute prime field operations, the operands must be converted from
binary to redundant representation and, at the end of the whole computation,
the result must be converted back to binary.
Addition/Subtraction. Given two tuples A and B, the operation R = A±B
can be computed by performing the addition/subtraction coefficient-wise, e.g.
ri = ai ± bi for i ∈ [0, s′). Notice that these operations are totally independent
and admit a parallel processing provided that no overflow occurs.
Multiplication. The computation of a prime field multiplication is usually
processed in two parts: the integer multiplication and then the modular reduc-
tion; however as a pseudo-Mersenne prime (p = 2m − c) is used to define the
finite field then both operations can be computed in the same step. Therefore,




(2ηj,tδj,t)ajbt for i ∈ [0, s′) and t = i − j mod s′, (3)
where the terms δx,y and ηx,y are constants defined as follows:
δx,y =
{
c if x + y ≥ s′,
1 otherwise.
(4)
ηx,y = xn′ + yn′ − (x + y mod s′)n′ mod m. (5)
Since ∀i ∈ [0, s′) βi ≤ n′ is true, this implies that the products in ri on Eq. 3
will not overflow the 2n-bit boundary for some n′ < n. Additionally, whenever
Fast Implementation of Curve25519 Using AVX2 333
δx,y 	= 1 denotes those products that were moved around to the corresponding
power of two because of modular reduction; and ηx,y 	= 1 indicates that some
products must be adjusted as a consequence of n′ not being an integer.
Squaring. Following the same idea for multiplication; in the squaring some
products appear twice and then can be replaced with multiplications by 2, as
denoted by term νx,y = 2 if x 	= y, otherwise νx,y = 1. Given a tuple A, the









(2ηj,tδj,tνj,t)ajat for i ∈ [0, s′), t = i − j mod s′, (6)
Coefficient Reduction. Every time an addition, a subtraction, a multiplication
or a squaring is computed, the result fits on s′ words of n bits, so it is possible
to continue processing more additions and subtractions. However, if the result of
the operation is the input of a multiplication or of a squaring, then a coefficient
reduction must be processed.
The coefficient reduction over a tuple A is an operation that ensures that
every coefficient ai verifies the following condition: |ai| ≤ βi + 1, where βi was
defined in the previous section. This operation keeps the size of coefficients under
a safe range to process another modular operation without overflowing registers.
Given a tuple A, every coefficient is splitted into three parts, namely ai = hi ‖
mi ‖ li, where |li| = βi, |mi| = βi+1 mod s′ , |hi| = n−|li|−|mi| for i ∈ [0, s′), thus
the coefficient reduction is computed as follows: a′0 = l0 + t0, a
′
1 = l1 + m0 + t1,
and a′i = li +mi−1 +hi−2 for i ∈ [2, s′), where the terms t0 and t1 are computed
using the following equation: t12β0 + t0 = c · (hs′−12β0 + ms′−1 + hs′−2).
Multiplicative Inverse. In order to compute the multiplicative inverse of an
element a ∈ F∗p, the following identity is used: a−1 ≡ ap−2 (mod p); part of
this exponentiation can be calculated using an addition chain as shown by Itoh-
Tsujii in [18]. Let x, y ∈ Z+ and x ≤ y, define the term αx = a2x−1 and the
relation αx → αy as αy = (αx)2y−xαy−x. In [3] was given an addition chain for
F2255−19, starting with α5 → α10 → α20 → α40 → α50 → α100 → α200 → α250,




multiplications, 254 squarings and 265 coefficient reductions.
4 Elliptic Curve Diffie-Hellman on Curve25519
4.1 Safe Elliptic Curves
Around 2000, the National Institute of Standards and Technology (NIST) stan-
dardized a set of elliptic curves and associated parameters of finite fields to
provide elliptic curve cryptography schemes for different security levels [20].
The selected elliptic curves were defined over both binary and prime fields. For
the case of prime fields, prime numbers were selected as Generalized Mersenne
334 A. Faz-Hernández and J. López
Table 1. Recent proposals of elliptic curves for three different security levels.
Security Level Elliptic Curve Prime Number (p)
128 NIST-P256 2256 − 2224 + 2192 + 296 − 1
Curve25519 2255 − 19
Curve1174 2251 − 9
192 NIST-P384 2384 − 2128 − 296 + 232 − 1
M-383 2383 − 187
E-382 2382 − 105
256 NIST-P521 2521 − 1
M-511 2511 − 187
E-521 2521 − 1
primes, defined by Solinas in [22]; these primes have the property of allowing
faster modular reduction compared to random selected primes.
Recently, new proposals have appeared for different elliptic curves which
associate a different construction of prime modulus, such as [2,3,10]. The pro-
posals have in common the use of pseudo-Mersenne primes (p = 2m − c), with
m being close to twice the targeted security level, and c as small as possible for
the acceleration of prime field operations.
Nowadays, the study of the prime field implementation not only impacts on the
efficiency of the cryptographic protocols but also on its security. An implementa-
tion of prime field arithmetic could cause leakage of secret information when is not
implemented properly. Recently, Bernstein and Lange started the project called
SafeCurves [7] with the aim to ensure elliptic curve cryptography security through
the design of simple and secure implementations. The SafeCurves project evalu-
ates the fulfillment of some security criteria over several elliptic curves. Table 1
lists some of the recent proposals of elliptic curves, notice that prime numbers
selected are simpler than those from NIST’s recommendation.
4.2 Arithmetic of Curve25519
Focussing on the 128-bit security level, the elliptic curve named Curve25519
has attracted some attention due to its efficient and secure implementation. For
example, it has been proposed for inclusion in the DNS protocol (DNSCurve
project [5]); additionally, the OpenSSH library has chosen Diffie-Hellman over
Curve25519 as the default key-exchange protocol [21].
Curve25519 was proposed by Bernstein [3] for the acceleration of the ellip-
tic curve Diffie-Hellman protocol targeting 128-bit security level. This curve is
defined over the prime field F2255−19 and has the following form:
Curve25519: y2 = x3 + â2x2 + x, â2 = 486662. (7)
This curve belongs to the family of Montgomery elliptic curves, which were
used in [19] to accelerate the elliptic curve method for factoring (ECM).
Fast Implementation of Curve25519 Using AVX2 335
Algorithm 1. Ladder Step Algorithm Tuned for SIMD Processing
Input: XP−Q, ZP−Q, XP , ZP , XQ, ZQ ∈ Fp and the coefficient â2 from Eq. (7).
Output: X2P , Z2P , XP+Q, ZP+Q ∈ Fp.
1: A ← XP + ZP
2: B ← XP − ZP
3: DA ← A × D
4: t1 ← DA + CB
5: t1 ← t21
6: XP+Q ← t1 × ZP−Q
7: A′ ← A2
8: A′x ← 1
4
(â2 + 2) · A′
9: E ← A′ − B′
10: X2P ← A′ × B′
C ← XQ + ZQ [add]
D ← XQ − ZQ [sub]
CB ← C × B [mul]
t0 ← DA − CB [add/sub]
t0 ← t20 [sqr]
ZP+Q ← t0 × XP−Q [mul]
B′ ← B2 [sqr]
B′y ← 1
4
(â2 − 2) · B′ [mul-cst]
F ← A′x − B′y [sub]
Z2P ← E × F [mul]
11: return X2P , Z2P , XP+Q, ZP+Q.
In the same paper, Montgomery devised an algorithm to efficiently compute the
x-coordinate of kP using only the x-coordinate of the point P ; the technique
uses the projective representation of points on the curve.
The computation of point multiplication using Montgomery Ladder algo-
rithm is shown in Algorithm 6 in Appendix B.3. For each bit of the scalar, the
procedure updates the values of two points P and Q through the ladder step
algorithm (Algorithm 1), which computes a point doubling of P and a differen-
tial point addition of P and Q. The results are conditionally stored in temporary
registers depending on a bit of the scalar k; the conditional update must be pro-
tected to avoid leaking the bits of k using either arithmetic or logic operations.
Finally, the affine version of the x-coordinate of Q is recovered.
5 The AVX2 Implementation
This section starts discussing some performance penalties encountered in AVX2,
then we describe some ways of getting a better performance through parallel
computations in the Montgomery ladder algorithm, and finally we will show the
techniques used to implement the prime field F2255−19 with AVX2 instructions.
5.1 Performance Challenges Using AVX2
Before proceeding to the implementation of prime field operations, we detail a
relevant issue on the implementation of modular multiplication. Recall that the
AVX2 instruction Mul is able to process four integer multiplications of 32 × 32
bits, so in order to compute some products of the modular multiplication the
natural approach is to pack four consecutive products. However, the way that the
products are packed is critical in terms of performance; as an illustrative exam-
ple, we present two cases that compute four products required in the modular
multiplication, under the assumption that there are four registers initialized with
336 A. Faz-Hernández and J. López
the following values: R0 = [a3, a2, a1, a0], R1 = [b3, b2, b1, b0], R2 = [b7, b6, b5, b4]
and R3 = [,, b9, b8].
Example 1. To calculate the vector [a0b3, a0b2, a0b1, a0b0], only two instructions
are required: first, we fill a register with a0 using the Bcast instruction and
then we apply the Mul instruction with the register R1.
Example 2. Computing [a3b0, a3b9, a3b8, a3b7] requires the following set of oper-
ations:
X ← Bcast(R0) [a3, a3, a3, a3]
Y ← Blend(R1, R2, 1100) [b7, b6, b1, b0]
Y ← Perm(Y ) [b0, b6, b1, b7]
U ← Perm(R3) [, b9, b8,]
Y ← Blend(Y,U, 0110) [b0, b9, b8, b7]
Z ← Mul(X,Y ) [a3b0, a3b9, a3b8, a3b7]
In the second example, the computation takes 5 instructions just to place
the operands in the right position to be multiplied, while in the first example
it only takes 1 instruction. Products arranged as in the second example appear
more frequently in the computation of the modular multiplication; although we
could compute them using permutation instructions, the use of these instructions
impacts negatively on the performance of the operations.
The high latency of permutation instructions is the result of the architectural
design of Haswell micro-architecture. The previous instruction sets (SSE and
AVX) operate with an execution network that computes vector instructions on
128-bit registers. On the other hand, Haswell contains an additional network
of 128-bit registers to represent the higher part of a 256-bit register, so both
networks compute in parallel most of the AVX2 instructions. Consequently, any
data transfer between such networks will incur a performance penalty.
5.2 The SIMD Montgomery Ladder
Since an efficient implementation of the prime field will improve the elliptic curve
arithmetic; so, we also focus on the flow of operations in the curve level. Ana-
lyzing the ladder step algorithm, we noticed that there are several opportunities
to compute two prime field operations in parallel without dependency between
the elements involved in the operation.
The general idea is simple: it is possible to compute two prime field opera-
tions by packing the operands in the lower and higher parts of a 256-bit vector
register, thus the arithmetic operations will be computed on both parts at the
same time. At this point, some natural questions are raised: why do we not use
a 4-way version in the evaluation of the operations? The answer comes from the
evaluation of Montgomery ladder step algorithm, which processes a nearly sym-
metrical computation over two sets of prime field elements; this does not restrict
the use of, for example, a 4-way version applied to computations with four inde-
pendent operations in other scenarios. A second natural question is: using 2-way
Fast Implementation of Curve25519 Using AVX2 337
prime field operations, are the benefits brought by the use of vector instructions
lost? Working in this scenario with 256-bit registers, each prime field operation
still takes advantage from the use of 128-bit registers.
A parallel computation of the ladder step algorithm was suggested in [11]: one
of the parallel units will compute the point doubling while the other unit will
produce the differential point addition. Another interesting idea is scheduling
operations in a SIMD fashion, which was demonstrated to be efficient on the imple-
mentation presented in [9]; such an implementation takes advantage of the use of
the NEON instructions to compute two finite field operations independently.
In this work, we go further by exploiting the parallelism at two levels: first
at high level, the SIMD execution of prime field operations; and at low level,
the computations inside of the prime field operation can also benefit from SIMD
execution. The right-hand side of Algorithm 1 lists the operations computed
in each row; as one may notice, the same operation is applied to two different
data sets exhibiting exactly the spirit behind the SIMD paradigm. The way that
the ladder step algorithm was presented in Algorithm 1 gives an insight of the
register allocation and of the scheduling of operations.
5.3 Implementation of F2255−19
For the implementation of F2255−19 the most efficient approach is to set n′ = 51
on a 64-bit architecture and n′ = 25.5 for a 32-bit architecture. As was mentioned
before, Bernstein in [3] encouraged the use of n′ = 25.5; such a implementation
used the floating point registers to emulate integer arithmetic operations because
in that scenario a double precision floating point register is able to store a 53-bit
number without loss of information. In our scenario, we also choose n′ = 25.5 as
working on a 32-bit architecture, because the wider multiplier available in AVX2
is a 32-bit multiplier, even though other fundamental integer operations support
64-bit operands.
Summarizing the parameters described in Sect. 3.1, our implementation of
F2255−19 sets n′ = 25.5, s′ = 10, n = 32 for integer multiplication and n = 64 for
integer, shift and logic operations.
Interleaving Tuples. As it was shown in the previous section, the ladder step
function computes two field operations at each time. We denote by 〈A,B〉 the
interleaved tuples A and B which represents five 256-bit registers, such that
〈A,B〉i = [a2i+1, a2i, b2i+1, b2i] for i ∈ [0, 5). Thus, two coefficients from tuple
A will be stored in the higher 128-bit register and two coefficients from tuple B
will be stored in the lower 128-bit register.
Addition/Subtraction. The addition and subtraction operations require only
five addition (Add) or subtraction (Sub) instructions, respectively.
Multiplication. Algorithm 2 shows the computation of two interleaved tuples
〈A,C〉 and 〈B,D〉. Values on the right hand side show the content stored in the
destination register. The lines 2 to 5 compute the multiplication by the ηx,y term
338 A. Faz-Hernández and J. López
using variable shift instructions. In the main loop (lines 6–18), a temporary regis-
ter U contains the first and third words of 〈A,C〉i in the lower and higher 128-bit
parts of the register, respectively; this task can be efficiently performed using a
Shuf instruction. Once that U was computed, in the inner loop U will be multi-
plied by 〈B,D〉j and the result will be accumulated in Zi+j . Analogously to U , the
products resulting from the V register will be accumulated in Zi+j+1. Register W
contains some products that must be accumulated in either Zi or Zi+5, thereby
a Blend instruction masks the appropriate products to be added. Finally, the
products for which δx,y = 19 will be multiplied using shift instructions.
Algorithm 2. Instruction scheduling to compute a modular multiplication in
F2255−19 using AVX2 instructions.
Input: Two interleaved tuples 〈A,C〉 and 〈B,D〉.
Output: An interleaved tuple 〈E,F〉 such that E = A × B and F = C × D.
1: Zi ← 0 for i ∈ [0, 10)
2: for i ← 0 to 4 do
3: 〈B′,D′〉i ← Align(〈B,D〉i+1 mod 5, 〈B,D〉i) [b2i+2, b2i+1, d2i+2, d2i+1]
4: 〈B′,D′〉i ← Shlv(〈B′,D′〉i, [0, 1, 0, 1]) [b2i+2, 2b2i+1, d2i+2, 2d2i+1]
5: end for
6: for i ← 0 to 4 do
7: U ← Shuf(〈A,C〉i, 0) [a2i, a2i, c2i, c2i]
8: for j ← 0 to 4 do
9: Zi+j ← Add(Zi+j ,Mul(U, 〈B,D〉j)) [a2ibj+1, a2ibj , c2idj+1, c2idj ]
10: end for
11: V ← Shuf(〈A,C〉i, 1) [a2i+1, a2i+1, c2i+1, c2i+1]
12: for j ← 0 to 3 do
13: Zi+j+1 ← Add(Zi+j+1,Mul(V, 〈B′,D′〉j))
[a2i+1b2j+2, 2a2i+1b2j+1, c2i+1d2j+2, 2c2i+1d2j+1]
14: end for
15: W ← Mul(V, 〈B′,D′〉4) [a2i+1b0, 2a2i+1b9, c2i+1d0, 2c2i+1d9]
16: Zi ← Add(Zi,Blend(W, [0, 0, 0, 0], 0101)) [a2i+1b0, 0, c2i+1d0, 0]
17: Zi+5 ← Add(Zi+5,Blend(W, [0, 0, 0, 0], 1010)) [0, 2a2i+1b9, 0, 2c2i+1d9]
18: end for
19: for i ← 0 to 4 do
20: 19Zi+5 ← Add(Add(Shl(Zi+5, 4), Shl(Zi+5, 1)), Zi+5)
21: 〈E,F〉i ← Add(Zi, 19Zi+5)
22: end for
23: return 〈E,F〉
Squaring/Coefficient Reduction. The description of the instruction schedul-
ing for these operations can be found in Appendices B.1 and B.2, respectively.
Conditional Swapping. The Montgomery point multiplication requires the
conditional swapping of register values depending on the bits of the scalar; usu-
ally, the scalar represents a secret key, thereby this operation must be computed
without branches and must run in constant time. In order to implement these
requirements, the conditional swapping is computed using logic operations as
shown in Algorithm 3.
Fast Implementation of Curve25519 Using AVX2 339
Algorithm 3. Conditional Swapping.
Input: b ∈ {0, 1} a conditional bit, X and Y two registers to be swapped.
Output: X ← Y and Y ← X if b = 1, otherwise remain unchanged.
1: M ← Bcast(−b)
2: T ← And(Xor(X, Y ), M)
3: X ′ ← Xor(X, T )
4: Y ′ ← Xor(Y, T )
5: return X ′, Y ′
6 Performance Results
Benchmarking was performed on a Haswell processor (Core i7-4770) at 3.4 GHz,
where the Intel Turbo Boost and Intel Hyper Threading technologies were dis-
abled. Our code was compiled using the GNU C Compiler v4.9.0 and timings
were measured as the average time of 106 and 104 computations for prime field
operations and point multiplication, respectively.
Prime Field Operations. Table 2 shows the performance of the field arith-
metic operations using AVX2 instructions. The first row exhibits the clock cycles
required to compute one single arithmetic operation over a tuple A; the second
row represents the clock cycles used to compute two simultaneous arithmetic
operations over interleaved tuples 〈A,B〉; and the last row shows the speedup
factor obtained by the 2-way against the single implementation.
The acceleration of the 2-way operations was achieved by minimizing the
use of permutation instructions and working with interleaved tuples. Recently,
an algorithm to compute a modular multiplication on the field F2521−1 using a
redundant representation was presented in [16]. This algorithm requires 12s
′(s′ +
1) word multiplications and 2(s′2 − 1) word additions. The paper also shows
a formulation for the field F2255−19; and following this idea, we implemented
a 2-way multiplier whose performance was 117 clock cycles, and this result is
48 % slower than our schoolbook 2-way multiplier that takes 79 clock cycles. The
main issue observed was an overhead produced by arranging the vectors to be
multiplied, as permuting words between registers is costly.
Elliptic Curve Diffie-Hellman. In order to illustrate the performance of
our software implementation of Elliptic Curve Diffie-Hellman protocol using
Curve25519, we follow the guidelines presented in [4] to implement the following
algorithms:
– Key Generation. Let G be the generator point of the Curve25519 where
x(G) = 9, the key generation algorithm computes a public key x(kG) given a
secret key k ∈ [0, 2256).
– Shared Secret. Given the x-coordinate of the public key, x(P ), and a secret
key k, the shared secret algorithm computes the x-coordinate of kP .
Table 3 shows the timings obtained by our implementation and compares the
performance against the state-of-the-art implementations. For the key generation
340 A. Faz-Hernández and J. López
Table 2. Cost in clock cycles to compute one prime field operation using single imple-
mentation, and two prime field operations using the 2-way implementation.
Addition Multiplication Squaring Coefficient Inversion
Subtraction Reduction
Single (1 op) 4 57 47 26 16,500
2-way (2 ops) 8 79 57 33 21,000
Speedup Factor 1× 1.44× 1.64× 1.57× 1.57×
Table 3. Timings obtained for the computation of the Elliptic Curve Diffie-Hellman
protocol. The entries represent 103 clock cycles. The timings in the rows with (α) were
measured on our Core i7-4770 processor and the rest of the entries were taken from
the corresponding references.
Implementation Processor Key Generation Shared Secret
NaCl [8] Core i7-4770 (α) 261.0 261.0
amd64-51 [6] Core i7-4770 172.6 163.2
amd64-51 [6] Xeon E3-1275 V3 169.9 161.6
Our work Core i7-4770 (α) 156.5 156.5
algorithm the implementations listed in Table 3 use the same routine to compute
both key generation and shared secret.
As it can be seen from Table 3, the performance achieved in both the key
generation and the shared secret computations brings a moderate speedup com-
pared against the implementations reported in the eBACS website [6]. Notice
that non-vector implementations found in the literature take advantage of the
native 64-bit multiplier that takes 3 clock cycles; whereas, for AVX2, the same
computation is performed using a 32-bit multiplier that takes 5 clock cycles.
Additionally, the high latency of some instructions guides the optimization to
use a more reduced set of instructions.
On a side note, it is to be noted that, in the key generation algorithm, since
x(G) = 9, the modular multiplication on line 6 of Algorithm 1 can be replaced by
only a few shift instructions. In our implementation, this gives a 13.5 % speedup,
computing the key generation step in only 135.5 × 103 clock cycles.
7 Conclusions
Applying vector instructions to an implementation requires a careful knowledge
of the target architecture, thus selecting the best scheduling of instructions is not
a straightforward task because it demands a meticulous study of the instruction
set and of the architectural capabilities. On the presence of architectural issues
that limit the performance, we found a way to overcome some of them and
Fast Implementation of Curve25519 Using AVX2 341
produced an efficient implementation as fast and secure as the best optimized
implementations for 64-bit architectures.
Our main contribution is a fast implementation of the elliptic curve Diffie-
Hellman protocol based on Curve25519 with a minor improvement over the
state-of-the-art implementations. This performance was mainly achieved due to
the efficient implementation of F2255−19 using AVX2 vector instructions.
In this work, we expose the versatility of the AVX2 instructions, where the
SIMD processing was applied at two levels: at the higher level, we showed how
to schedule arithmetic operations in the Montgomery ladder step algorithm to
be computed in parallel; and at the lower level, the computation of prime field
operations also benefited from vector instructions.
We remark that the algorithms used to implement prime field arithmetic
using vector instructions can also be extended for other prime fields that use
pseudo-Mersenne primes. The applicability of these techniques to other elliptic
curve models is left as a future work. Also, it would be interesting to know how
the upcoming architectures will impact on the performance of AVX2 instructions.
In particular, the new Intel Skylake micro-architecture that has support for 512-
bit registers should promote a high applicability of the results of this work.
Acknowledgments. The authors would like to thank the anonymous reviewers for
their helpful suggestions and comments. Additionally, they would like to show their
gratitude to Jérémie Detrey for his valuable comments on an earlier version of the
manuscript.
Table 4. Latency and reciprocal throughput of some AVX2 instructions.
Type Mnemonic Assembler Latency Reciprocal
Instructions (cycles) Throughput (cycles/op)
Arithmetic Add/Sub VPADDQ/VPSUBQ 1 0.5
Mul VPMULDQ 5 1
Logic Shl/Shr VPSLLQ/VPSRLQ 1 1
Shlv/Shrv VPSLLVQ/VPSRLVQ 2 2
Align VPALIGNR 1 1
And/Xor VPAND/VPXOR 1 0.33
Combination Blend VPBLENDD 1 0.33
VPBLENDVB 2 2
Shuf VSHUFPD 1 1
Unpck VPUNPCKHQDQ 1 1
VPUNPCKLQDQ 1 1
Permutation Perm VPERMQ 3 1
Bcast VPBROADCASTQ 5 0.5
Perm128 VPERM2I128 3 1
342 A. Faz-Hernández and J. López
A Relevant AVX2 Instructions
A list of the most relevant instructions used in this work is presented. For clar-
ity, instructions were grouped according to their functionality. Table 4 shows in
the second column a mnemonic used in this document; in the third column is
described the specific assembler name of the instruction, and the last columns
show the latency and the reciprocal throughput of every instruction, the entries
were taken from the Agner Fog’s measurements published in [15].
Algorithm 4. Instruction scheduling to compute a modular squaring in F2255−19
using AVX2 instructions.
Input: An interleaved tuple 〈A,B〉.
Output: An interleaved tuple 〈E,F〉 such that E = A2 and F = B2.
1: for i ← 0 to 4 do
2: U2i ← 〈A,B〉i [a2i+1, a2i, b2i+1, b2i]
3: U2i+1 ← Align(〈A,B〉i+1 mod 5, 〈A,B〉i) [a2i+2, a2i+1, b2i+2, b2i+1]
4: U2i+1 ← Shlv(U2i+1, [0, 1, 0, 1]) [a2i+2, 2a2i+1, b2i+2, 2b2i+1]
5: V2i ← Shuf(〈A,B〉i, 0) [a2i, a2i, b2i, b2i]
6: V2i+1 ← Shuf(〈A,B〉i, 1) [a2i+1, a2i+1, b2i+1, b2i+1]
7: end for
8: for i ← 0 to 4 do
9: T ← Mul(Ui, Vi) [ai+1ai, aiai, bi+1bi, bibi]
10: Zi ← Blend(T, [0, 0, 0, 0], 1010) [0, aiai, 0, bibi]
11: W ← Blend(T, [0, 0, 0, 0], 0101) [ai+1ai, 0, bi+1bi, 0]
12: for j ← 1 to i do
13: t ← i − j mod 10
14: W ← Add(W,Mul(Uj , Vt)) [aj+1at, ajat, bj+1bt, bjbt]
15: end for
16: Zi ← Add(Zi,Shl(W, 1))
17: S ← Mul(Ui+5, Vi+5) [ai+6ai+5, ai+5ai+5, bi+6bi+5, bi+5bi+5]
18: Zi+5 ← Blend(S, [0, 0, 0, 0], 1010) [0, ai+5ai+5, 0, bi+5bi+5]
19: X ← [0, 0, 0, 0]
20: for j ← i + 1 to 4 do
21: t ← i − j mod 10
22: X ← Add(X,Mul(Uj , Vt)) [aj+1at, ajat, bj+1bt, bjbt]
23: end for
24: Zi+5 ← Add(Zi+5,Shl(X, 1))
25: end for
26: for i ← 0 to 4 do
27: 19Zi+5 ← Add(Add(Shl(Zi+5, 4), Shl(Zi+5, 1)), Zi+5)
28: 〈E,F〉i ← Add(Zi, 19Zi+5)
29: end for
30: return 〈E,F〉
Fast Implementation of Curve25519 Using AVX2 343
B Algorithms
B.1 Implementation of Modular Squaring Using AVX2
To compute the modular squaring we follow a similar approach like in the case of
modular multiplication. Algorithm 4 shows the scheduling of instructions used
to compute the modular squaring of an interleaved tuple 〈A,B〉. The products
ax,y such that νx,y = 2 are computed in the inner loops (lines 12 to 15 and 20
to 23) and once that these products were accumulated, they are multiplied by 2
using shift instructions. At the end, the lines from 26 to 29 compute the modular
reduction.
B.2 Implementation of Coefficient Reduction Using AVX2
The coefficient reduction is processed coefficient-wise. We split each coefficient
into three parts ai = hi ‖ mi ‖ li and compute the process described in Sect. 3.2.
Simultaneously, each mi (medium coefficient) is added to the correspondent li+1
(low coefficient) and to the hi−1 (high coefficient). For those coefficients that
need to be reduced modulo p, we compute the multiplication by c using just
shift instructions. After the coefficient reduction is processed, the size of each
coefficient in the updated tuple will have at most βi + 1 bits.
Algorithm 5. Instruction scheduling for computing a coefficient reduction in
F2255−19 using AVX2 instructions.
Input: An interleaved tuple 〈A,B〉.
Output: An updated interleaved tuple 〈A,B〉 such that |ai| ≤ βi +1 and |bi| ≤ βi +1
for i ∈ [0, 10).
1: for i ← 0 to 4 do
2: Li ← And(〈A,B〉i, [2β2i+1 − 1, 2β2i − 1, 2β2i+1 − 1, 2β2i − 1])
3: Mi ← Srlv(〈A,B〉i, [β2i+1, β2i, β2i+1, β2i])
4: Mi ← And(Mi, [2β2i+2 − 1, 2β2i+1 − 1, 2β2i+2 − 1, 2β2i+1 − 1])
5: Hi ← Srlv(〈A,B〉i, [β2i+1 + β2i+2, β2i + β2i+1, β2i+1 + β2i+2, β2i + β2i+1])
6: end for
7: for i ← 0 to 4 do
8: M ′i ← Align(Mi, Mi−1 mod 5)
9: end for
10: H4 ← Shrv(〈A,B〉8, [β8 + β9, β8, β8 + β9, β8])
11: U ← Add(H4,Shr(H4, 64))
12: 19U ← Add(Add(Shr(U, 4),Shr(U, 1)), U)
13: T ← And(19U, [0, 2β0 − 1, 0, 2β0 − 1])
14: S ← Shr(19U, [0, β0, 0, β0])
15: H4 ← Upck(T, S)
16: for i ← 0 to 4 do
17: 〈A,B〉i ← Add(Add(Li, M ′i), Hi−1 mod 5)
18: end for
19: return 〈A,B〉
344 A. Faz-Hernández and J. López
Algorithm 6. Point Multiplication using Montgomery Ladder.
Input: k ∈ [0, 2t) and x(P ) ∈ Fp, be the x-coordinate of an elliptic curve point P .
Output: x(Q), the x-coordinate of Q = kP .
1: Let k = (0, kt−1, . . . , k0)2
2: XP−Q ← x(P )
3: XP ← x(P ), ZP ← 1
4: XQ ← 1, ZQ ← 0
5: for i ← t − 1 to 0 do
6: b ← ki ⊕ ki+1
7: XQ, XP ← CondSwap(b, XQ, XP )
8: ZQ, ZP ← CondSwap(b, ZQ, ZP )
9: XQ, ZQ, XP , ZP ← Ladder(XP−Q, XQ, ZQ, XP , ZP )
10: end for
11: return x(Q) ← XQ(ZQ)−1
B.3 Point Multiplication Using Montgomery Ladder
Algorithm 6 shows the computation of the Montgomery point multiplication to
calculate the x-coordinate of kP given the x-coordinate of P and an integer
scalar k. This algorithm also requires the ladder step presented in Algorithm 1.
For its use in the computation of the elliptic curve Diffie-Hellman protocol
using the Curve25519, the document [4] describes an encoding for the secret key
when is given as a string of bytes. Then, the description of Algorithm 6 assumes
that the secret key was already encoded.
References
1. Aranha, D.F., Gouvêa, C.P.L.: RELIC is an Efficient LIbrary for Cryptography.
http://code.google.com/p/relic-toolkit/
2. Aranha, D.F., Barreto, P.S.L.M., Pereira, G.C.C.F., Ricardini, J.E.: A note on
high-security general-purpose elliptic curves. Cryptology ePrint Archive, Report
2013/647 (2013). http://eprint.iacr.org/
3. Bernstein, D.J.: Curve25519: new Diffie-Hellman speed records. In: Yung,
M., Dodis, Y., Kiayias, A., Malkin, T. (eds.) PKC 2006. LNCS, vol. 3958,
pp. 207–228. Springer, Heidelberg (2006)
4. Bernstein, D.J.: Cryptography in NaCl, March 2009. http://cr.yp.to/highspeed/
naclcrypto-20090310.pdf
5. Bernstein, D.J.: DNSCurve: usable security for DNS, June 2009. http://dnscurve.
org
6. Bernstein, D.J., Lange, T.: eBACS: ECRYPT benchmarking of cryptographic sys-
tems, March 2015. Accessed on 20 March 2015 http://bench.cr.yp.to/supercop.
html
7. Bernstein, D.J., Lange, T.: SafeCurves: choosing safe curves for elliptic-curve cryp-
tography (2015). Accessed 20 March 2015 http://safecurves.cr.yp.to
8. Bernstein, D.J., Lange, T., Schwabe, P.: NaCl: Networking and Cryptography
library, October 2013. http://nacl.cr.yp.to/
Fast Implementation of Curve25519 Using AVX2 345
9. Bernstein, D.J., Schwabe, P.: NEON Crypto. In: Prouff, E., Schaumont, P.
(eds.) CHES 2012. LNCS, vol. 7428, pp. 320–339. Springer, Heidelberg (2012).
http://dx.doi.org/10.1007/978-3-642-33027-8 19
10. Bos, J.W., Costello, C., Longa, P., Naehrig, M.: Selecting Elliptic Curves for Cryp-
tography: An Efficiency and Security Analysis. Cryptology ePrint Archive, Report
2014/130 (2014). http://eprint.iacr.org/
11. Cohen, H., Frey, G., Avanzi, R., Doche, C., Lange, T., Nguyen, K., Vercauteren, F.:
Handbook of Elliptic and Hyperelliptic Curve Cryptography, (2nd edn). Chapman
& Hall/CRC (2012)
12. Corporation, I.: Intel Pentium processor with MMX technology documentation,
January 2008. http://www.intel.com/design/archives/Processors/mmx/
13. Corporation, I.: Define SSE2, SSE3 and SSE4, January 2009. http://www.intel.
com/support/processors/sb/CS-030123.htm
14. Corporation, I.: Intel Advanced Vector Extensions Programming Reference, June
2011. https://software.intel.com/sites/default/files/m/f/7/c/36945
15. Fog, A.: Instruction tables: Lists of instruction latencies, throughputs and micro-
operation breakdowns for Intel, AMD and VIA CPUs, December 2014
16. Granger, R., Scott, M.: Faster ECC over F2521−1. Cryptology ePrint Archive,
Report 2014/852 (2014). http://eprint.iacr.org/
17. Granlund, T., the GMP development team: GNU MP: The GNU Multiple Precision
Arithmetic Library, (5.0.5 edn) (2012). http://gmplib.org/
18. Itoh, T., Tsujii, S.: A fast algorithm for computing multiplicative inverses
in GF(2m) using normal bases. Inf. Comput. 78(3), 171–177 (1988).
http://dx.doi.org/10.1016/0890-5401(88)90024–7
19. Montgomery, P.L.: Speeding the pollard and elliptic curve methods of factorization.
Math. Comput. 48(177), 243–264 (1987). http://dx.doi.org/10.2307/2007888
20. National Institute of Standards and Technology: Digital Signature Standard
(DSS). FIPS Publication 186, may 1994. http://www.bibsonomy.org/bibtex/
2a98c67565fa98cc7c90d7d622c1ad252/dret
21. Shell, O.S.: OpenSSH, January 2014. http://www.openssh.com/txt/release-6.5
22. Solinas, J.A.: Generalized Mersenne Numbers. Technical report,Center of Applied
Cryptographic Research (CACR) (1999)
23. The OpenSSL Project: OpenSSL: The Open Source toolkit for SSL/TLS, April
2003. www.openssl.org
