Indian Statistical Institute

ISI Digital Commons
Journal Articles

Scholarly Publications

9-1-2020

Improved SIMD implementation of Poly1305
Sreyosi Bhattacharyya
Indian Statistical Institute, Kolkata

Palash Sarkar
Indian Statistical Institute, Kolkata

Follow this and additional works at: https://digitalcommons.isical.ac.in/journal-articles

Recommended Citation
Bhattacharyya, Sreyosi and Sarkar, Palash, "Improved SIMD implementation of Poly1305" (2020). Journal
Articles. 146.
https://digitalcommons.isical.ac.in/journal-articles/146

This Research Article is brought to you for free and open access by the Scholarly Publications at ISI Digital
Commons. It has been accepted for inclusion in Journal Articles by an authorized administrator of ISI Digital
Commons. For more information, please contact ksatpathy@gmail.com.

IET Information Security
Research Article

Improved SIMD implementation of Poly1305

ISSN 1751-8709
Received on 20th November 2019
Revised 6th March 2020
Accepted on 24th March 2020
E-First on 27th April 2020
doi: 10.1049/iet-ifs.2019.0605
www.ietdl.org

Sreyosi Bhattacharyya1, Palash Sarkar1
1Applied

Statistics Unit, Indian Statistical Institute, 203, B.T. Road, Kolkata 700108, India
E-mail: palash@isical.ac.in

Abstract: Poly1305 is a polynomial hash function designed by Bernstein in 2005. Presently, it is part of several major platforms,
including the Transport Layer Security protocol. Vectorised implementation of Poly1305 was proposed by Goll and Gueron in
2015. The authors provide some simple algorithmic improvements to the Goll–Gueron vectorisation strategy. Implementation of
the modified strategy on modern Intel processors shows marked improvements in speed for short messages.

1 Introduction
Confidentiality and integrity of data flowing through the Internet
are of paramount importance. The Transport Layer Security (TLS)
protocol is used to provide security of Internet communications.
TLS includes a variety of algorithms for different cryptographic
functionalities. Among the algorithms which are part of TLS is
Poly1305 [1] (in combination with ChaCha [2]). Apart from TLS,
Poly1305 is part of various other cryptographic libraries: it has
been standardised by the IETF and is part of the NaCl [3] library
(in combination with Salsa20 [4]). Further, Google uses ChaChaPoly1305 to secure communication between Chrome browsers on
Android and Google websites. The Wikipedia page (https://
en.wikipedia.org/wiki/Poly1305, accessed on 12 February 2020)
for Poly1305 provides further details about the real-world
deployment of Poly1305. Given the widespread use of Poly1305,
efficient software implementation of the algorithm is an important
practical issue.
Modern processors are moving towards providing vector
instructions. These instructions allow single-instruction, multipledata (SIMD) implementations of a variety of algorithms which
often provide substantial speed gains over sequential
implementations. The importance of vectorisation in modern
processors has been highlighted in a post by Bernstein (https://
groups.google.com/a/list.nist.gov/forum/
#!searchin/pqc-forum/
vectorization%7Csort:date/pqc-forum/mmsH4k3j_1g/
JfzP1EBuBQAJ, accessed on 12 February 2020).
The present work considers the issue of SIMD implementation
of Poly1305 using vector instructions. To the best of our
knowledge, the first such implementation of Poly1305 was done by
Goll and Gueron [5]. They showed a way to divide the Poly1305
computation into d independent and parallel computation streams.
Concrete implementations were provided in [5] for d = 4 and
d = 8 on modern Intel processors.
Suppose d = 4. The top-level view of the Goll–Gueron
algorithm is as follows. The message is formatted into blocks. The
algorithm processes four blocks at a time. If the number of blocks
in the message is a multiple of four, then the algorithm uniformly
processes all the blocks. On the other hand, if the number of blocks
is not a multiple of four, then at the end, the parallelism breaks
down, and the tail of the message consisting of one to three blocks
has to be processed separately.
We provide a simple idea to improve the Goll–Gueron
algorithm. The Poly1305 algorithm essentially computes a
polynomial over a finite field whose coefficients are the blocks of
the message. Prepending the message with some zero blocks (i.e.
blocks corresponding to the zero element of the field), the output of
the Poly1305 algorithm remains unchanged. We take advantage of
IET Inf. Secur., 2020, Vol. 14 Iss. 5, pp. 521-530
© The Institution of Engineering and Technology 2020

this feature by prepending the message with one to three zero
blocks so that overall the number of blocks in the message is a
multiple of four. Then the processing of the blocks can be done
four at a time in a uniform manner.
The above strategy opens up a further opportunity for
improvement. Suppose that the number of blocks in the message is
1 modulo 4. Then three zero blocks would need to be prepended.
Consider the initial 4-way multiplication. This consists of three
zero blocks and one message block. So, applying a general 4-way
multiplication routine, in this case, leads to many multiplications
by zeros. This is wasteful. We describe a new method to perform
such an initial multiplication, which is much faster than a general
4-way multiplication. Similarly, we extend this to the case where
the number of blocks in the message is 2 modulo 4. In the case
where the number of blocks is 3 modulo 4, there is no advantage in
trying to reduce the number of multiplications using a new initial
multiplication algorithm.
We have modified the code accompanying the Goll–Gueron
paper to obtain an implementation of the 4-way vectorisation
strategy outlined above. For message lengths up to 4000 bytes, this
leads to significant speed improvements for messages whose
number of blocks are not multiples of four. Comparative speed
measurements of the new algorithm and the Goll–Gueron
algorithm have been made on the Haswell, Kaby Lake and Skylake
processors.
Goll and Gueron [5] also describe an 8-way vectorisation
strategy. Our idea of simplifying the parallelism and improving the
initial multiplication extends to the 8-way vectorisation. More
generally, our algorithmic improvement over the Goll–Gueron
strategy applies to all processors which support vector instructions.

2 Description of Poly1305 hash function
Let p = 2130 − 5 and F p be the finite field of p elements. The
Poly1305 hash function maps a message into an element of F p. In
the original description [1] of Poly1305, the message is a
sequence of bytes. The later work [5] considered the message to be
a sequence of bits; if the number of bits in the message is a
multiple of 8, then the description in [5] coincides with the original
description in [1]. Goll and Gueron [5] provide a description of
Poly1305, where a message is a sequence of bits.
Suppose, a message M consists of L ≥ 0 bits. If L = 0, define ℓ
to be 0; otherwise, let ℓ ≥ 1 be such that L = 128(ℓ − 1) + r,
where 1 ≤ r ≤ 128. Write the message as a concatenation of ℓ
strings, i.e., M = M0 ∥ ⋯ ∥ Mℓ − 1 such that M0, …, Mℓ − 2 each has
length 128 bits and Mℓ − 1 has length r bits. For i = 0, …, ℓ − 2,
define Ci = Mi ∥ 1, and Cℓ − 1 = Mℓ − 1 ∥ 10128 − r. This ensures that
521

(C0, …, Cℓ − 1) into d ≥ 2 subsequence and apply Horner's rule to
each of the subsequence. This allows alternatively performing d
simultaneous multiplications and d simultaneous additions. Such a
strategy has been called d-decimated Horner evaluation [6]. Goll
and Gueron [5] described SIMD implementations of Poly1305
based on d-decimated Horner evaluation. They considered two
values of d, namely, d = 4 and d = 8 leading to 4-way and 8-way
SIMD implementations, respectively. We provide details for d = 4,
the case of d = 8 being similar.
Let ρ = ℓ mod 4 and ℓ′ = (ℓ − ρ)/4. The computation of
Poly1305R(C0, …, Cℓ − 1) can be done in the following manner. Let
Fig. 1 Algorithm 1: Structure of Goll–Gueron 4-way vectorisation of
Poly1305 computation. Refer to (7) for the definition of the vector
quantities

the length of Ci is 129 bits for i = 0, …, ℓ − 1. Let format(M) be
the map from a message M to (C0, …, Cℓ − 1).
From the above description, Ci is the binary representation
(written with the least significant bit on the left) of an integer
which is less than 2129. For the convenience of notation, we will
identify the binary string Ci with the integer it represents. Note that
the Ci’s cannot take all the values in the set {0, …, 2129 − 1}; in
particular, none of the Ci’s can be zero.
The Poly1305 hash function uses a key R which is an element
of F p. The specification of Poly1305 requires some of the bits R to
be set to zero. This was done for efficiency purposes. For the
SIMD implementation that we consider, the setting of certain bits
of R to be zero does not either help or hamper the efficiency. So,
we skip the details of the exact form of R which are given in [1].
The Poly1305 hash function is defined as follows. Given a
message M consisting of L ≥ 0 bits and a key R ∈ F p, the output of
Poly1305R(M) is defined as follows:
Poly1305R(M)
= C0Rℓ + C1Rℓ − 1 + ⋯ + Cℓ − 2R2 + Cℓ − 1R

(1)

where
(C0, …, Cℓ − 1) = format(M).
Since
the
map
format: M → (C0, …, Cℓ − 1) is injective, by an abuse of notation,
instead
of
writing
Poly1305R(M),
we
will
write
Poly1305R(C0, …, Cℓ − 1). Note that if M is the empty string, i.e.
L = 0, then ℓ = 0 and so Poly1305R(M) is 0 (the zero element of
F p).

3 Goll–Gueron SIMD implementation

= C0Rℓ − 1 + ⋯ + Cℓ − 2R + Cℓ − 1 .

(2)

So
Poly1305R(C0, …, Cℓ − 1)
= R ⋅ polyR(C0, …, Cℓ − 1) .

= ((⋯(((C0R + C1)R) + C2)R + ⋯)R + Cℓ − 1 .

(3)

(5)

+R polyR4(C3, C7, C11, …, C4ℓ′ − 1) .
Then
Poly1305R(C0, …, Cℓ − 1)
=

P
R polyR(P + Cℓ − ρ, Cℓ − ρ + 1, …, Cℓ − 1)

if ρ = 0;
if ρ > 0.

(6)

Define
⊤

R = R4, R3, R2, R ,
(7)

R4 = (R4, R4, R4, R4)⊤,
⊤

Ci = (C4i, C4i + 1, C4i + 2, C4i + 3) ,

for i = 0, 1, …, ℓ′ − 1.

The computation in (6) is described in vector form in Algorithm 1
(see Fig. 1). In the description of Algorithm 1 (Fig. 1), a temporary
vector T = (T 0, T 1, T 2, T 3) is used and ∘ denotes the Hadamard (i.e.
component-wise) product of vectors. The quantity P is a temporary
field element.
3.1 Vector multiplication
Recall that p = 2130 − 5 and so any element of F p can be
represented using 130 bits. Let θ = 226. An element X ∈ F p can be
written in the base θ as follows:

where 0 ≤ x0, …, x4 ≤ 226 − 1. Then (x4, x3, x2, x1, x0) is called a 5limb representation of X.
Let (x4, x3, x2, x1, x0) and (y4, y3, y2, y1, y0) be 5-limb representations
of X and Y, respectively. The product X ⋅ Y mod p is computed in
two steps.
Multiplication step: Z = z0 + z1θ + ⋯ + z4θ4 is obtained where
z0, …, z4 are defined as follows:

z1 = x0 ⋅ y1 + x1 ⋅ y0 + 5 ⋅ x2 ⋅ y4 + 5 ⋅ x3 ⋅ y3 + 5 ⋅ x4 ⋅ y2
z2 = x0 ⋅ y2 + x1 ⋅ y1 + x2 ⋅ y0 + 5 ⋅ x3 ⋅ y4 + 5 ⋅ x4 ⋅ y3

(4)

This requires ℓ − 1 multiplications and ℓ − 1 additions over F p. As
a result, Poly1305 can be computed using ℓ multiplications and
ℓ − 1 additions over F p.
Horner's rule is a sequential method of evaluation. One way to
exploit parallelism in the computation is to divide the sequence
522

+R2 polyR4(C2, C6, C10, …, C4ℓ′ − 2)

z0 = x0 ⋅ y0 + 5 ⋅ x1 ⋅ y4 + 5 ⋅ x2 ⋅ y3 + 5 ⋅ x3 ⋅ y2 + 5 ⋅ x4 ⋅ y1

The definition of poly in (2) permits the computation of the output
using Horner's rule in the following manner:
polyR(C0, …, Cℓ − 1)

+R3 polyR4(C1, C5, C9, …, C4ℓ′ − 3)

X = x0 + x1θ + x2θ2 + x3θ3 + x4θ4

Given C0, …, Cℓ − 1 ∈ F p and R, polyR(C0, …, Cℓ − 1) is defined as
follows:
polyR(C0, …, Cℓ − 1)

P = R4 polyR4(C0, C4, C8, …, C4ℓ′ − 4)

(8)

z3 = x0 ⋅ y3 + x1 ⋅ y2 + x2 ⋅ y1 + x3 ⋅ y0 + 5 ⋅ x4 ⋅ y4
z4 = x0 ⋅ y4 + x1 ⋅ y3 + x2 ⋅ y2 + x3 ⋅ y1 + x4 ⋅ y0 .
Note that each zi is less than 264. We define an operation mult,
where the vector (z0, …, z4) denotes the output of
mult((x4, …, x0), (y4, …, y0)).
Reduction step: W = w0 + w1θ + ⋯ + w4θ4 is obtained such that
W ≡ Z mod 4 and each wi can be represented using either 26 or 27
bits. By reduce(z4, …, z0), we will denote the vector (w4, …, w0).
For the details of the reduction step, we refer to [1].
IET Inf. Secur., 2020, Vol. 14 Iss. 5, pp. 521-530
© The Institution of Engineering and Technology 2020

Fig. 2 Packing of R4 and (5 ⋅ r4, 5 ⋅ r3, 5 ⋅ r2, 5 ⋅ r1) into two 256-bit
registers r0 and r1

Fig. 3 Packing of four 5-limb vectors in three 256-bit registers t0, t1 and
t2

Fig. 4 Packing the output of vecMult(R4, T) into five 256-bit registers
S0, …, S4

Table 1 Names and counts of the various Intel intrinsic
instructions required by vecMult(R4, T) along with the latency
and throughput figures on the Haswell and Skylake
processors
Intrinsic
Count (Latency, Throughput)
Skylake
Haswell
_mm256_mul_epu32
_mm256_set_epi32
_mm256_add_epi64
_mm256_permutevar8 × 32_epi32
_mm256_permute4 × 64_epi64

25
1
20
9
4

(5, 0.5)
—
(1, 0.33)
—
—

(5, 1)
—
(1, 0.5)
—
—

Fig. 5 Packing the output of vecReduce(S) into three 256-bit registers
w0, w1, w2

Suppose X is a fixed quantity and the product X ⋅ Y mod p is
required to be computed. The computation in (8) is helped by precomputing and storing (5 ⋅ x4, 5 ⋅ x3, 5 ⋅ x2, 5 ⋅ x1) along with the 5limb representation (x4, x3, x2, x1, x0) of X.
Vector multiplication: Algorithm 1 (Fig. 1) requires the vector
multiplication
R4 ∘ T = (R4 ⋅ T 0, R4 ⋅ T 1, R4 ⋅ T 2, R4 ⋅ T 3)⊤ .
Note that the multiplication in Step 3 of Algorithm 1 (Fig. 1) has
one of the operands to be fixed to R4 while the other operand
changes. Goll and Gueron [5] presented a very efficient SIMD
algorithm to perform this multiplication.
IET Inf. Secur., 2020, Vol. 14 Iss. 5, pp. 521-530
© The Institution of Engineering and Technology 2020

The vector T = (T 0, T 1, T 2, T 3) has four elements of F p. Each of
these elements has a 5-limb representation. Let (ti, 4, …, ti, 0) be the
5-limb representation of T i, i = 0, 1, 2, 3. So, a total of twenty 26bit quantities are required to store T. Since intermediate results are
not fully reduced, some of the ti, j’s can be 27-bit quantities. Let
(r4, …, r0) be the 5-limb representation of R4. Also, the vector
(5 ⋅ r4, 5 ⋅ r3, 5 ⋅ r2, 5 ⋅ r1) is stored.
The 4-way SIMD implementation of Goll and Gueron [5] uses
256-bit words. Each 256-bit word is considered to be eight 32-bit
words. So, the twenty 26-bit quantities of T can be stored in three
256-bit words. The vectors (r4, …, r0) and (5 ⋅ r4, 5 ⋅ r3, 5 ⋅ r2, 5 ⋅ r1)
can be stored in two 256-bit words. The multiplication W = R4 ∘ T
consists of two steps.
Vector multiplication step: This step takes as input the three
256-bit words representing T = (T 0, T 1, T 2, T 3) and the two 256-bit
words representing (r4, …, r0) and (5 ⋅ r4, 5 ⋅ r3, 5 ⋅ r2, 5 ⋅ r1). It
produces as output five 256-bit words S0, …, S4, where
Si = (si, 3, si, 2, si, 1, si, 0) and each si, j is a 64-bit word. Further,
(s0, j, s1, j, s2, j, s3, j, s4, j)
is
mult((r4, …, r0), (t j, 4, …, t j, 0))
for
j = 0, …, 3. Let S = (S0, …, S4). By vecMult(R4, T) we will denote
S.
Vector reduction step: This step takes as input S0, …, S4 and
produces as output three 256-bit words which store the twenty 26bit (or 27-bit) words of the result. Let us call the result as W. So,
W = (W 0, W 1, W 2, W 3). Now let us define W j = (w j, 4, …, w j, 0),
j = 0, 1, 2, 3.
Then
(w j, 4, …, w j, 0)
is
the
output
of
reduce(s0, j, s1, j, s2, j, s3, j, s4, j). W will denote vecReduce(S).
In terms of the above notation, the computation W = R4 ∘ T
consists of the following two steps: S ← vecMult(R4, T);
W ← vecReduce(S). Note that T is stored in three 256-bit words
and the output W is also stored in three 256-bit words. This ensures
that the same multiplication algorithm can be applied to multiply
R4 and W, and so on.
We provide the top-level schematics of vecMult(R4, T) and
vecReduce(S). The 5-limb representation (r4, r3, r2, r1, r0) of R4 and
(5 ⋅ r4, 5 ⋅ r3, 5 ⋅ r2, 5 ⋅ r1) are represented in two 256-bit words, as
shown in Fig. 2.
The vectorT = (T 0, T 1, T 2, T 3) is stored in three 256-bit words, as
shown in Fig. 3, where x denotes an undetermined quantity that is
not used in the algorithm.
The Intel AVX2 implementation of vecMult(R4, T), uses a
number of SIMD permutation operations on t0, t1 and t2 followed
by 32-bit SIMD multiplication operations with r0 and r1 and 64-bit
SIMD operations to accumulate the results. Finally, the result of
vecMult(R4, T) is (S0, …, S4) and is stored, as shown in Fig. 4.
The number of different Intel intrinsic instructions required by
vecMult(R4, T) is given in Table 1. The table also provides the
latency and throughput figures in cycles for the Skylake and
Haswell processors (These figures are available from the link
https://software.intel.com/sites/landingpage/IntrinsicsGuide/
(accessed on 12 February 2020). The corresponding figures for
Kaby Lake are not available.)
The vecReduce(S) implementation takes as input the five 256bit words S0, …, S4 and produces as output the vector
W = (W 0, W 1, W 2, W 3) stored in three 256-bit words w0, w1 and w2,
as shown in Fig. 5.
The evaluation in Step 7 of Algorithm 1 is also done using
vector operations. This is not clearly described in [5] and can be
understood from the accompanying code. Since Step 7 is not
relevant to our algorithm, we do not describe the details of its
computation.
3.2 Lazy reduction
Consider the loop in Steps 1–3 of Algorithm 1. Step 3 consists of
one 4-way field multiplication R4 ∘ T followed by a 4-way field
addition. As explained above, the operation R4 ∘ T can be realised
as vecReduce(vecMult(R4, T)). So, R4 ∘ T requires one vecMult
and one vecReduce operations. For long messages, it is possible to
improve this by using a lazy reduction strategy. Such a strategy
523

the
sequence
(D0, …, Dm − 1),
where
Di = Ci for i = 0, …, ℓ − 1 and if ρ > 0, then
Di =

Fig. 6 Algorithm 2: Structure of the new 4-way vectorisation of
Poly1305 computation. Refer to (12) for the definition of the vector
quantities

consists of performing a series of successive vecMult and 4-way
field additions followed by a single reduction. We provide more
details.
Steps 1–3 perform (ℓ′ − 1) 4-way field multiplications and 4way field additions. Suppose we bunch these operations into
groups where each group has λ 4-way field multiplications and λ 4way field additions. If λ does not divide ℓ′ − 1, the last group may
have a lesser number of such operations. With T initialised to C0,
the computation of the jth group, j = 0, …, ⌊(ℓ′ − 1)/λ⌋, processes
Cλ j + 1, …, Cλ j + λ. The actual computation is given as follows:
T ← R4λ ∘ T + R4(λ − 1) ∘ Cλ j + 1
+⋯ + R4 ∘ Cλ j + λ − 1 + Cλ j + λ .

ρ = 0,

In the definition Di = 0, the ‘0’ is the zero element of F p and not
the bit 0. The zero element of F p is represented in binary using a
zero block, which is a binary string consisting of 129 zero bits. In
other words, the sequence (C0, …, Cℓ − 1) is prepended using a
minimum number of zero blocks to make the length a multiple of
4. Since the initial zeros have no effect on the computation of
Poly1305R(D0, …, Dm − 1), we have
Poly1305R(D0, …, Dm − 1) = Poly1305R(C0, …, Cℓ − 1) .
Let us define m′ = m/4. Then, the computation
Poly1305R(D0, …, Dm − 1) can be written as follows:

Wk ← vecMult(R4k, Cλ j + k), k = 1, …, λ − 1;
B ← W0 + ⋯ + Wλ − 1 + Cλ j + λ;
T ← vecReduce(B) .
This method requires λ vecMult operations, λ 4-way field
additions and a single vecReduce operation. Compared to a direct
computation of (9), the lazy reduction strategy reduces the number
of vecReduce operations by roughly a factor of (λ − 1)/λ. This
would suggest that using a higher value of λ should always be
beneficial. This, however, is not the case. As λ increases, so does
the number of pre-computed quantities. The time for precomputation has to be taken into account. For a higher value of λ
not all the pre-computed quantities can be kept in the registers and
as a result, the number of load/store operations would increase
substantially. Also, adding too many of the products without a
reduction can lead to an overfull in the register. These reasons
prevent the use of high values of λ. In [5], the lazy reduction
strategy was used for messages of lengths at least 832 bytes and
with the values of λ to be 2 and 3.

4 New SIMD implementation of Poly1305
Algorithm 1 (Fig. 1) implements the computation in (5). If 4 ℓ (i.e.
ρ = 0), then the 4-way SIMD computation proceeds uniformly
throughout. However, if 4⧸ℓ, then the 4-way SIMD computation in
Algorithm 1 (Fig. 1) proceeds uniformly for ℓ′ steps. Additionally,
the computation in Step 7 is required making the computation nonuniform. For short messages, this leads to a significant penalty.
By making a simple modification, it can be ensured that the 4way SIMD proceeds uniformly throughout. As before, let
ρ = ℓ mod 4. If ρ = 0, let m = ℓ; and if ρ > 0, let m = ℓ + 4 − ρ.
Given the sequence (C0, …, Cℓ − 1) obtained as format(M), define

(10)
of

Poly1305R(D0, …, Dm − 1)
= R4polyR4(D0, D4, …, Dm − 4)

(9)

W0 ← vecMult(R4λ, T);

then

0, i = 0, …, 3 − ρ;
Ci − 4 + ρ, i = 4 − ρ, …, m − 1.

+R3polyR4(D1, D5, …, Dm − 3)

In the above, R4k = (R4k, R4k, R4k, R4k)⊤ for k = 1, …, λ. For the
multiplication by R4k, the field elements R4k and 5 ⋅ R4k are
precomputed and stored (only the first four limbs of 5 ⋅ R4k are
stored).
As written, (9) requires λ 4-way field multiplications and λ 4way field additions. Note however, that the results of the field
multiplications are simply added together. This suggests the
following lazy reduction strategy to perform the computation given
in (9)

524

if

(11)

+R2polyR4(D2, D6, …, Dm − 2)
+R1polyR4(D3, D7, …, Dm − 1) .
In a manner similar to (7), define
Di = (D4i, D4i + 1, D4i + 2, D4i + 3)⊤;

for i = 0, …, m′ − 1.

(12)

The computation in (11) is described in vector form in Algorithm 2
(see Fig. 6).
In Algorithm 2 (Fig. 6), the entire computation can be
performed using 4-way SIMD operations. In other words, by
prepending 0's, the structure of the computation becomes balanced.
It is possible to execute all the multiplications arising in Step 3
using vecMult( ⋅ , ⋅ ) followed by vecReduce( ⋅ ).
For the case ρ = 0, Step 7 of Algorithm 1 (Fig. 1) is not
executed. In this case, Algorithms 1 and 2 (Figs. 1 and 6) become
the same. For ρ = 3, there is a performance improvement of
Algorithm 2 (Fig. 6) over Algorithm 1 (Fig. 1). For the cases of
ρ = 1 and ρ = 2, the situation is more subtle. Directly using
vecMult for the first multiplication in Algorithm 2 (Fig. 6) does
not necessarily lead to speed gains. We address this issue in the
next section.
Remark: We have described Algorithm 2 (Fig. 6) for 4decimated Horner computation. The same idea easily extends to ddecimated Horner computation for any d ≥ 2.
4.1 Improved initial multiplication
In Algorithm 2 (Fig. 6), the first multiplication is R4 ∘ D0. Consider
D0 = (D0, D1, D2, D3)⊤. Depending on the value of ρ, there are four
cases.
(C0, C1, C2, C3)⊤ if ρ = 0;
D0 =

(0, 0, 0, C0)⊤
⊤

if ρ = 1;

(0, 0, C0, C1)

if ρ = 2;

(0, C0, C1, C2)⊤

if ρ = 3.

(13)

Suppose ρ = 1 so that D0 = (0, 0, 0, C0). Let the 5-limb
representation of C0 be given by (c0, 4, …, c0, 0). Consider the
schematic of the operation vecMult as discussed in Section 3.1.
The representation of D0 in the three 256-bit words t0, t1 and t2
will look as shown in Fig. 7 (where x is a do not care value).
Since a lot of entries in the above representation are zeros,
applying the Goll–Gueron vecMult operation to R4 and this D0
IET Inf. Secur., 2020, Vol. 14 Iss. 5, pp. 521-530
© The Institution of Engineering and Technology 2020

Fig. 7 Packing of the input for ρ = 1 if the Goll–Gueron strategy is followed for the first vector multiplication, i.e. for the initial multiplication

Fig. 8 Packing of the input used for the initial multiplication for ρ = 1 by the new algorithm

Fig. 9 Packing the output of the multiplication for ρ = 1 into five 256-bit registers S0, …, S4

Fig. 10 Packing of the input used for the initial multiplication for ρ = 2 by the new algorithm

Fig. 11 Packing the output of the multiplication for ρ = 2 into five 256-bit
registers S0, …, S4

will result in the execution of a number of 32-bit multiplication
operations whose results are known to be zero. By using a different
representation for D0 and a different multiplication algorithm, it is
possible to obtain the desired output using a substantially lower
number of 32-bit SIMD multiplication operations. This leads to
speed improvement.
A similar analysis shows that it is possible to obtain speed
improvement also for the case ρ = 2 for which D0 = (0, 0, C0, C1).
When ρ = 3, D0 = (0, C0, C1, C2), and in this case, the number of
zeros is not sufficient to provide any improvement by using a
multiplication algorithm different from vecMult. Below we
provide the top-level schematics of the improved initial
multiplication algorithms for the cases of ρ = 1 and ρ = 2.
Representation of D0 for the case ρ = 1: In this case,
D0 = (0, 0, 0, C0) is represented using two 256-bit words, as shown
in Fig. 8. The five 256-bit words holding the output of
vecMult(R4, D0) in this case will have the form, as shown in Fig. 9.

IET Inf. Secur., 2020, Vol. 14 Iss. 5, pp. 521-530
© The Institution of Engineering and Technology 2020

Representation of D0 for the case ρ = 2: In this case,
D0 = (0, 0, C0, C1) is represented using two 256-bit words, as shown
in Fig. 10. The five 256-bit words holding the output of
vecMult(R4, D0) in this case will have the form, as shown in
Fig. 11.
In both the cases, the representation of R4 using two 256-bit
words r0 and r1 remain the same as in Section 3.1. The
multiplication algorithms for the above two cases apply SIMD
permutations, 32-bit SIMD multiplications and 64-bit SIMD
additions to produce the required output S in five 256-bit words as
shown above. The vecReduce algorithm mentioned in Section 3.1
is applied to S to obtain the result R4 ∘ D0. The output of
vecReduce is in the form of three 256-bit words which is stored in
T. The further multiplications R4 ∘ T in the loop at Step 4 of
Algorithm 2 (Fig. 6) are performed using the algorithm vecMult
and vecReduce.
Remark: For the case ρ = 2, there are several variants of the
initial multiplication algorithm which avoid multiplications by
zero. For all such variants, the number of _mm256_mul_epu32 is
13.
4.2 Lazy reduction
The lazy reduction strategy described in Section 3.2 applies to the
loop in Steps 1–3 of Algorithm 2 (Fig. 6). The computation is
divided into groups where each group processes λ of the Di’s. As in
the case of Algorithm 1 (Fig. 1), the lazy reduction strategy
requires only a 1/λ fraction of the number of reductions required in
a direct implementation of Algorithm 2 (Fig. 6). Following the
code for [5], we have incorporated the lazy reduction strategy for
messages having at least 832 bytes with values of λ to be 2 or 3.

525

Table 2 Summary of the comparative performance analysis of the new code with the code of [5] using the gcc compiler
Processor
Range
Max speed-up, %
Avg speed-up, %
Haswell

Skylake

Kaby Lake

49B–1000B
1001B–2000B
2001B–4000B
49B–1000B
1001B–2000B
2001B–4000B
49B–1000B
1001B–2000B
2001B–4000B

29.70
22.31
15.15
36.81
15.24
12.66
35.33
21.49
21.17

12.06
12.46
9.06
15.05
6.94
3.29
13.12
12.94
10.51

Table 3 Summary of the comparative performance analysis of the new code with the code of [5] using the clang compiler
Haswell

Skylake

Kaby Lake

49B–1000B
1001B–2000B
2001B–4000B
49B–1000B
1001B–2000B
2001B–4000B
49B–1000B
1001B–2000B
2001B–4000B

5 Implementation and comparison
We have implemented the SIMD strategy given in Algorithm 2
(Fig. 6) for the evaluation of the Poly1305 hash function. This
implementation consisted of modifying the Intel intrinsic code
implementing the SIMD strategy in [5]. Portions of the code were
used without any change. In particular, the basic 4-way
multiplication routine of [5] has been directly used. On the other
hand, the improved initial multiplication algorithms are new to our
SIMD strategy and had to be implemented. The modified code is
publicly available at the following link: https://github.com/Sreyosi/
Improved-SIMD-Implementation-of-Poly1305.
The performance has been measured in terms of the number of
machine cycles per byte under the same conditions as mentioned in
[5]: Intel Turbo Boost Technology, Intel Hyper-Threading
Technology and Intel Speedstep Technology were disabled. For a
comparative study, performances of the new code and that of the
code accompanying [5] were measured using the same strategy and
on the same computers.
Measurements were made on the following platforms:
• Haswell: Intel Core i7-4790 CPU @ 3.60 GHz x 8; running
Ubuntu 18.04.2 LTS (64-bit); gcc version 7.4.0; Clang version
8.4.
• Skylake: Intel Core i7-6500U CPU @ 2.50 GHz x 2; running
Ubuntu 14.04 LTS (64-bit); gcc version 5.5.0; Clang version 8.4.
• Kaby Lake: Intel Core i7-7700U CPU @ 3.60 GHz x 4; running
Ubuntu 18.04 LTS (64-bit); gcc version 7.3.0; Clang version 8.4.
In all cases, measurements were made on a single core of the
specified machines. The compile commands used are as follows:
• gcc -mavx2 -O3 -fomit-frame-pointer,
• clang -mavx2 -O2 -fomit-frame-pointer.
We note that in our experiments, we have found that for the clang
compiler, in general, the-O2 option yields a faster code than the-O3
option. So, we have carried out all measurements using the-O2
option.
Message length: For measuring performance and comparison to
[5], we considered messages with lengths up to 4KB (See http://
www.caida.org/data/passive/trace_stats/nyc-A/2019/equinixnyc.dirA.20190117-130000.UTC.df.xml of the Center for Applied
Internet Data Analysis for the relevance of short messages in IPv4
and IPv6 traffic. Accessed on 27 June 2019.). If the number of 16526

23.33%
14.52%
10.20%
29.61%
21.49%
20.45%
29.43%
18.58%
10.84%

12.92%
5.27%
2.44%
12.95%
8.99%
5.60%
10.57%
4.54%
1.64%

byte blocks in the padded message is a multiple of 4, then the new
code becomes exactly the Goll–Gueron code. Consequently, there
is no difference in performance for such message lengths.
In view of the above, for the purpose of comparing the
performance of the new code to the Goll–Gueron code, we
considered message lengths from 49 bytes to 4000 bytes, which are
not divisible by 64. For each message length, we have taken
measurements of both the Goll–Gueron code and the new code.
Suppose that for a message length l bytes, the Goll–Gueron code
requires t0 cycles/byte and the new code requires t1 cycles/byte.
Then the speed-up (in percentage) attained for message length l is
sul = 100(t1 − t0)/t0. The average speed-up is the average of all the
sul’s.
A top-level summary of the comparison using gcc compiler is
as follows:
• Haswell: speed-up has been obtained in 99.87% cases of the
message lengths that were considered; in 0.07% cases, the
performances of both the codes were the same; in 0.05% cases,
the new code has shown a slight slowdown.
• Skylake: speed-up has been obtained in 89.51% cases of the
message lengths that were considered; in 6.27% cases, the
performances of both the codes were the same; in 4.21% cases,
the new code has shown a slight slowdown.
• Kaby Lake: speed-up has been obtained in 99.66% cases of the
message lengths that were considered; in 0.17% cases, the
performances of both the codes were the same; in 0.15% cases,
the new code has shown a slight slowdown.
Tables 2 and 3 provide a summary of the maximum and average
speed-ups for the three platforms for three different ranges of
message lengths. Table 2 has been obtained using the gcc compiler,
while Table 3 has been obtained using the clang compiler. The
tables show that for short messages, on an average of about 10–
15% speed-up is obtained.
The actual values of cycles/byte that were obtained for the new
code and the code of [5] for 12 different message lengths are
shown in Tables 4–6. Measurements for both clang and gcc are
provided. The message lengths were chosen to show variations in
the improvements obtained by the new code over the code of [5].
Providing measurements for all lengths up to 4000 bytes using
tables is cumbersome. So, we provide plots of such data. Two
kinds of plots are provided for each of the processors. The first
kind plots the speed-up with the size of the message, while the
second kind plots the values of cycles/byte with the message size.
IET Inf. Secur., 2020, Vol. 14 Iss. 5, pp. 521-530
© The Institution of Engineering and Technology 2020

Table 4 Performance on Haswell using clang and gcc
Length
clang
Goll–Gueron
New
Speed-up, %
56
80
96
112
160
224
240
496
600
800
1000
2000

4.71
3.85
3.33
3.02
3.32
2.00
2.08
1.43
1.37
1.14
1.26
0.92

4.36
4.00
3.25
2.61
2.96
1.88
1.78
1.29
1.21
1.00
1.07
0.92

7.43
3.75
2.40
13.57
10.84
6.00
14.42
9.79
11.67
12.28
15.07
0

Table 5 Performance on Skylake using clang and gcc
Length
clang
Goll–Gueron
New
Speed-up, %
56
80
96
112
160
224
240
496
600
800
1000
2000

4.57
3.48
2.98
3.02
2.17
2.92
1.90
1.26
1.20
0.99
1.03
0.78

4.11
3.40
2.92
2.61
2.09
2.08
1.62
1.14
1.07
0.95
0.93
0.77

10.06
2.29
2.01
13.57
3.68
28.76
14.73
9.52
10.83
4.04
9.70
1.28

Table 6 Performance on Kaby Lake using clang and gcc
Length
clang
Goll–Gueron
New
Speed-up, %
56
80
96
112
160
224
240
496
600
800
1000
2000

4.57
3.55
2.94
2.98
2.17
1.78
1.85
1.21
1.17
0.94
1.04
0.78

4.11
3.50
2.90
2.64
2.09
1.67
1.61
1.12
1.08
0.97
0.93
0.77

10.06
1.40
1.36
11.04
3.68
6.17
12.97
7.43
7.69
−3.19
11.00
1.28

These plots are provided only for the gcc compiler. The plots for
the clang compiler are similar and so we do not provide such plots.
For Haswell, Fig. 12 shows the plot of the speed-up of the new
code over the Goll–Gueron code as the message length varies; the
actual values of the cycles/byte measure are shown in Figs. 13–15.
For Skylake, Fig. 16 shows the plot of the speed-up of the new
code over the Goll–Gueron code as the message length varies; the
actual values of the cycles/byte measure are shown in Figs. 17–19.
For Kaby Lake, Fig. 20 shows the plot of the speed-up of the new
code over the Goll–Gueron code as the message length varies; the
actual values of the cycles/byte measure are shown in Figs. 21–23.
Comparison on other processors: The comparative
measurements show the efficiency gain obtained by the new code
over the code of [5]. Due to the availability of computational
resources, we have been able to perform measurements only on
IET Inf. Secur., 2020, Vol. 14 Iss. 5, pp. 521-530
© The Institution of Engineering and Technology 2020

Goll–Gueron

gcc
New

Speed-up, %

4.71
3.70
3.17
3.11
2.30
1.88
1.93
1.37
1.31
1.09
1.28
0.94

4.36
3.50
2.96
2.64
2.17
1.79
1.70
1.24
1.21
1.11
1.17
0.89

7.43
5.40
6.62
15.11
5.62
4.78
11.92
9.48
7.63
−1.83
8.59
5.31

Goll–Gueron

gcc
New

Speed-up, %

4.39
3.33
2.77
2.80
2.05
1.66
1.76
1.21
1.19
0.97
1.01
0.76

3.79
3.23
2.75
2.43
1.94
1.61
1.56
1.14
1.08
0.98
0.92
0.78

13.66
3.00
0.72
13.21
5.36
3.01
11.36
5.78
9.24
-1.03
8.91
-2.63

Goll–Gueron

gcc
New

Speed-up, %

4.43
3.27
2.79
2.86
2.05
1.64
1.75
1.2
1.16
0.94
1.02
0.77

3.79
3.15
2.58
2.27
1.90
1.54
1.44
1.1
1.05
0.97
0.93
0.77

14.44
3.66
7.52
20.62
7.31
6.09
17.71
8.33
9.48
−3.19
8.82
0

three Intel processors. The efficiency improvements, though, are a
result of modifying the algorithm for computing Poly1305 to
better fit vectorised implementations. So, other processors that
support vector implementations should also profit from the new
algorithm.
Alternative implementations of Poly1305: A reviewer of the
paper has pointed out that [7, 8] provide alternative
implementations of Poly1305. The implementation in [7] is faster
than the implementation in [8], so, we consider only [7].
It is mentioned in [7] that the paper utilises ‘the same
optimisation ideas used by the best implementations’, namely those
adopted in OpenSSL and in work by Goll and Gueron [5]. The
implementation strategy in [7] is the following. For messages of
lengths up to 256 bytes, a 64-bit implementation is used. For
messages of lengths greater than 256 bytes, vector implementation
527

Fig. 12 Speed-up (%) versus message (bytes) for gcc and Haswell

Fig. 13 cycles/byte versus message size (49–1000 bytes) for gcc and
Haswell

Fig. 14 cycles/byte versus message size (1001–2000 bytes) for gcc and
Haswell

is used in a manner that is similar to that of the Goll–Gueron
strategy. Suppose there are ℓ blocks, and as before let ρ = ℓ mod 4
and ℓ′ = (ℓ − ρ)/4. The vector evaluation is done four blocks at a
time up to ℓ′ blocks. Then the result is reformatted into 64-bit
words and the remaining ρ blocks are processed using 64-bit
arithmetic in a manner similar to the processing of messages of
lengths up to 256 bytes.
The implementation strategy that we suggest, namely to balance
out the 4-way SIMD execution by pre-pending some zero blocks, is
not present in [7]. If this strategy is adopted into the
implementation of [7], then the reformatting into 64-bit words and
processing the left-over message consisting of ρ blocks using 64bit arithmetic will not be required.
528

Fig. 15 cycles/byte versus message size (2001–4000 bytes) graph for gcc
and Haswell

Fig. 16 Speed-up (%) versus message size (bytes) graph for gcc and
Skylake

Fig. 17 cycles/byte versus message size (49–1000 bytes) graph for gcc
and Skylake

The code for [7] has been written in the Jasmin language and
the Jasmin compiler has been used to generate the assembly,
which, according to the authors, is ‘as efficient as a hand-written
assembly’. The code that we use, on the other hand, is in Intel
intrinsic and compiled using gcc/clang. In general, intrinsic
compiled code is slower than hand-written assembly and this holds
for the comparison between the Jasmin compiled code and our
code. A possible future work would be to code the strategy for
balancing 4-way SIMD that we suggest into Jasmin and then
perform a comparison with the code of [7].

IET Inf. Secur., 2020, Vol. 14 Iss. 5, pp. 521-530
© The Institution of Engineering and Technology 2020

Fig. 18 cycles/byte versus message size (1001–2000 bytes) graph for gcc
and Skylake

Fig. 21 cycles/byte versus message size (49–1000 bytes) graph for gcc
and Kaby Lake

Fig. 19 cycles/byte versus message size (2001–4000 bytes) graph for gcc
and Skylake

Fig. 22 cycles/byte versus message size (1001–2000 bytes) graph for gcc
and Kaby Lake

Fig. 20 Speed-up (%) versus message size (bytes) graph for gcc and Kaby
Lake

Fig. 23 cycles/byte versus message size (2001–4000 bytes) graph for gcc
and Kaby Lake

6 Conclusion

earlier version of this paper. They also thank the reviewers for their
kind comments, which have helped in improving the paper.

In this work, we have proposed a simple modification to the
previous Goll–Gueron strategy for SIMD implementations of the
Poly1305 algorithm. Implementation of the modified algorithm
shows noticeable speed improvements on modern Intel processors
for short messages when the number of blocks is not a multiple of
four.

7 Acknowledgment
The authors thank Shay Gueron for kindly sharing the code
associated with [5] with them and for providing comments on an
IET Inf. Secur., 2020, Vol. 14 Iss. 5, pp. 521-530
© The Institution of Engineering and Technology 2020

8
[1]
[2]
[3]

References
Bernstein, D.J.: ‘The Poly1305-AES-aes message-authentication code’. Fast
Software Encryption: 12th Int. Workshop, FSE 2005, Paris, France, 21–23
February 2005, Revised Selected Papers, (LNCS, 3557), pp. 32–49
Bernstein, D.J.: ‘Chacha, a variant of Salsa20’. Workshop Record of SASC
2008: The State of the Art of Stream Ciphers, Lausanne, Switzerland, January
2008. Available at http://cr.yp.to/chacha/chacha-20080128.pdf
Bernstein, D.J., Lange, T., Schwabe, P.: ‘The security impact of a new
cryptographic library’. Progress in Cryptology – LATINCRYPT 2012 – 2nd

529

[4]
[5]
[6]

530

Int. Conf. on Cryptology and Information Security in Latin America,
Santiago, Chile, 7–10 October 2012 (LNCS, 7533), pp. 159–176
Bernstein, D.J.: ‘The Salsa20 family of stream ciphers’. Available at http://
cr.yp.to/papers.html#salsafamily.
Document
ID:
31364286077dcdff8e4509f9ff3139ad. Date: 2007.12.25
Goll, M., Gueron, S.: ‘Vectorization of Poly1305 message authentication
code’. 2015 12th Int. Conf. on Information Technology – New Generations,
Las Vegas, NV, USA, April 2015, pp. 145–150, doi: 10.1109/ITNG.2015.28
Chakraborty, D., Ghosh, S., Sarkar, P.: ‘A fast single-key two-level universal
hash function’, IACR Trans. Symmetric Cryptol., 2017, 2017, (1), pp. 106–
128

[7]
[8]

Almeida, J.B., Barbosa, M., Barthe, G., et al.: ‘The last mile: high-assurance
and high-speed cryptographic implementations’. CoRR, abs/1904.04606,
2019
Delignat-Lavaud, A., Fournet, C., Kohlweiss, M., et al.: ‘Implementing and
proving the TLS 1.3 record layer’. 2017 IEEE Symp. on Security and Privacy,
SP 2017, San Jose, CA, USA, 22–26 May 2017, pp. 463–482

IET Inf. Secur., 2020, Vol. 14 Iss. 5, pp. 521-530
© The Institution of Engineering and Technology 2020

