Abstract. Poly1305-AES is a state-of-the-art message-authentication code suitable for a wide variety of applications. Poly1305-AES computes a 16-byte authenticator of a variable-length message, using a 16-byte AES key, a 16-byte additional key, and a 16-byte nonce. The security of Poly1305-AES is very close to the security of AES; the security gap is at most $14D\lceil L/16\rceil/2^{106}$ if messages have at most $L$ bytes, the attacker sees at most $2^{64}$ authenticated messages, and the attacker attempts $D$ forgeries. Poly1305-AES can be computed at extremely high speed: for example, fewer than $3.1\ell + 780$ Athlon cycles for an $\ell$-byte message. This speed is achieved without precomputation; consequently, 1000 keys can be handled simultaneously without cache misses. Special-purpose hardware can compute Poly1305-AES at even higher speed. Poly1305-AES is parallelizable, incremental, and not subject to any intellectual-property claims.
Introduction
This paper introduces and analyzes Poly1305-AES, a state-of-the-art secret-key message-authentication code suitable for a wide variety of applications.
Poly1305-AES computes a 16-byte authenticator $\mathrm{Poly1305}_r(m, \mathrm{AES}_k(n))$ of a variable-length message $m$, using a 16-byte AES key $k$, a 16-byte additional key $r$, and a 16-byte nonce $n$. Section 2 of this paper presents the complete definition of Poly1305-AES.
Poly1305-AES has several useful features:
• Guaranteed security if AES is secure. The security gap is small, even for long-term keys; the only way for an attacker to break Poly1305-AES is to break AES. Assume, for example, that messages are packets up to 1024 bytes; that the attacker sees $2^{64}$ messages authenticated under a Poly1305-AES key; that the attacker attempts a whopping $2^{75}$ forgeries; and that the attacker cannot break AES with probability above $\delta$. Then, with probability at least $0.999999 - \delta$, all of the $2^{75}$ forgeries are rejected.
Messages
Poly1305-AES authenticates messages. A message is any sequence of bytes $m[0], m[1], \ldots, m[\ell-1]$; a byte is any element of $\{0, 1, \ldots, 255\}$. The length $\ell$ can be any nonnegative integer, and can vary from one message to another.
Keys
Poly1305-AES authenticates messages using a 32-byte secret key shared by the message sender and the message receiver. The key has two parts: first, a 16-byte AES key $k$; second, a 16-byte string $r[0], r[1], \ldots, r[15]$. The second part of the key represents a 128-bit integer $r$ in unsigned little-endian form: i.e., $r = r[0] + 2^{8} r[1] + \cdots + 2^{120} r[15]$. Certain bits of $r$ are required to be 0: $r[3], r[7], r[11], r[15]$ are required to have their top four bits clear (i.e., to be in $\{0, 1, \ldots, 15\}$), and $r[4], r[8], r[12]$ are required to have their bottom two bits clear (i.e., to be in $\{0, 4, 8, \ldots, 252\}$). Thus there are $2^{106}$ possibilities for $r$. In other words, $r$ is required to have the form $r_0 + r_1 + r_2 + r_3$ where $r_0 \in \{0, 1, 2, 3, \ldots, 2^{28}-1\}$, $r_1/2^{32} \in \{0, 4, 8, 12, \ldots, 2^{28}-4\}$, $r_2/2^{64} \in \{0, 4, 8, 12, \ldots, 2^{28}-4\}$, and $r_3/2^{96} \in \{0, 4, 8, 12, \ldots, 2^{28}-4\}$.
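The byte-level constraints on $r$ can be checked mechanically. The following Python sketch is illustrative only: the helper names `is_valid_r`, `clamp_r`, and `free_bits` are hypothetical, and forcing an arbitrary 16-byte string into the required form by masking is an assumed convention, not something this section prescribes.

```python
# Constraints on the 16-byte string r[0..15] from the "Keys" section.
TOP4_CLEAR = (3, 7, 11, 15)   # these bytes must lie in {0, ..., 15}
LOW2_CLEAR = (4, 8, 12)       # these bytes must be multiples of 4

def is_valid_r(r: bytes) -> bool:
    """Return True if r has the form required of the second key half."""
    if len(r) != 16:
        return False
    return (all(r[i] < 16 for i in TOP4_CLEAR)
            and all(r[i] % 4 == 0 for i in LOW2_CLEAR))

def clamp_r(raw: bytes) -> bytes:
    """Assumed convention: clear the forbidden bits of 16 arbitrary bytes."""
    if len(raw) != 16:
        raise ValueError("r must be 16 bytes")
    out = bytearray(raw)
    for i in TOP4_CLEAR:
        out[i] &= 0x0F
    for i in LOW2_CLEAR:
        out[i] &= 0xFC
    return bytes(out)

def free_bits() -> int:
    """Count the bits of r that are not forced to 0."""
    mask = int.from_bytes(clamp_r(b"\xff" * 16), "little")
    return bin(mask).count("1")
```

Here `free_bits()` evaluates to 106, which agrees with the count of $2^{106}$ possibilities for $r$.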
Nonces
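Poly1305-AES requires each message to be accompanied by a 16-byte nonce, i.e., a unique message number. Poly1305-AES feeds each nonce $n$ through $\mathrm{AES}_k$ to obtain the 16-byte string $\mathrm{AES}_k(n)$. There is nothing special about AES here. One can replace AES with an arbitrary keyed function from an arbitrary set of nonces to 16-byte strings. This paper focuses on AES for concreteness.
Conversion and padding
Let $m[0], m[1], \ldots, m[\ell-1]$ be a message, and write $q = \lceil \ell/16\rceil$. Define integers $c_1, c_2, \ldots, c_q$ as follows: if $1 \le i \le \lfloor \ell/16\rfloor$ then $c_i = m[16i-16] + 2^{8} m[16i-15] + \cdots + 2^{120} m[16i-1] + 2^{128}$; if $\ell$ is not a multiple of 16 then $c_q = m[16q-16] + 2^{8} m[16q-15] + \cdots + 2^{8(\ell \bmod 16)-8} m[\ell-1] + 2^{8(\ell \bmod 16)}$. In other words: Pad each 16-byte chunk of a message to 17 bytes by appending a 1. If the message has a final chunk between 1 and 15 bytes, append 1 to the chunk, and then zero-pad the chunk to 17 bytes. Either way, treat the resulting 17-byte chunk as an unsigned little-endian integer.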
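The padding rule (append a 1 byte to every chunk, zero-pad a short final chunk to 17 bytes, and read the result as an unsigned little-endian integer) can be sketched in Python as follows. The function name `message_to_chunks` is hypothetical and chosen only for this illustration.

```python
def message_to_chunks(m: bytes) -> list:
    """Convert a message into the integers c_1, ..., c_q, where q = ceil(len(m)/16)."""
    chunks = []
    for start in range(0, len(m), 16):
        block = m[start:start + 16]          # between 1 and 16 bytes
        padded = block + b"\x01"             # append a 1 byte
        padded = padded.ljust(17, b"\x00")   # zero-pad a short final chunk to 17 bytes
        chunks.append(int.from_bytes(padded, "little"))
    return chunks
```

For instance, the two-byte message `f3 f6` becomes the single integer `0x1f6f3`, and an empty message produces no chunks at all.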

Authenticators
$\mathrm{Poly1305}_r(m, \mathrm{AES}_k(n))$, the Poly1305-AES authenticator of a message $m$ with nonce $n$ under secret key $(k, r)$, is defined as the 16-byte unsigned little-endian representation of
$$\bigl(\bigl((c_1 r^{q} + c_2 r^{q-1} + \cdots + c_q r^{1}) \bmod (2^{130}-5)\bigr) + \mathrm{AES}_k(n)\bigr) \bmod 2^{128}.$$
Here the 16-byte string $\mathrm{AES}_k(n)$ is treated as an unsigned little-endian integer, and $c_1, c_2, \ldots, c_q$ are the integers defined above. See Appendix B for examples.
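As an illustration of the definition, the following Python sketch evaluates the authenticator directly from the formula. It is not the paper's own sample code. The function name `poly1305` is hypothetical, and the 16-byte value $\mathrm{AES}_k(n)$ is passed in as the argument `s` rather than computed, since the Python standard library does not provide AES.

```python
P = (1 << 130) - 5  # the prime 2^130 - 5

def poly1305(r: bytes, m: bytes, s: bytes) -> bytes:
    """Evaluate the authenticator given r, m, and s = AES_k(n) (16 bytes each for r, s)."""
    r_int = int.from_bytes(r, "little")
    s_int = int.from_bytes(s, "little")
    # c_1, ..., c_q: each chunk with a 1 byte appended, read little-endian.
    cs = [int.from_bytes(m[i:i + 16] + b"\x01", "little")
          for i in range(0, len(m), 16)]
    q = len(cs)
    # c_1 r^q + c_2 r^(q-1) + ... + c_q r^1, reduced modulo 2^130 - 5.
    total = sum(c * pow(r_int, q - i, P) for i, c in enumerate(cs)) % P
    return ((total + s_int) % (1 << 128)).to_bytes(16, "little")
```

With `r = 85 1f c4 0c 34 67 ac 0b e0 5c c2 04 04 f3 f7 00`, `m = f3 f6`, and `s = 58 0b 3b 0f 94 47 bb 1e 69 d0 95 b5 92 8b 6d bc`, this sketch returns `f4 c6 33 c3 04 4f c1 45 f8 4f 33 5c b8 19 53 de`. For an empty message it returns `s` unchanged.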
Sample code
… $2^{106}$ (here $C$ is the number of messages authenticated by the sender, $D$ is the number of forgery attempts, and $L$ is the maximum message length) and thus cannot be used with confidence for large $C$. Second, nonces allow the invocation of AES to be carried out in parallel with most of the other operations in Poly1305-AES, reducing latency in many contexts. Third, most protocols have nonces anyway, for a variety of reasons: nonces are required for secure encryption, for example, and nonces allow trivial rejection of replayed messages.
I constrained $r$ to simplify and accelerate implementations of Poly1305-AES in various contexts. A wider range of $r$ (e.g., all 128-bit integers) would allow a quantitatively better security bound, but the current bound $DL/2^{106}$ will be perfectly satisfactory for the foreseeable future, whereas slower authenticator computations would not be perfectly satisfactory.
I chose little-endian instead of big-endian to improve overall performance. Little-endian saves time on the most popular CPUs (the Pentium and Athlon) while making no difference on most other CPUs (the PowerPC, for example, and the UltraSPARC).
The definition of Poly1305-AES could easily be extended from byte strings to bit strings, but there is no apparent benefit of doing so.
Security
This section discusses the security of Poly1305-AES.
Responsibilities of the user
Any protocol that uses Poly1305-AES must ensure unpredictability of the secret key $(k, r)$. This section assumes that secret keys are chosen from the uniform distribution: i.e., probability $2^{-234}$ for each of the $2^{234}$ possible pairs $(k, r)$. Any protocol that uses Poly1305-AES must ensure that the secret key is, in fact, kept secret. This section assumes that all operations are independent of $(k, r)$, except for the computation of authenticators by the sender and receiver.
(There are safe ways to reuse k for encryption, but those ways are not analyzed in this paper.)
The sender must never use the same nonce for two different messages. The simplest way to achieve this is for the sender to use an increasing sequence of nonces in, e.g., reverse-lexicographic order of 16-byte strings. (Problem: If a key is stored on disk, while increasing nonce values are stored in memory, what happens when the power goes out? Solution: Store a safe nonce value (a new nonce larger than any nonce used) on disk alongside the key.) Any protocol that uses Poly1305-AES must specify a mechanism of nonce generation and maintenance that prevents duplicates.
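One way to picture the "safe nonce on disk" idea is the Python sketch below. Everything in it is an assumption made for illustration: the class name `NonceCounter`, the use of a dictionary as a stand-in for disk storage, the `reserve` batch size, and the reading of nonces as little-endian counters.

```python
class NonceCounter:
    """Hypothetical sketch: strictly increasing 16-byte nonces with a persisted safe value."""

    def __init__(self, store: dict, reserve: int = 1000):
        self.store = store                      # stand-in for disk storage next to the key
        self.reserve = reserve                  # how many nonces each "disk write" covers
        self.next_value = store.get("safe_nonce", 0)
        self.limit = self.next_value            # forces a store update before first use

    def next_nonce(self) -> bytes:
        if self.next_value >= self.limit:
            # Persist a value larger than any nonce about to be handed out.
            self.limit = self.next_value + self.reserve
            self.store["safe_nonce"] = self.limit
        value = self.next_value
        self.next_value += 1
        return value.to_bytes(16, "little")
```

After a simulated power loss, a new `NonceCounter` built from the same store resumes at the persisted safe value, so it skips some unused nonces but never repeats one.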
Security guarantee
Poly1305-AES guarantees that the only way for the attacker to find an $(n, m, a)$ such that $a = \mathrm{Poly1305}_r(m, \mathrm{AES}_k(n))$, other than the authenticated messages $(n, m, a)$ sent by the sender, is to break AES. If the attacker cannot break AES, and the receiver discards all $(n, m, a)$ such that $a \ne \mathrm{Poly1305}_r(m, \mathrm{AES}_k(n))$, then the receiver will see only messages authenticated by the sender.
This guarantee is not limited to "meaningful" messages m. It is true even if the attacker can see all the authenticated messages sent by the sender. It is true even if the attacker can see whether the receiver accepts a forgery. It is true even if the attacker can influence the sender's choice of messages and unique nonces. (But it is not true if the nonce-uniqueness rule is violated.)
Here is a quantitative form of the guarantee. Assume that the attacker sees at most $C$ authenticated messages and attempts at most $D$ forgeries. Assume that the attacker has probability at most $\delta$ of distinguishing $\mathrm{AES}_k$ from a uniform random permutation after $C + D$ queries. Assume that all messages have length at most $L$. Then, with probability at least
$$1 - \delta - \frac{(1 - C/2^{128})^{-(C+1)/2} \cdot 8D\lceil L/16\rceil}{2^{106}},$$
all of the attacker's forgeries are discarded. In particular, if $C \le 2^{64}$, then the attacker's chance of success is at most $\delta + 1.649 \cdot 8D\lceil L/16\rceil/2^{106} < \delta + 14D\lceil L/16\rceil/2^{106}$. The most important design goal of AES was for $\delta$ to be small. There is, however, no hope of proving that $\delta$ is small. Perhaps AES will be broken someday.
If that happens, users should switch to Poly1305-AnotherFunction. Poly1305-AnotherFunction provides the same security guarantee relative to the security of AnotherFunction.
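To make the size of the security gap concrete, the following Python sketch evaluates the bound $14D\lceil L/16\rceil/2^{106}$ from the abstract with exact rational arithmetic. The function name `forgery_bound` is hypothetical, and the bound is assumed to apply only when the attacker sees at most $2^{64}$ authenticated messages.

```python
from fractions import Fraction

def forgery_bound(D: int, L: int) -> Fraction:
    """Security gap 14 * D * ceil(L/16) / 2^106, assuming C <= 2^64."""
    chunks = -(-L // 16)                  # ceil(L / 16) using integer arithmetic
    return Fraction(14 * D * chunks, 2 ** 106)

# Example from the introduction: 1024-byte packets, 2^75 forgery attempts.
gap = forgery_bound(D=2 ** 75, L=1024)
```

Here `gap` equals $7/2^{24}$, roughly $4.2 \cdot 10^{-7}$, which is below $10^{-6}$ and therefore consistent with the figure $0.999999 - \delta$ quoted in the introduction.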
Proof of the security guarantee
For each message $m$, write $\underline{m}$ for the polynomial $c_1 x^{q} + c_2 x^{q-1} + \cdots + c_q x^{1}$, where $q, c_1, c_2, \ldots, c_q$ are defined as in Section 2. Define $H_r(m)$ as the 16-byte unsigned little-endian representation of $(\underline{m}(r) \bmod (2^{130}-5)) \bmod 2^{128}$; note that $H_r$ and $k$ are independent. Define a group operation $+$ on 16-byte strings as addition modulo $2^{128}$, where each 16-byte string is viewed as the unsigned little-endian representation of an integer in $\{0, 1, 2, \ldots, 2^{128}-1\}$. Then the authenticator $\mathrm{Poly1305}_r(m, \mathrm{AES}_k(n))$ is equal to $H_r(m) + \mathrm{AES}_k(n)$.
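The crucial property of $H_r$ is that it has small differential probabilities: if $g$ is a 16-byte string, and $m, m'$ are distinct messages of length at most $L$, then $H_r(m) = H_r(m') + g$ with probability at most $8\lceil L/16\rceil/2^{106}$. See below. Theorem 5.4 of [8] now guarantees that $H_r(m) + \mathrm{AES}_k(n)$ is secure if AES is secure: specifically, that the attacker's success chance against $H_r(m) + \mathrm{AES}_k(n)$ is at most $\delta + (1 - C/2^{128})^{-(C+1)/2} \cdot 8D\lceil L/16\rceil/2^{106}$.
First, distinct messages define distinct polynomials. Suppose that messages $m$ and $m'$ define the same polynomial, so that they share $q$ and $c_1, \ldots, c_q$. Define $\ell$ as the number of bytes in $m$. Recall that $q = \lceil \ell/16\rceil$; thus $\ell$ is between $16q - 15$ and $16q$. The exact value of $\ell$ is determined by $q$ and $c_q$: it is $16q$ if $2^{128} \le c_q$, and otherwise it is $16q - 16 + t$, where $t \in \{1, \ldots, 15\}$ is the integer with $2^{8t} \le c_q < 2^{8t+1}$. Hence $m'$ also has $\ell$ bytes. Now consider any $j \in \{0, 1, \ldots, \ell-1\}$. Write $i = \lfloor j/16\rfloor + 1$; then $16i - 16 \le j \le 16i - 1$, and $m[j] = \lfloor c_i/2^{8(j-16i+16)}\rfloor \bmod 256 = m'[j]$. Hence $m = m'$.
Second, for distinct messages $m, m'$ of at most $L$ bytes and any 16-byte string $g$, at most $8\lceil L/16\rceil$ integers $r \in \{0, 1, \ldots, 2^{130}-6\}$ satisfy $H_r(m) = H_r(m') + g$. Consequently, if $R$ is a set of such integers with $\#R = 2^{106}$, and if $r$ is a uniform random element of $R$, then $H_r(m) = H_r(m') + g$ with probability at most $8\lceil L/16\rceil/2^{106}$.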
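The claim that a message is determined by its chunk integers (the length of each chunk can be read off from the position of the appended 1 byte) can be illustrated with a round trip in Python. Both function names below are hypothetical.

```python
def message_to_chunks(m: bytes) -> list:
    """Chunk integers c_1, ..., c_q: each chunk with a 1 byte appended, little-endian."""
    return [int.from_bytes(m[i:i + 16] + b"\x01", "little")
            for i in range(0, len(m), 16)]

def chunks_to_message(cs: list) -> bytes:
    """Recover the message bytes from c_1, ..., c_q."""
    out = bytearray()
    for c in cs:
        # The appended 1 byte is the top byte of c, so its position gives the chunk length.
        n = (c.bit_length() - 1) // 8
        out += (c - (1 << (8 * n))).to_bytes(n, "little")
    return bytes(out)
```

Because `chunks_to_message` inverts `message_to_chunks` for every message length, two different messages can never yield the same sequence of chunk integers.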
A floating-point implementation
This section explains how to compute $\mathrm{Poly1305}_r(m, \mathrm{AES}_k(n))$, given $(k, r, n, m)$, at very high speeds on common general-purpose CPUs.
These techniques are used by my poly1305aes software library to achieve the Poly1305-AES speeds reported in Section 1. See Appendix A for further speed information. The software itself is available from http://cr.yp.to/mac.html.
The current version of poly1305aes includes separate implementations of Poly1305-AES for the Athlon, the Pentium, the PowerPC, and the UltraSPARC; it also includes a backup C implementation to handle other CPUs. This section focuses on the Athlon for concreteness.
Outline
The overall strategy to compute $\mathrm{Poly1305}_r(m, \mathrm{AES}_k(n))$ is as follows. Start by setting an accumulator $h$ to 0. For each chunk $c$ of the message $m$, first set $h \leftarrow h + c$, and then set $h \leftarrow rh$. Periodically reduce $h$ modulo $2^{130}-5$, not necessarily to the smallest remainder but to something small enough to continue the computation. After all input chunks $c$ are processed, fully reduce $h$ modulo $2^{130}-5$, and add $\mathrm{AES}_k(n)$.
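The outline can be sketched with Python integers standing in for the accumulator. This is only an illustration of the control flow, not the floating-point implementation described in this section. The names `partial_reduce` and `poly1305_horner` are hypothetical, and $\mathrm{AES}_k(n)$ is supplied by the caller as `s`.

```python
P = (1 << 130) - 5
MASK130 = (1 << 130) - 1

def partial_reduce(h: int) -> int:
    """Shrink h using 2^130 = 5 (mod P); the result is congruent to h, not necessarily below P."""
    return (h & MASK130) + 5 * (h >> 130)

def poly1305_horner(r: bytes, m: bytes, s: bytes) -> bytes:
    r_int = int.from_bytes(r, "little")
    h = 0
    for i in range(0, len(m), 16):
        c = int.from_bytes(m[i:i + 16] + b"\x01", "little")
        h = (h + c) * r_int            # h <- h + c, then h <- r h
        h = partial_reduce(h)          # keep h small enough to continue
    h %= P                             # full reduction after the last chunk
    h += int.from_bytes(s, "little")   # add AES_k(n), supplied by the caller
    return (h % (1 << 128)).to_bytes(16, "little")
```

The partial reduction never changes the value of $h$ modulo $2^{130}-5$, so the final result agrees with a direct evaluation of the defining polynomial.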
Large-integer arithmetic in floating-point registers
Represent each of h, c, r as a sum of floating-point numbers, as in [7] . Specifically:
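• As in Section 2, write $r$ as $r_0 + r_1 + r_2 + r_3$ where $r_0 \in \{0, 1, 2, \ldots, 2^{28}-1\}$, $r_1/2^{32} \in \{0, 4, 8, \ldots, 2^{28}-4\}$, $r_2/2^{64} \in \{0, 4, 8, \ldots, 2^{28}-4\}$, and $r_3/2^{96} \in \{0, 4, 8, \ldots, 2^{28}-4\}$.
• Write $h$ as $h_0 + h_1 + h_2 + h_3$ where $h_i$ is a multiple of $2^{32i}$ in the range specified below. Store each $h_i$ in one of the Athlon's floating-point registers.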
Warning: The FreeBSD operating system starts each program by instructing the CPU to round all floating-point mantissas to 53 bits, rather than using the CPU's natural 64-bit precision. Make sure to disable this instruction. Under gcc, for example, the code asm volatile("fldcw %0"::"m"(0x137f)) specifies full 64-bit mantissas.
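To set $h \leftarrow h + c$, set $h_i \leftarrow h_i + d_i$ for $i = 0, 1, 2, 3$, where $c = d_0 + d_1 + d_2 + d_3$ and each $d_i$ is a multiple of $2^{32i}$.
Recall that $2^{130}$ is congruent to 5 modulo $2^{130}-5$.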
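The way the four-piece representation interacts with the congruence $2^{130} \equiv 5$ can be sketched as follows. The sketch uses exact Python integers as a stand-in for floating-point registers, the names `split_limbs` and `mul_r` are hypothetical, and the grouping of products into output pieces is one natural arrangement assumed for illustration. It relies on $r$ satisfying the constraints of Section 2: the cleared low bits of $r_1, r_2, r_3$ are what make every wrapped product an exact multiple of $2^{130}$.

```python
P = (1 << 130) - 5

def split_limbs(x: int) -> list:
    """Write x as x0 + x1 + x2 + x3 with x_i a multiple of 2^(32 i)."""
    mask = (1 << 32) - 1
    return [x & mask,
            x & (mask << 32),
            x & (mask << 64),
            (x >> 96) << 96]            # top piece absorbs all remaining bits

def mul_r(h: list, r: list) -> list:
    """Return pieces of a value congruent to h*r modulo 2^130 - 5."""
    h0, h1, h2, h3 = h
    r0, r1, r2, r3 = r

    def wrap(x: int) -> int:
        # x is a multiple of 2^130 here, so 5 * 2^-130 * x is an exact integer.
        assert x % (1 << 130) == 0
        return 5 * (x >> 130)

    return [r0 * h0 + wrap(r1 * h3) + wrap(r2 * h2) + wrap(r3 * h1),
            r0 * h1 + r1 * h0 + wrap(r2 * h3) + wrap(r3 * h2),
            r0 * h2 + r1 * h1 + r2 * h0 + wrap(r3 * h3),
            r0 * h3 + r1 * h2 + r2 * h1 + r3 * h0]
```

Summing the four returned pieces gives a value congruent to $hr$ modulo $2^{130}-5$, and piece $i$ remains a multiple of $2^{32i}$.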
Output conversion
After the last message chunk is processed, carry one last time, to put $h_0, h_1, h_2, h_3$ into the small ranges listed above.
Add 2 … into the range $\{0, 1, \ldots, 2(2^{130}-5)-1\}$. Perform a few integer add-with-carry operations to convert the accumulator into a series of 32-bit words in the usual form. Subtract $2^{130}-5$, and keep the result if it is nonnegative, being careful to use constant-time operations so that no information is leaked through timing.
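The final "subtract and keep if nonnegative" step can be written without a data-dependent branch by selecting with a mask. The Python sketch below shows only the logic: Python integers are not constant-time, so this is not a timing-safe implementation, and the name `final_reduce` is hypothetical. The input is assumed to lie in $\{0, 1, \ldots, 2(2^{130}-5)-1\}$.

```python
P = (1 << 130) - 5

def final_reduce(h: int) -> int:
    """Map h in {0, ..., 2P - 1} to h mod P by mask selection (precondition not checked)."""
    t = h - P                       # nonnegative exactly when h >= P
    negative = (t >> 131) & 1       # 1 if t < 0 (arithmetic shift), else 0
    mask = -negative                # all ones if t < 0, zero otherwise
    return (h & mask) | (t & ~mask) # keep h when t < 0, otherwise keep t
```

In a fixed-width implementation the same selection would be carried out word by word on the 32-bit pieces of the accumulator.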
Finally, add $\mathrm{AES}_k(n)$. There are two reasons to pay close attention to the AES computation:
• It is extremely difficult to write high-speed constant-time AES software.
Typical AES software leaks key bytes to the simplest conceivable timing attack. See [6] . My new AES implementations go to extensive effort to reduce the AES timing variability.
• The time to compute $\mathrm{AES}_k(n)$ from $(k, n)$ is more than half of the time to compute $\mathrm{Poly1305}_r(m, \mathrm{AES}_k(n))$ for short messages, and remains quite noticeable for longer messages. My new AES implementations are, as far as I know, the fastest available software for computing $\mathrm{AES}_k(n)$ from $(k, n)$. Of course, if there is room in cache, then one can save some time by instead computing $\mathrm{AES}_k(n)$ from $(K, n)$, where $K$ is a pre-expanded version of $k$.
Details of the AES computation are not discussed in this paper but are discussed in the poly1305aes documentation.
Instruction selection and scheduling
Consider an integer (such as $d_0$) between 0 and $2^{32}-1$, stored in the usual way as four bytes. How does one load the integer into a floating-point register, when the Athlon does not have a load-four-byte-unsigned-integer instruction? Here are three possibilities:
• Concatenate the four bytes with (0, 0, 0, 0), and use the Athlon's load-eight-byte-signed-integer instruction. Unfortunately, the four-byte store forces the eight-byte load to wait for dozens of cycles.
• Concatenate the bytes with (0, 0, 56, 67), producing an eight-byte floating-point number. Load that number, and subtract $2^{52} + 2^{51}$ to obtain the desired integer. This well-known trick has the virtue of also allowing the integer to be scaled by (e.g.) $2^{32}$: replace 67 with 69 and $2^{52} + 2^{51}$ with $2^{84} + 2^{83}$. Unfortunately, as above, the four-byte store forces the eight-byte load to wait for dozens of cycles.
• Subtract $2^{31}$ from the integer, use the Athlon's load-four-byte-signed-integer instruction, and add $2^{31}$ to the result. This has smaller latency, but puts more pressure on the floating-point unit.
Top performance requires making the right choice.
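The second and third loading tricks can be reproduced with Python's `struct` module, which reinterprets bytes as an 8-byte IEEE double or as a signed 32-bit integer. This sketch demonstrates only the arithmetic, not the Athlon's timing behavior, and the function names are hypothetical.

```python
import struct

def load_u32_via_double(x: int, scale_bits: int = 0) -> float:
    """Load unsigned 32-bit x (optionally scaled by 2^32) using the fixed-high-bytes trick."""
    if scale_bits == 0:
        high, magic = bytes([0, 0, 56, 67]), 2.0 ** 52 + 2.0 ** 51
    elif scale_bits == 32:
        high, magic = bytes([0, 0, 56, 69]), 2.0 ** 84 + 2.0 ** 83
    else:
        raise ValueError("only scale_bits 0 or 32 are shown in this sketch")
    raw = struct.pack("<I", x) + high        # four integer bytes, then the fixed bytes
    (value,) = struct.unpack("<d", raw)      # reinterpret as an 8-byte double
    return value - magic

def load_u32_via_signed(x: int) -> float:
    """Subtract 2^31, load as a signed 32-bit integer, then add 2^31 back."""
    # Flipping the top bit is the same as subtracting 2^31 modulo 2^32.
    (signed,) = struct.unpack("<i", struct.pack("<I", x ^ 0x80000000))
    return float(signed) + 2.0 ** 31
```

The bytes (0, 0, 56, 67) set the exponent and top mantissa bit so that the double equals $2^{52} + 2^{51} + x$ exactly, which is why a single subtraction recovers $x$.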
(A variant of Poly1305-AES using signed 32-bit integers would save time on the Athlon. On the other hand, it would lose time on typical 64-bit CPUs.) This is merely one example of several low-level issues that can drastically affect speed: instruction selection, instruction scheduling, register assignment, instruction fetching, etc. A "fast" implementation of Poly1305-AES, with just a few typical low-level mistakes, will use twice as many cycles per byte as the software described here.
Other modern CPUs
The same floating-point operations also run at high speed on the Pentium 1, Pentium MMX, Pentium Pro, Pentium II, Pentium III, Pentium 4, Pentium M, Celeron, Duron, et al.
The UltraSPARC, PowerPC, et al. support fast arithmetic on floating-point numbers with 53-bit, rather than 64-bit, mantissas. The simplest way to achieve good performance on these chips is to break a 32-bit number into two 16-bit pieces before multiplying it by another 32-bit number.
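Python floats are 53-bit-mantissa doubles, so the splitting idea can be demonstrated directly: a product of two 32-bit numbers may need 64 bits and get rounded, while a product of a 16-bit piece and a 32-bit number needs at most 48 bits and is exact. The function name `mul32_split` is hypothetical.

```python
def mul32_split(a: int, b: int) -> int:
    """Multiply two unsigned 32-bit integers using only exact 53-bit float products."""
    a_lo = a & 0xFFFF                  # low 16 bits
    a_hi = a - a_lo                    # high 16 bits, still scaled by 2^16
    # Each product has at most 48 significant bits, so a double holds it exactly.
    lo = float(a_lo) * float(b)
    hi = float(a_hi) * float(b)
    return int(lo) + int(hi)           # combine as integers to keep the sum exact
```

For `a = b = 0xFFFFFFFF`, the unsplit product `float(a) * float(b)` is rounded and no longer matches `a * b`, whereas `mul32_split(a, b)` reproduces it exactly.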
As in the case of the Athlon, careful attention to low-level CPU details is necessary for top performance.
Other implementation strategies
Some people, upon hearing that there is a tricky way to use the Athlon's floating-point unit to compute a function quickly, leap to the unjustified conclusion that the same function cannot be computed quickly except on an Athlon. Consider, for example, the incorrect statement "hash-127 needs good hardware support for a fast implementation" in [17, footnote 3].
This section outlines three non-floating-point methods to compute Poly1305-AES, and indicates contexts where the methods are useful.
Integer registers
The 130-bit accumulator in Poly1305-AES can be spread among several integer registers rather than several floating-point registers.
This is good for low-end CPUs that do not support floating-point operations but that still have reasonably fast integer multipliers. It is also good for some high-end CPUs, such as the Athlon 64, that offer faster multiplication through integer registers than through floating-point registers.
Tables
One can make a table of the integers $r, 2r, 4r, 8r, \ldots, 2^{129} r$ modulo $2^{130}-5$, and then multiply any 130-bit integer by $r$ by adding, on average, about 65 elements of the table.
One can reduce the amount of work by using both additions and subtractions, by increasing the table size, and by choosing table entries more carefully. For example, one can include 3r, 24r, 192r, . . . in the table, and then multiply any 130-bit integer by r by adding and subtracting, on average, about 38 elements of the table. This is a special case of an algorithm often credited to Brickell, Gordon, McCurley, Wilson, Lim, and Lee, but actually introduced much earlier by Pippenger in [23] .
One can also balance the table size against the effort in reduction modulo $2^{130}-5$. Consider, for example, the table $r, 2r, 3r, 4r, \ldots, 255r$. Table lookups are often the best approach for tiny CPUs that do not have any fast multiplication operations. Of course, their key agility is poor, and they are susceptible to timing attacks if they are not implemented very carefully.
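The simplest table method, using the entries $r, 2r, 4r, \ldots, 2^{129} r$ modulo $2^{130}-5$, can be sketched in Python as follows. The function names are hypothetical, and the loop branches on the bits of its input, so it illustrates the arithmetic only and is exactly the kind of code that would need care against timing attacks.

```python
P = (1 << 130) - 5

def build_table(r: int) -> list:
    """Table of r, 2r, 4r, ..., 2^129 r, each reduced modulo 2^130 - 5."""
    return [(r << i) % P for i in range(130)]

def mul_by_table(x: int, table: list) -> int:
    """Multiply a 130-bit integer x by r using only table additions."""
    acc = 0
    for i in range(130):
        if (x >> i) & 1:               # add 2^i r when bit i of x is set
            acc += table[i]
    return acc % P
```

A uniformly random 130-bit input has about 65 bits set, which corresponds to the average of about 65 table additions mentioned above.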
Special-purpose circuits
An 1800MHz AMD Duron, costing under $50, can feed 4 gigabits per second of 1500-byte messages through Poly1305-AES with the software discussed in Section 4. Hardware implementations of Poly1305-AES can strip away a great deal of unnecessary cost: the multiplier is only part of the cost of the Duron; furthermore, some of the multiplications are by sparse constants; furthermore, only about 20% of the multiplier area is doing any useful work, since each input is much smaller than 64 bits; furthermore, almost all carries can be deferred until the end of the Poly1305-AES computation, rather than being performed after each multiplication; furthermore, hardware implementations need not, and should not, imitate traditional software structures (one can directly build a fast multiplier modulo $2^{130}-5$, taking advantage of more sophisticated multiplication algorithms than those used in the Duron). Evidently Poly1305-AES can handle next-generation Ethernet speeds at reasonable cost.
