Abstract This paper presents new speed records for multiprecision multiplication on the AVR ATmega family of 8-bit microcontrollers. For example, our software takes only 1,969 cycles for the multiplication of two 160-bit integers; this is more than 15 % faster than that demonstrated in previous work. For 256-bit inputs, our software is not only the first to break through the 6,000-cycle barrier; with only 4,771 cycles it also breaks through the 5,000-cycle barrier and is more than 21 % faster than previous work. We achieve these speed records by carefully optimizing the Karatsuba multiplication technique for AVR ATmega. One might expect that subquadratic-complexity Karatsuba multiplication is only faster than algorithms with quadratic complexity for large inputs. This paper shows that it is in fact faster than fully unrolled product-scanning multiplication already for surprisingly small inputs, starting at 48 bits. Our results thus make Karatsuba multiplication the method of choice for high-performance imple-
Introduction
How much effort is it to multiply two integers? Over the past six decades, many researchers have attempted to answer this question. One main line of research is concerned with the asymptotic complexity of integer multiplication. In the 1950s, Kolmogorov conjectured that the complexity for the multiplication of two n-digit integers is $\Theta(n^2)$. This conjecture was proven wrong by one of his students, Karatsuba, in 1960, who presented a multiplication algorithm with asymptotic complexity $\Theta(n^{\log_2 3})$. This ground-breaking result was published in 1962 [14]. See [15, Section 6] for a history of this publication. In 1963, Toom lowered the complexity further to $n 2^{\Theta(\sqrt{\log_2 n})}$ [27] and the FFT-based method by Schönhage and Strassen achieves asymptotic complexity $\Theta(n \log n \log\log n)$ [21]. The latest result in this line of research is Fürer's algorithm with asymptotic complexity $\Theta(n \log n \, 2^{\log^{*} n})$ [5].
Another main line of research is concerned with the number of clock cycles required for multiplication of two integers of a particular fixed size on a particular micro-architecture. This paper presents new results from this second line of research. Specifically, we present new multiplication speed records for multiplication of integers between 48 and 256 bits on the AVR ATmega family of microcontrollers. We expect that the techniques described in this paper can be extended to larger inputs. However, in this paper we are mainly interested in input sizes in the range between 160 and 256 bits that we typically encounter in elliptic-curve cryptography on embedded processors. The software presented in this paper takes time independent of the inputs and is thus suitable for use in timing-attack-protected implementations of cryptographic primitives.
All previous speed records for multiplication of integers of up to 256 bits on AVR are achieved by algorithms with quadratic complexity. We obtained our speed records by optimizing the Karatsuba multiplication algorithm for the AVR ATmega architecture. We do not claim novelty for any particular technique we used. What is new is the combination of techniques and careful hand-optimization of multiprecision multiplication for AVR ATmega microcontrollers.
Notes on the naming. While working on this paper we observed that the term "schoolbook multiplication" has a different meaning for different people and in different contexts. Sometimes, it only refers to operand-scanning multiplication. Other techniques with quadratic complexity, such as product-scanning multiplication, hybrid multiplication [9] , or operand-caching multiplication [12] are not considered to be "schoolbook". For an example of this naming convention, see [12, Section 3] . However, when distinguishing multiplication algorithms with different asymptotic complexities, "schoolbook multiplication" is often used to refer to any quadratic-complexity algorithm. See, for example, [9, Section 3] . To avoid confusion we avoid the term "schoolbook multiplication" throughout this paper.
It is common to refer to the product-scanning technique as "Comba multiplication", and to give credit to a 1990 paper by Comba [4] . See, for example, [6, 12, 22] . However, the technique has earlier been described (without a claim of novelty) by Barrett in [1, Diagram three] . The method has in fact already been described by Leonardo Pisano (Fibonacci) in his work "Liber Abaci" from 1202; see [25, Chapter 2] . Swetz in [26, Chapter 4] states that the "cross method of multiplication" can be traced back to the Lilavati by Bhāskara from 1150, but we were not able to confirm this in the English translation by Patwardhan, Naimpally, and Singh [20] . We will use the term "product-scanning multiplication".
Related work.
Many results exist on fast multiprecision multiplication on embedded processors, often in the context of modular arithmetic and elliptic-curve cryptography. Some papers also consider the Karatsuba technique for multiplication on embedded processors. For example, Großschädl, Avanzi, Savaş, and Tillich use the Karatsuba technique for fast and energy-efficient multiplication of 512-bit and larger integers on StrongARM [8] ; Gouvêa, Oliveira, and López use Karatsuba for 256-bit multiplication on the MSP430 [6] ; Gouvêa, Oliveira, and López use it for fast 256-bit multiplication on the MSP430X [7] .
On AVR ATmega microcontrollers the state of the art in multiplication of integers of up to 256 bits has consistently been held by algorithms with quadratic complexity. Until 2004 the fastest known algorithm was product-scanning multiplication. For inputs of size larger than 96 bits this changed with the introduction of hybrid multiplication by Gura, Patel, Wander, Eberle and Chang Shantz in [9] . This algorithm was later improved by Scott and Szczechowiak in [22] . The next milestone in optimizing multiprecision multiplication on AVR was the best paper of CHES 2011 by Hutter and Wenger that introduced operand caching multiplication [12] . The results of that paper were slightly improved in two follow-up papers by Seo and Kim, which introduced and optimized consecutive operand caching [23, 24] . These papers mark the current state of the art in multiprecision multiplication on AVR ATmega microcontrollers. Note that the use of hybrid multiplication is covered by a patent [10] ; a patent for operand-caching multiplication is pending [13] .
We are aware of only two papers considering the Karatsuba technique for multiplication of big integers on AVR ATmega. Both papers conclude that Karatsuba multiplication is slower than quadratic-complexity multiplication algorithms for input sizes commonly used in elliptic-curve cryptography. In [18] [...] Both "Hybrid Separated Operand Scanning" (HSOS) and "Hybrid Finely Integrated Product Scanning" (HFIPS) are algorithms with quadratic complexity.
In [11], we used Karatsuba for multiplication of 256-bit integers; however, with 6,686 cycles, that approach turned out to be considerably slower than state-of-the-art operand-caching and consecutive-operand-caching multiplication.
Availability of Software. We placed all software described in this paper into the public domain. We will not apply for any patents for the techniques described in this paper.
Organization of the paper. Section 2 briefly reviews the specifics of the AVR ATmega family of microcontrollers. Section 3 first considers efficient approaches for small multiprecision multiplication, then discusses different approaches for implementing Karatsuba multiplication on AVR ATmega, and finally derives a lower bound on the number of cycles purely from arithmetic instructions. Section 4 describes how we minimize the number of loads and stores in Karatsuba multiplication for different input sizes to translate the lower computational complexity to lower clock-cycle counts. In Sect. 5, we present detailed performance results of our software and compare them with the best results from the literature. We conclude the paper and give ideas for future work in Sect. 6.
The AVR ATmega architecture
This paper optimizes Karatsuba multiplication for the AVR ATmega family of 8-bit microcontrollers. Many of the techniques we describe apply in a similar way on other architectures, but the concrete application of these techniques and the cost analysis are specific to the 8-bit AVR ATmega architecture. This section briefly reviews the specifics of this architecture that are relevant to the remainder of the paper.
Register set. The AVR has 32 registers labeled R0,…, R31. The register pair (R26, R27) is aliased as X, the register pair (R28, R29) is aliased as Y, and the register pair (R30, R31) is aliased as Z. These three register pairs are the only ones that can hold the address argument of a load or store instruction. The register pair (R0, R1) is special because it holds the output of a multiplication instruction (see below).
Memory access.
All load and store instructions on the ATmega take two cycles. The LD load instruction and the ST store instruction access memory at the address specified in their argument (either X, Y, or Z). They can post-increment or pre-decrement their two-register argument for free. The LDD load instruction takes a constant offset to the address register as second argument and so does the STD store instruction.
The standard way to use the stack is to use the instructions PUSH and POP. However, it is also possible to use two IN instructions to copy the stack pointer into one of the address register pairs X, Y, or Z and then operate on the stack with LD/LDD and ST/STD instructions. Writing back the stack pointer takes two OUT instructions.
Arithmetic instructions.
Our software makes use of only relatively few arithmetic instructions. Most important is the MUL instruction, which multiplies the 8-bit unsigned integers in its two register arguments; the 16-bit result is written to the register pair (R0, R1). The MUL instruction takes two cycles and it overwrites the carry flag. Addition (ADD, ADC), subtraction (SUB, SBC), and exclusive or (EOR) are two-operand instructions; one of their inputs is overwritten by the output. For subtraction it is always the minuend that is overwritten. Another helpful instruction is SBCI which performs subtraction with borrow of an immediate value from a register. There is no equivalent "ADCI" instruction to perform addition with carrying of an immediate value. The CLR instruction sets a register to zero; the MOV instruction copies the value in one register to another register. The MOVW instruction copies a register pair to another register pair. Note that two adjacent registers are a register pair only if the lower register is "even" (i.e., R0, R2, . . . ). It is worth noting that MOVW, like all other arithmetic instructions except MUL, takes only one cycle.
Aside from the typical flags (like carry, zero, etc.), the AVR also features a T flag, which can be used to remember a single bit. The BST instruction stores one bit of a given register to the T flag; the BLD instruction loads from the T flag into one bit of a given register. It is possible to perform conditional branches depending on the value of the T flag.
C function-call ABI. The avr-gcc function-call ABI specifies that the first three 16-bit arguments (e.g., pointers) are passed in register pairs (R24, R25), (R22, R23), and (R20, R21). It furthermore specifies that registers R2-R17, R28, and R29 are call-saved registers (a function that uses them has to restore their values before returning) and that register R1 has to be set to zero before returning from a function. Our software follows these conventions to make it directly usable from C code, but as in previous papers we do not include function-call-ABI-related overhead in our cycle counts. See Sect. 5.
Arithmetic considerations
In this section we consider the pure arithmetic cost, i.e., ignoring costs for loads and stores, of Karatsuba multiplication on AVR ATmega. We start with fixing the representation of big integers and reviewing the arithmetic cost of small multiprecision multiplications to establish a baseline.
Representation of big integers. Throughout this paper we will represent big integers in unsigned radix-$2^8$ representation, i.e., an $8m$-bit integer $A$ is represented in $m$ bytes $(a_0, \ldots, a_{m-1})$ with $A = \sum_{i=0}^{m-1} a_i 2^{8i}$ and $a_i \in \{0, \ldots, 255\}$. This big-integer representation is standard for AVR throughout the literature. We do not expect any benefits from using a signed representation or a "carry-save" representation, which leaves some bits on the top of each limb free to accumulate carries.
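The representation can be illustrated with a short Python sketch. The helper names `to_bytes_le` and `from_bytes_le` are hypothetical and only serve this example; they model the little-endian byte vector $(a_0, \ldots, a_{m-1})$, not the memory layout of any particular implementation.

```python
def to_bytes_le(value, m):
    """Split a non-negative integer into m radix-2^8 digits, least significant first."""
    if not 0 <= value < 1 << (8 * m):
        raise ValueError("value does not fit into m bytes")
    return [(value >> (8 * i)) & 0xFF for i in range(m)]


def from_bytes_le(digits):
    """Evaluate A = sum of a_i * 2^(8i) for the digits a_0, ..., a_{m-1}."""
    return sum(a << (8 * i) for i, a in enumerate(digits))


a = to_bytes_le(0x0123456789AB, 6)  # a 48-bit integer, i.e., m = 6
assert a == [0xAB, 0x89, 0x67, 0x45, 0x23, 0x01]
assert from_bytes_le(a) == 0x0123456789AB
```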
Small multiprecision multiplications.
Karatsuba multiplication, like hybrid multiplication, constructs full-size multiprecision multiplication from blocks of smaller multiplications. The block sizes that are most relevant for this paper are 24 × 24 bits, 32 × 32 bits, 40 × 40 bits, and 48 × 48 bits. One obvious way to handle those "small multiprecision" multiplications is to use operand scanning or product scanning. However, this is not optimal as demonstrated in the context of inner-loop optimization for the hybrid multiplication by Lederer, Mader, Koschuch, Großschädl, Szekely, and Tillich in [16] , by Liu, Großschädl, and Kizhvatov in [18] , and most recently by Liu and Großschädl in [17] . 
Table 1 Instruction counts and cycle counts for small multiplications. Corresponding counts of fully unrolled product-scanning multiplication are listed in parentheses.
For 32 × 32-bit multiplication we adapted the technique described in Liu and Großschädl [17, Section 3.1] for multiplication (the original algorithm performs multiply-accumulate). Inspired by this technique we wrote similar routines for 24 × 24-bit, for 40 × 40-bit, and for 48 × 48-bit multiplication. Table 1 lists instruction and cycle counts for those small multiplications; the corresponding code listings are given in Appendix B. Note that our routines are slightly different from the one by Liu and Großschädl in the sense that they can be seen as tweaked operand scanning. We assume that inputs are already loaded to registers and that outputs are also kept in registers. The cost for loads and stores depends on the context in which these multiplication blocks are used.
We do not claim speed records for these small multiplications, although we are not aware of any faster results. We would expect that there exist thoroughly optimized routines for these input sizes that are in the range of standard C data types. We were surprised to find the currently fastest approach somewhat "hidden" as an inner-loop optimization of big-integer hybrid multiplication in a paper on Montgomery modular multiplication.
Note that the optimized small multiprecision multiplications need slightly more live registers than fully unrolled product-scanning multiplications. Whether they are better than product scanning or not depends on the context, i.e., the amount of registers that are available without spilling.
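For reference, the product-scanning baseline can be modeled at the byte level as follows. This is a minimal Python sketch, not the unrolled AVR code: the function name `product_scanning` is hypothetical, and an unbounded Python integer stands in for the small multi-byte accumulator that a register-based implementation would use.

```python
def product_scanning(a, b):
    """Multiply two little-endian byte arrays of equal length, column by column."""
    m = len(a)
    assert len(b) == m
    result = []
    acc = 0  # models the accumulator registers
    for col in range(2 * m - 1):
        lo = max(0, col - m + 1)
        hi = min(col, m - 1)
        for j in range(lo, hi + 1):
            acc += a[j] * b[col - j]  # one 8x8 -> 16-bit product, accumulated
        result.append(acc & 0xFF)     # the column is complete: emit one result byte
        acc >>= 8                     # carry the remainder into the next column
    result.append(acc & 0xFF)         # top byte of the 2m-byte product
    return result


x = [0xAB, 0x89, 0x67]  # 0x6789AB
y = [0x54, 0x76, 0x98]  # 0x987654
assert int.from_bytes(bytes(product_scanning(x, y)), "little") == 0x6789AB * 0x987654
```

Each result byte is final as soon as its column has been processed, which is why unrolled product scanning can overwrite registers that are no longer needed step by step.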
Additive vs. subtractive Karatsuba. From now on we are considering n × n-byte multiplication, where n is even and k = n/2. The typical way to describe Karatsuba multiplication of an n-byte integer A= (a 0 , . . . , a n−1 ) and n-byte integer B= (b 0 , . . . , b n−1 ) is the following:
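In the notation used in the remainder of this section, and assuming the usual split of each operand into a low and a high $k$-byte half (written $A_\ell, A_h$ and $B_\ell, B_h$), the standard additive Karatsuba steps can be written as:

\[
\begin{aligned}
A &= A_\ell + 2^{8k} A_h, \qquad B = B_\ell + 2^{8k} B_h,\\
L &= A_\ell \cdot B_\ell, \qquad H = A_h \cdot B_h, \qquad M = (A_\ell + A_h) \cdot (B_\ell + B_h),\\
A \cdot B &= L + 2^{8k} (M - L - H) + 2^{8n} H.
\end{aligned}
\]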
We will refer to this approach as additive Karatsuba. The problem with this approach is that the additions of two k-byte numbers $A_\ell + A_h$ and $B_\ell + B_h$ produce carry bits. An efficient way to handle multiplications by such a carry bit during the computation of M is to perform a subtraction-with-carry from a zero register to produce a register that is either 0xff (if the carry is one) or zero and then compute multiplication through an AND instruction with this register. Subsequent accumulation of the one-byte result of such a multiplication costs only two addition instructions (one ADD and one ADC) instead of three instructions for two-byte results.
The problem with this approach is twofold: first, the multiplications by carry bits still contribute a significant overhead; second, the tweak to use AND instructions only works for a single carry bit. Recursive application of Karatsuba's technique yields multiple carry bits which have to be handled by full multiplication and accumulation. It turns out to be more efficient to use subtractive Karatsuba:
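With $A_\ell, A_h$ and $B_\ell, B_h$ denoting the low and high $k$-byte halves of the inputs, and $L = A_\ell \cdot B_\ell$, $H = A_h \cdot B_h$ as before, the subtractive variant replaces the middle product and recovers the cross terms from a signed difference. Written in the symbols used in the following paragraphs:

\[
\begin{aligned}
M &= |A_\ell - A_h| \cdot |B_\ell - B_h|,\\
t &= \begin{cases} 0 & \text{if } A_\ell - A_h \text{ and } B_\ell - B_h \text{ have the same sign,}\\ 1 & \text{otherwise,} \end{cases}\\
(-1)^t M &= (A_\ell - A_h)(B_\ell - B_h) = L + H - (A_\ell B_h + A_h B_\ell),\\
A_\ell B_h + A_h B_\ell &= L + H - (-1)^t M.
\end{aligned}
\]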
For a proof of correctness of subtractive Karatsuba, see, for example, [3, Section 1.3.2]. The subtractive variant of Karatsuba avoids the carry bits in the computation of M but instead needs to compute two absolute differences $|A_\ell - A_h|$ and $|B_\ell - B_h|$ and one conditional negation of M. This has to be done in constant time to make the multiplication routine suitable for timing-attack-protected implementations of cryptographic primitives.
Constant-time absolute differences. We compute $|A_\ell - A_h|$ as follows: first perform a subtraction $A_\ell - A_h$, which costs k subtraction instructions. Then we use a subtract-with-carry of a register from itself to obtain a register with the value [...] instructions, and k EOR instructions, adding up to a total of 3k + 1 instructions accounting for 3k + 1 cycles. The computation of $|B_\ell - B_h|$ is done in the same way. We obtain the value of t required for the conditional negation of M as $t = t_A \oplus t_B$, where $t_A$ and $t_B$ are the sign bits of $A_\ell - A_h$ and $B_\ell - B_h$.
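The idea can be modeled in Python as follows. This is a sketch under assumptions rather than the instruction sequence used in our software: the function name `abs_diff_ct` is hypothetical, and the negation is done in the textbook way (exclusive-or with a 0xff/0x00 mask, then adding the sign bit). The point is that the same operations are executed regardless of the sign of the difference.

```python
def abs_diff_ct(x, y):
    """Return (|x - y| as k little-endian bytes, sign bit) without data-dependent branches."""
    k = len(x)
    diff = []
    borrow = 0
    for i in range(k):                 # SUB for the first byte, SBC for the rest
        d = x[i] - y[i] - borrow
        borrow = (d >> 8) & 1          # 1 if the byte subtraction wrapped around
        diff.append(d & 0xFF)
    sign = borrow                      # 1 if x < y
    mask = (-sign) & 0xFF              # like SBC of a register from itself: 0xff or 0x00
    carry = sign                       # "+1" that turns ones' complement into two's complement
    out = []
    for i in range(k):
        v = (diff[i] ^ mask) + carry   # EOR with the mask, then add with carry
        out.append(v & 0xFF)
        carry = v >> 8
    return out, sign


assert abs_diff_ct([0x01, 0x00, 0x00], [0x02, 0x00, 0x00]) == ([0x01, 0x00, 0x00], 1)
assert abs_diff_ct([0x00, 0x01], [0xFF, 0x00]) == ([0x01, 0x00], 0)
```

The returned sign bits of the two differences can then be combined with a single exclusive-or to obtain t.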
Constant-time conditional negation. The most obvious way to compute L + H − (−1) t M, given M, is to use a conditional branch that either adds or subtracts M, depending on the value of t. Note that the EOR instruction which we use to compute t sets the zero flag, which we can then use for the branch condition. On many platforms such a conditional branch would inevitably create a timing leak. The AVR does not have any branch-prediction mechanisms and we can balance the time taken in each of the two branches through NOP instructions to eliminate timing leaks. We implemented this approach and refer to it as the "branched" approach in the following.
There are multiple reasons to avoid branches in cryptographic software. In our port of NaCl to the AVR architecture described in Hutter and Schwabe [11] , we avoid all secret-data-dependent branches primarily because of the fact that reviewing NOP-balanced branches for timing leaks is tedious work and argued that avoiding such branches incurs only small penalties. Furthermore, secret-data-dependent branches are often an easy target for safe-error attacks. See, for example, Yen and Joye [28] who described these attacks. A careful analysis of different multiplication methods from a side-channel point of view is outside the scope of this paper, but we believe that eliminating secret-data-dependent branches is generally a good practice.
An alternative, branch-free way to perform conditional negation is to use the same technique that we used for constant-time absolute differences above (without the initial subtraction). The additions required to convert from the ones' complement to the two's complement can be merged with the additions that are required to combine the partial results; we simply move the bit to the carry flag and replace one ADD instruction by an ADC instruction.
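A byte-level Python model of this merged conditional negation is sketched below. The function name `add_cond_negated` and the flag name `negate` are hypothetical; the sketch only illustrates how the "+1" of the two's complement can enter the addition chain as the initial carry.

```python
def add_cond_negated(acc, m, negate):
    """Return (acc + m) if negate == 0, else (acc - m), modulo 2^(8*len), plus the carry out."""
    mask = (-negate) & 0xFF            # 0xff selects the ones' complement of m
    carry = negate                     # two's-complement "+1" supplied as carry-in
    out = []
    for x, y in zip(acc, m):
        v = x + (y ^ mask) + carry     # addition with carry of the masked byte
        out.append(v & 0xFF)
        carry = v >> 8
    return out, carry


assert add_cond_negated([5, 0], [3, 0], 0) == ([8, 0], 0)
assert add_cond_negated([5, 0], [3, 0], 1) == ([2, 0], 1)
```

In the subtracting case the carry out is the complement of the borrow, so a carry of 1 in the second example means that no borrow occurred.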
We recommend the branch-free approach for applications that handle secret data and the slightly faster branched approach for applications that do not handle secret data, e.g., signature verification.
Refined Karatsuba multiplication. In the last step we have to combine the partial results as $A \cdot B = L + 2^{8k} (L + H - (-1)^t M) + 2^{8n} H$.
Let us consider the case that t = 0; then this computation looks like two n-byte additions and one n-byte subtraction plus rippling a carry bit to the end. However, observe that the byte at position k of the result is obtained as $r_k = \ell_k - m_0 + \ell_0 + h_0$; the byte at position n is obtained as $r_n = h_0 - m_k + \ell_k + h_k$. What looks like four additions and two subtractions can be reduced to three additions and two subtractions by precomputing $s = h_0 + \ell_k$ and then obtaining $r_k = \ell_0 + s - m_0$ and $r_n = h_k + s - m_k$. The same trick applies to $r_{k+1}$ and $r_{n+1}$ and so on and saves a total of k additions. We learned this trick from a Crypto 2009 paper by Bernstein [2, Section 2].
An additional advantage of refined Karatsuba is that we can merge the additions of $h_0 + \ell_k$, $h_1 + \ell_{k+1}$, etc. into the multiplication $H = A_h \cdot B_h$. This is not an advantage from the point of view of purely arithmetic cost, but it simplifies register allocation as explained in Sect. 4. Note that $\bar{H} = H + (\ell_k, \ldots, \ell_{n-1})$ cannot overflow; the result fits into n bytes.
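The refined combination can be checked on whole integers with a short Python sketch. The function name `karatsuba_refined` and the variable names are hypothetical, Python's built-in multiplication stands in for the k-byte multiplication blocks, and the `if` on t is used only for readability (it is not constant time).

```python
def karatsuba_refined(a, b, k):
    """One level of subtractive Karatsuba for two 2k-byte integers, refined combination."""
    base = 1 << (8 * k)
    a_lo, a_hi = a % base, a // base
    b_lo, b_hi = b % base, b // base

    low = a_lo * b_lo                           # L
    high = a_hi * b_hi                          # H
    mid = abs(a_lo - a_hi) * abs(b_lo - b_hi)   # M
    t = (a_lo < a_hi) ^ (b_lo < b_hi)           # 1 if the two differences have opposite signs

    high_bar = high + low // base               # H plus the upper k bytes of L
    assert high_bar < base * base               # still fits into n = 2k bytes
    v = low % base + base * high_bar            # lower k bytes of L, followed by the merged sum
    signed_mid = -mid if t else mid             # (-1)^t * M
    return v + base * v - base * signed_mid     # add v twice (offsets 0 and k), subtract M once


a, b = 0x0123456789AB, 0xFEDCBA987654
assert karatsuba_refined(a, b, 3) == a * b
```

The value `v` is added once at offset 0 and once at offset k, which is where the shared sums $s = h_0 + \ell_k$, etc. come from at the byte level.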
However, there is also a slight disadvantage of merging this accumulation of $(\ell_k, \ldots, \ell_{n-1})$. The carry bit that may result from the accumulation is immediately rippled into $h_k, \ldots, h_{n-1}$. Later we add $(\ell_0, \ldots, \ell_{k-1}, h_k, \ldots, h_{n-1})$ into the result with an offset of k bytes and subtract $(m_0, \ldots, m_{n-1})$ with the same offset. The addition may produce a carry bit c which needs to be rippled to the end; the subtraction may produce a borrow bit b which needs to be rippled to the end. One can also think of this as a carry bit d = b + c which is either 0, 1, or −1. The fact that this carry bit can be negative is a direct consequence of merging the addition of $(\ell_k, \ldots, \ell_{n-1})$ into the multiplication $H = A_h \cdot B_h$ and rippling the resulting carry. The non-merged computation of $(\ell_0, \ldots$
would always produce a non-negative carry, which can simply be rippled to the end.
Merging carries and borrows. If we independently rippled a carry bit $c \in \{0, 1\}$ and a borrow bit $b \in \{-1, 0\}$ to the end of the result, we would essentially lose the arithmetic benefit of refined Karatsuba. What we do instead is to first compute c and then, after subtraction of $(m_0, \ldots, m_{n-1})$, use an SBCI of zero from c to obtain $d \in \{-1, 0, 1\}$ and to set the borrow bit if and only if d = −1. We then perform an SBC of a register f from itself to obtain $f \in \{-1, 0\}$ depending on the value of d. Now the register pair $(d, f)$ takes one of the values $(-1, -1)$, $(0, 0)$, or $(1, 0)$. In the case of the branch-free approach, we first merge c and b into d and then perform a MOV operation of d into f, and apply an ASR instruction afterwards, which arithmetically shifts d to the right, resulting in $f \in \{-1, 0\}$. After that, we can ripple the carry to the end of the result through one addition of d and then subsequent additions-with-carry (ADC instructions) of f.
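The effect of rippling the signed carry can be modeled in Python as follows. This is a sketch under assumptions: the function name `ripple_signed_carry` is hypothetical, and the borrow is passed as a flag in {0, 1} (so d = c − borrow) instead of the value $b \in \{-1, 0\}$ used in the text.

```python
def ripple_signed_carry(top, c, borrow):
    """Add d = c - borrow (in {-1, 0, 1}) to a little-endian byte array, modulo 2^(8*len)."""
    d = (c - borrow) & 0xFF        # one byte; 0xff encodes d = -1
    f = (-(d >> 7)) & 0xFF         # sign extension of d: 0xff if d = -1, else 0x00
    out = []
    carry = 0
    for i, byte in enumerate(top):
        v = byte + (d if i == 0 else f) + carry   # one addition of d, then additions-with-carry of f
        out.append(v & 0xFF)
        carry = v >> 8
    return out


assert ripple_signed_carry([0x00, 0x01], 0, 1) == [0xFF, 0x00]   # 256 - 1
assert ripple_signed_carry([0xFF, 0x00], 1, 0) == [0x00, 0x01]   # 255 + 1
assert ripple_signed_carry([0x12, 0x34], 1, 1) == [0x12, 0x34]   # carry and borrow cancel
```

A single pass over the remaining bytes handles all three cases, which is what preserves the saving of refined Karatsuba.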
Putting it together.
The overall arithmetic cost of (branched) Karatsuba multiplication on AVR is thus composed of the following parts:
• one CLR instruction to produce a zero register;
• the cost of computing L (multiplication of two k-byte integers);
• the cost of computing M (multiplication of two k-byte integers);
• the cost of computing $\bar{H} = H + (\ell_k, \ldots, \ell_{n-1})$ (essentially the cost of a k-byte integer multiplication and k addition instructions);
• 2k + 2 SUB/SBC instructions, 2k EOR instructions, 2 NEG instructions, and 2k ADD/ADC instructions to compute two absolute differences $|A_\ell - A_h|$ and $|B_\ell - B_h|$;
• n + 1 ADD/ADC instructions to add $(\ell_0, \ldots, \ell_{k-1}, h_k, \ldots, h_{n-1})$ to the result and to remember the carry bit;
• one EOR instruction to compute t and to set the zero flag if t = 0;
• one BRNE instruction;
• if the branch is not taken (1 cycle for BRNE):
  - n + 2 SUB/SBC instructions to subtract M and to produce the carry register pair (d, f);
  - one RJMP instruction (2 cycles);
• if the branch is taken (2 cycles for BRNE):
  - n + 1 ADD/ADC instructions and one CLR instruction to add M and to produce the carry register pair (d, f);
  - one NOP instruction;
• k ADD/ADC instructions to ripple the carry in (d, f) to the end.
In the example of multiplying two 48-bit integers (i.e., k = 3, see also Appendix A) the computation of L and M costs 40 cycles each (cf. Table 1; the cost is slightly lower because we can replace some CLR instructions by MOVW from a zero register pair; this becomes more efficient for multiple multiplications). The computation of $\bar{H} = H + (\ell_3, \ldots, \ell_5)$ costs 44 cycles. Overall, we obtain a cost of 167 cycles from arithmetic (and branch) instructions. This is 20 cycles faster than fully unrolled product-scanning multiplication and 5 cycles faster than our optimized 48-bit multiplication. Note that the overhead from loads and stores in this case is the same for all three approaches: 12 loads of input bytes and 12 stores of outputs; 48-bit Karatsuba multiplication does not need any spills as detailed in Sect. 4. The five-cycle gain is small and probably of merely theoretic interest (in particular because Karatsuba multiplication requires more registers), but the gain becomes larger for bigger inputs.
Efficient scheduling for Karatsuba multiplication
As shown in the previous section, Karatsuba multiplication needs fewer arithmetic instructions than, e.g., fully unrolled product scanning already for very small input sizes. However, it is yet unclear how this arithmetic cost translates to an overall cost including the cost for loads and stores. This section explains our strategies to make efficient use of the available registers and the specifics of the AVR instruction set to keep the overhead from load and store instructions low.
These strategies consist of two levels of optimizations. First, we use carefully tuned instruction scheduling that minimizes the number of live registers throughout the whole Karatsuba multiplication. Second, we use various techniques to avoid costly loads and stores for the cases where not enough registers are available despite smart scheduling. Some of these techniques slightly increase the number of arithmetic instructions; the total number of cycles required for Karatsuba multiplication can thus not be obtained by adding the lower bound on arithmetic instructions derived in Sect. 3 to the memory-access overhead explained here. The complete cycle counts for multiprecision multiplication on AVR are reported in Sect. 5. All instruction counts in this section refer to the branched variant of our software.
One level of Karatsuba
For multiplications with input sizes of 48, 64, 80, and 96 bits we use 1 level of Karatsuba. Our approach to scheduling the computations for 1-level Karatsuba multiplication with effects on register usage is detailed in Algorithm 1. Note that the number of registers stated in this algorithm is ignoring some registers, specifically,
• a zero register required to accumulate carries,
• registers to hold the borrows from the subtractions in Step 5,
• registers R0 and R1 which hold the result of multiplication instructions,
• accumulation registers in the multiplications in Steps 2, 6, and 7,
• two registers required to ripple the carry or borrow to the end in Step 11.
Even with these additional registers taken into account, the refined Karatsuba multiplication routines for 48-bit inputs and 64-bit inputs do not require any load and store instructions beyond loading inputs once and storing the result once. What is crucial to make this possible for the 64-bit input case is the computation of $\bar{H}$, i.e., that we accumulate $(\ell_k, \ldots, \ell_{n-1})$ on the fly during the multiplication $A_h \cdot B_h$. This is possible because we use refined Karatsuba; without this approach the 6k registers would increase to 7k registers and all input sizes starting from 64 bits would need significantly more load and store instructions. For 80- and 96-bit multiplications we cannot entirely avoid memory access beyond loading inputs and storing the result. In the following we describe the techniques we use to keep the overhead from these additional loads and stores as small as possible. We avoid such spills as much as possible by re-loading values that had to be stored anyway as part of the result stores. Specifically, after storing $\ell_0, \ldots, \ell_{k-1}$ in Step 3, we can "forget" the values in the corresponding registers and reload these values again, when they are needed in Step 8. This only costs k load instructions and no additional stores, and reduces the maximal amount of required registers from 6k + 2 to 5k + 2.
Algorithm 1 Scheduling and register use for n × n 1-level Karatsuba multiplication (notation: k = n/2).

Minimize accumulation registers. The multiplications in Steps 2, 6, and 7 need registers to accumulate the result coefficients. For the multiplication $A_\ell \cdot B_\ell$ in Step 2 this is no problem, because the result does not overwrite any of the inputs and simply occupies n "fresh" registers. The optimized versions of small multiprecision multiplications described in Sect. 3 need two additional registers, but this is also not a problem in Step 2. The situation is different in Steps 6 and 7. When using unrolled product scanning, the result coefficients of the multiplication in Step 6 can overwrite $\ell_k, \ldots, \ell_{n-1}$ with the low half of result coefficients and one of the inputs with the high half of the result coefficients. The multiplication in Step 7 cannot overwrite any registers for the low half of the result (this is why it temporarily needs additional k accumulation registers), but can overwrite input coefficients with the high half of the result. Overwriting registers that are no longer needed with result coefficients step-by-step is not possible to the same extent with the optimized small multiprecision multiplications. We, therefore, often use fully unrolled product scanning instead of the optimized multiplication variants in Steps 6 and 7 to reduce the number of live register variables.
Using the T flag. An AVR-specific optimization is to make use of the T flag. Specifically, the bit $t = t_A \oplus t_B$, which decides whether we need to add or subtract M, does not need to occupy a register. Instead, we can use a BST instruction to store this bit in the T flag and later use a BRTS instruction to branch depending on the value of this bit. The branch-free variant of our software needs to use a BLD instruction to copy this bit back to a register. This is still cheaper than a PUSH and a POP, because writing and reading the T flag costs only 1 cycle each.
Two levels of Karatsuba
For input sizes of 128, 160, and 192 bits we use two levels of Karatsuba recursion. That means that we use the 1-level Karatsuba multiplication routines described above as building blocks. The general strategy to perform 2-level Karatsuba multiplication is similar to the 1-level Karatsuba multiplication.
Algorithm 2 Scheduling and register use for n × n 2-level Karatsuba multiplication (notation: k = n/2).
Input: $A = (a_0, \ldots, a_{n-1})$ and $B = (b_0, \ldots, b_{n-1})$, pointers to inputs in register pairs , , pointer to output in register pair
Address-pointer handling. For 160- and 192-bit Karatsuba, the input-address pointers have to be spilled to the stack in each 1-level Karatsuba multiplication and they have to be restored from the stack afterwards. This spilling of X and Y is only required for the computation of L and H; after the computation of M the input addresses are not needed anymore. Spilling would typically require a total of 8 PUSH and 8 POP instructions (i.e., 32 cycles); these are 4 PUSH and 4 POP instructions for each of the two 1-level Karatsuba multiplications. To improve the efficiency, we initially store the address pointers on the stack and load them twice afterwards using two IN instructions, four LDD instructions, and one MOVW instruction. The LDD instructions load the pointer to A from stack into X and the pointer to B into two temporary registers. The MOVW instruction finally copies the pointer from the temporary registers to Y. This saves five cycles in total (needing 4 PUSH instructions, 2 IN instructions, 4 LDD instructions, 1 MOVW instruction, and 4 POP instructions).
We further decided to push X and Y right after the loading of A h in Step 4 in Algorithm 1. There are two reasons for pushing the addresses at this point. First, the X registers already point to the next input address needed in the computation of H , so no additional update of the pointer is needed, e.g., using the ADIW instruction, which would be needed if we pushed the pointer right before the computation of L. Second, after pushing the address on the stack, the register X can be efficiently re-used for storing the input operands of B h . This avoids additional spilling of registers.
When mixing LDD with PUSH and POP, the stack pointer needs to be corrected again at the end of the computation. This can be done by one ADIW instruction and two OUT instructions.
On-the-fly accumulation. As in 1-level Karatsuba, and essentially thanks to refined Karatsuba, we perform an on-the-fly accumulation of (k, ..., n−1) during the multiplication of H = A_h · B_h in 2-level Karatsuba. Applying this optimization, however, is not as straightforward as in 1-level Karatsuba, because H itself is computed using 1-level Karatsuba. This makes the accumulation and especially the handling of carry bits more complex.
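At the level of integer arithmetic, the effect of this accumulation can be illustrated with a small Python sketch. It is only a model under stated assumptions: it works on Python integers rather than registers, it does not capture the carry handling or instruction scheduling discussed here, and the function name refined_karatsuba_1level is hypothetical. It adds the upper half of L to H before the final assembly and uses the refined identity A·B = (1 + 2^(8k))·(L + 2^(8k)·H) − 2^(8k)·M with the signed middle product M = (A_l − A_h)·(B_l − B_h).

```python
def refined_karatsuba_1level(a, b, n_bytes):
    """One level of subtractive (refined) Karatsuba on n_bytes-byte integers.

    Integer-level model only: the upper half of L is accumulated into H,
    mirroring the idea of the on-the-fly accumulation.
    """
    k = n_bytes // 2
    shift = 8 * k
    mask = (1 << shift) - 1

    a_l, a_h = a & mask, a >> shift
    b_l, b_h = b & mask, b >> shift

    L = a_l * b_l
    H = a_h * b_h

    # Middle product from absolute differences, with the sign tracked separately.
    m = abs(a_l - a_h) * abs(b_l - b_h)
    M = -m if (a_l < a_h) != (b_l < b_h) else m

    # Accumulate the upper k bytes of L into H.
    acc = H + (L >> shift)
    t = (L & mask) + (acc << shift)  # equals L + 2^(8k) * H

    # A*B = (1 + 2^(8k)) * t - 2^(8k) * M
    return t + (t << shift) - (M << shift)


x = 0x0123456789ABCDEF0123456789ABCDEF
y = (1 << 128) - 1
print(refined_karatsuba_1level(x, y, 16) == x * y)  # True
```

The sketch makes visible why the accumulation is attractive: once the upper half of L has been folded into H, the sum L + 2^(8k)·H is available without a separate addition pass.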
The main idea to avoid the propagation of carry bits from the accumulation of (k, ..., n−1) into the multiplication of A_h · B_h in 2-level Karatsuba is to split the accumulation into two parts of size k/2 each. The first part ( k , . . . , 1 
Three levels of Karatsuba
We implemented the 256-bit multiplication using three levels of Karatsuba. Due to the high register usage of the 1-level and 2-level Karatsuba blocks, there is almost no room to hold and re-use values in registers. Thus, we store all results obtained from the 2-level Karatsuba multiplications in memory and load the values again at the end of the computation of M. All absolute differences are likewise pushed to the stack and popped again during the final 2-level Karatsuba multiplication. The results obtained for M are also pushed to the stack and popped again at the end of the multiplication.
In total, the 256-bit multiplication needs the following memory instructions: 432 LD/LDD instructions, 196 ST/STD instructions, 82 PUSH instructions, 130 POP instructions, 8 IN instructions, and 32 OUT instructions.
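The recursive structure behind the 1-, 2-, and 3-level routines can be summarized in a short Python sketch. It is an arithmetic model only, assuming subtractive Karatsuba with absolute differences at every level; it does not reflect register allocation, memory traffic, or cycle counts, and the names karatsuba and base_multiplications are illustrative.

```python
def karatsuba(a, b, n_bytes, levels):
    """Multiply two non-negative n_bytes-byte integers using `levels`
    levels of subtractive Karatsuba recursion (arithmetic model only)."""
    if levels == 0:
        # Stands in for a fully unrolled base multiplication.
        return a * b
    assert n_bytes % 2 == 0, "operand length must split evenly"
    k = n_bytes // 2
    shift = 8 * k
    mask = (1 << shift) - 1

    a_l, a_h = a & mask, a >> shift
    b_l, b_h = b & mask, b >> shift

    L = karatsuba(a_l, b_l, k, levels - 1)
    H = karatsuba(a_h, b_h, k, levels - 1)
    # Middle product of the absolute differences; the sign is fixed up below.
    M = karatsuba(abs(a_l - a_h), abs(b_l - b_h), k, levels - 1)
    if (a_l < a_h) != (b_l < b_h):
        M = -M

    return L + ((L + H - M) << shift) + (H << (2 * shift))


def base_multiplications(levels):
    """Number of base multiplications: each level triples the count."""
    return 3 ** levels


x = (1 << 256) - 1
print(karatsuba(x, x, 32, 3) == x * x)  # True
print(base_multiplications(3))          # 27
```

In this model, a 256-bit multiplication with three levels is reduced to 27 multiplications of 32-bit operands, compared with 64 such multiplications for a schoolbook approach on the same 32-bit blocks.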
Results
This section reports cycle counts, code size, and stack usage for the software presented in this paper. All cycle counts were obtained through simulation in Atmel AVR Studio version 5.0.1223. All multiplication routines passed tests on 1,000 random inputs and passed a test with all input bytes set to 255. These tests were performed on an ATmega2560 (Arduino MEGA development board).
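The test procedure can be mirrored on a host machine with a few lines of Python. The following sketch is not the test harness used for the measurements; it assumes a routine operating on little-endian byte lists, and as a stand-in for the routine under test it uses a simple product-scanning (column-wise) multiplication. All function names are hypothetical.

```python
import random


def product_scanning(a, b):
    """Column-wise multiplication of two equal-length little-endian byte lists."""
    n = len(a)
    result = []
    acc = 0
    for col in range(2 * n - 1):
        # Add all partial products a[i] * b[j] with i + j == col.
        for i in range(max(0, col - n + 1), min(col, n - 1) + 1):
            acc += a[i] * b[col - i]
        result.append(acc & 0xFF)
        acc >>= 8
    result.append(acc & 0xFF)  # final carry byte
    return result


def to_bytes_le(x, n):
    return list(x.to_bytes(n, "little"))


def from_bytes_le(byte_list):
    return int.from_bytes(bytes(byte_list), "little")


def check(n_bytes, trials=1000, seed=0):
    """Compare against Python's built-in multiplication on random inputs
    and on the input with all bytes set to 255."""
    rng = random.Random(seed)
    bits = 8 * n_bytes
    cases = [(rng.getrandbits(bits), rng.getrandbits(bits)) for _ in range(trials)]
    all_ff = (1 << bits) - 1
    cases.append((all_ff, all_ff))
    for x, y in cases:
        got = from_bytes_le(
            product_scanning(to_bytes_le(x, n_bytes), to_bytes_le(y, n_bytes))
        )
        if got != x * y:
            return False
    return True


print(check(6))  # True
```

The all-255 input is a useful corner case because it produces the largest possible partial products and therefore the largest carries in every column.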
Like previous papers, we report cycle counts, code sizes, and stack usage excluding the function-call cost, i.e., the cost for CALL, RET, initial PUSH and final POP of caller registers, 3 MOVW instructions required to copy the function arguments to the X, Y, Z registers, and the cost to clear register R1 before returning from the function.
It is important to note that for small input sizes, product scanning does not use all available registers and can avoid some of the PUSH and POP instructions for caller registers. A function that only multiplies, e.g., two 48-bit integers and follows the C function-call ABI for AVR will thus be faster when using product scanning than when using our Karatsuba multiplication. However, the 48-bit Karatsuba multiplication will be faster if it is used in a larger (inlined) context. See also Sect. 4.
A summary of our results, together with the best previous results from the literature, is presented in Table 2. All implementations listed in this table focus on speed and are fully unrolled. For input sizes from 48 to 96 bits we are not aware of any results from the literature achieving better speeds than fully unrolled product-scanning multiplication. For those input sizes we include a comparison with fully unrolled product scanning. For 48-bit inputs this is not optimal, as demonstrated by our optimized multiplication routine (see Sect. 3 and Appendix B). We believe that also for 64-bit, 80-bit, and 96-bit inputs, careful optimization of quadratic-complexity multiplication can gain a few cycles compared to fully unrolled product scanning. However, we do not expect those gains to be larger than what we gain using subquadratic-complexity Karatsuba multiplication.
The software described in [12] is available through an online code generator at http://mhutter.org/research/avr#mulopcache. The cycle counts of the software generated by this online tool are slightly lower than the ones reported in the paper. We compare to the improved cycle counts. From the authors of [24], we received 128-bit and 160-bit consecutive-operand-caching multiplication routines, which are slightly faster than the numbers listed in their paper. We also compare to the improved cycle counts.
Conclusion and future work
In this paper we presented new speed records for multiplication of integers from 48 bits up to 256 bits on AVR ATmega. We showed that carefully optimized Karatsuba multiplication is more efficient than quadratic-complexity multiplication already for much smaller input sizes than previously believed.
The most obvious future work is to apply the multiplication routines described in this paper to elliptic-curve cryptography. For example, Liu, Seo, Großschädl, and Kim in [19] use consecutive operand-caching multiplication to push the performance boundaries for arithmetic on the NIST-P192 curve. It will be straightforward to push the boundaries even further by replacing consecutive operand-caching multiplication by our Karatsuba multiplication routines.
Furthermore, this paper focuses on the speed of multiplication routines without considering the size of the implementation. It will be interesting to investigate tradeoffs between speed and size for Karatsuba multiplication on AVR, for example, by implementing the small multiprecision multiplications at the bottom of the recursion only once and using jumps or calls to this routine. Another direction of future research is to examine whether the Karatsuba technique can also speed up squaring on AVR. Finally, we hope that the techniques described in this paper will serve as an inspiration to re-examine possible performance gains from Karatsuba multiplication for relatively small inputs on other embedded platforms.
Listing 5 Optimized multiplication of two 48-bit integers, input A in registers R2, R3, R4, R5, R6, R7; input B in registers R8, R9, R10, R11, R12, R13; result in registers R14, R15, R16, R17, R18, R19, R20, R21, R22, R23, R24, R25.
CLR R20        ; R20 = 0
CLR R21        ; R21 = 0
MOVW R22, R20  ; R23:R22 = R21:R20 = 0
MOVW R24, R20  ; R25:R24 = R21:R20 = 0
