Abstract-When mapping public-key algorithms, such as RSA, onto constrained devices, both efficiency and flexibility are a challenge. Because word lengths are large, at least 1024 bits, a dedicated co-processor is typically used. On the other hand, flexibility is required, because designers want to support a variety of RSA exponentiation algorithms. The typical solution is then a hardware/software (HW/SW) co-design platform. In this paper we have chosen this approach: we use an 8051 micro-controller for flexibility and a Montgomery multiplier for efficiency. However, the importance of the interface between HW and SW is often neglected. The main focus of this paper is therefore to propose an interface that maximizes both flexibility and efficiency. We use this interface to compare six different exponentiation variants of RSA with and without side-channel attack countermeasures.
I. INTRODUCTION
Public key cryptography (PKC) offers important functionality to cryptographic systems. A drawback is that implementations of PKC are usually demanding in terms of computational power. Hence, it is a challenge to implement them on constrained devices like smartcards. Pure software implementations of public key cryptographic systems exist [1], [2]. However, they are still slower than a hardware version. A common solution to enable public key cryptography on constrained devices is to add a small co-processor. The attached hardware has the disadvantage that it cannot be updated later and thus lacks flexibility. However, carefully distributing tasks between hardware and software can lead to a flexible and fast overall design.
Being able to change the system later on is important for many reasons, e.g., to increase the key size in response to the growing computational power of an attacker or to apply countermeasures to side-channel attacks. This paper focuses on flexibility directed to side-channel resistance. It is certainly possible to change the design to also be flexible in terms of key size, though we aim for a minimal approach to keep the design simple and small. We set our key length to 1024 bits as this size offers easy comparison with existing implementations.
Popular public key cryptographic systems are RSA [3] and ECC [4]. RSA is currently used more than ECC, though this might change in the future due to the attractively small key sizes of ECC. In this paper we compare different algorithms for the RSA system. Many attacks and countermeasures for RSA implementations have been published and it remains an active field of research. Many of these algorithms make changes only to the high-level protocol of RSA, while the underlying long number arithmetic remains untouched. This makes RSA a good candidate for a comparison of multiple algorithms in which a small co-processor can be reused for different algorithms.
As mentioned before, the design shall be flexible such that it allows the co-processor to be reused for various RSA algorithms. This requires that the high-level protocol part of RSA, i.e., the exponentiation strategy, remains in software. The chosen RSA exponentiation algorithms differ in the complexity of their control flow. For the described comparison we aim for a HW/SW co-design in which a more complex control flow does not result in a large overhead. In order to achieve this degree of flexibility the design needs to match certain criteria.
• A low-overhead interface between hardware and software. The limited bandwidth of the interface is likely to become a bottleneck if RSA is carried out partly in software and partly in hardware.
• The interface shall also allow hardware and software to run asynchronously whenever data dependencies allow. This allows more complex actions in the exponentiation algorithm while the hardware is busy with multiplications. Idle times of the co-processor need to be avoided for all algorithms if possible.
In this paper we contribute three things:
• As first contribution, we compare RSA algorithms in terms of execution time and their resistance to side-channel attacks. We chose several classic RSA algorithms as well as more advanced variants, which are resistant to SPA, DPA, or fault attacks. Because the co-design approach is a common strategy in the area of constrained devices, the authors are convinced that the generated data is meaningful and that the earlier defined design criteria allow for a fair comparison.
• The second contribution of this paper is the design of the architecture which fulfills the above mentioned main criteria. For the sake of completeness we compare our design to existing designs. Note that the Montgomery multiplier of this work can be exchanged to meet certain speed or area restrictions. The Montgomery multiplier of this work has not been optimized as it would not affect the comparison of the different RSA algorithms.
• As third contribution, since we use Montgomery multiplication for all exponentiations, we adapted the RSA algorithm presented in [5] for usage with Montgomery multiplication and contribute a Montgomery version of this algorithm. See section II and appendix A for details.
Even though HW/SW co-design is a common approach for constrained devices there are not many published results for RSA. None of the published co-designs takes side-channel countermeasures on algorithm level into account or implements more than one algorithm for RSA.
A design which also targets the 8051 micro-controller is the co-design implementation by Sakiyama et al. [6] from 2007. Contrary to our approach, Sakiyama et al. optimize the co-processor to find the best trade-off between area and size for the given interface. The earliest RSA co-design reported comes from Šimka et al. [7] in 2003. They use a 16-bit Nios embedded processor from Altera. The paper presents measurement results for a co-design with a variable number of 16-bit processing units applying carry propagating adders.
The remainder of this paper is organized as follows: in the next section we give an overview of the selected RSA algorithms which we implement. Subsequently, we describe in detail our design approach and settings. In section IV, we present an overview of the performance of the algorithms and their comparison to other implementations. Finally, we conclude in section V.
II. ALGORITHMS
In the following we describe the algorithms we have implemented. Note that we do not test or verify the authors' assumptions regarding side-channel resistance of the different algorithms.
All algorithms we implement use Montgomery's algorithm (see algorithm 1) to perform modular multiplication. It has the advantage of not requiring a costly inversion for the reduction. As a drawback it requires a domain change at the beginning and at the end of the exponentiation. For simple exponentiation the overhead caused by the domain changes is negligible, but for more complex variations it can be significant, as discussed in section IV.
Algorithm 1 Montgomery multiplication
Input: a, b, n
Output: c = a · b · R^{-1} mod n with R = 2^k
1-4: [...]
5: A ← (A + n) ≫ 1
6: else
7: A ← A ≫ 1
8: end if
9: end for
10: if A ≥ n then
11: [...]
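The surviving steps of algorithm 1 match the common bit-serial form of Montgomery multiplication. The following Python sketch shows that textbook form under the assumption that it corresponds to the listing; the names `mont_mul`, `to_mont`, `from_mont` and the parameter `k` (operand length in bits, so that R = 2^k) are illustrative and not taken from the paper.

```python
def mont_mul(a, b, n, k):
    """Bit-serial Montgomery product a*b*2^(-k) mod n.

    Assumes n is odd and 0 <= a, b < n < 2^k.
    """
    acc = 0
    for i in range(k):
        if (a >> i) & 1:          # add b when bit i of a is set
            acc += b
        if acc & 1:               # make acc even by adding the odd modulus
            acc = (acc + n) >> 1
        else:
            acc >>= 1
    if acc >= n:                  # single final correction, acc < 2n holds here
        acc -= n
    return acc


def to_mont(x, n, k):
    """Domain change into the Montgomery domain: x*R mod n (one multiplication)."""
    r2 = pow(2, 2 * k, n)         # R^2 mod n, assumed to be precomputed
    return mont_mul(x, r2, n, k)


def from_mont(x_bar, n, k):
    """Domain change back to the normal domain: x_bar*R^(-1) mod n."""
    return mont_mul(x_bar, 1, n, k)
```

Each domain change costs one full multiplication in this sketch, which is the overhead the text refers to.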
Algorithm 2 Chinese remainder theorem
Input: m, p, q, n = p · q, dp = d mod (p − 1), dq = d mod (q − 1), Acrt = q^{-1} mod p
The binary left to right (L2R) and the binary right to left (R2L) algorithms are straightforward approaches for exponentiation. We include the L2R variant in our portfolio for comparison only. We refer the interested reader to [8] for further details.
The Montgomery powering ladder (MPL) [9] is an exponentiation especially adapted to the Montgomery multiplication. It offers resistance against simple power analysis (SPA) [10] but remains vulnerable to differential power analysis (DPA) [10].
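As an illustration, the ladder can be sketched in Python as follows. This is the commonly published form of the Montgomery powering ladder, not necessarily the exact variant implemented in this work; `mont_mul`, `mpl_exp` and `k` are hypothetical names. Every exponent bit triggers exactly one multiplication and one squaring, independent of the bit value.

```python
def mont_mul(a, b, n, k):
    """Bit-serial Montgomery product a*b*2^(-k) mod n (n odd, a, b < n < 2^k)."""
    acc = 0
    for i in range(k):
        if (a >> i) & 1:
            acc += b
        acc = (acc + n) >> 1 if acc & 1 else acc >> 1
    return acc - n if acc >= n else acc


def mpl_exp(m, d, n, k):
    """Compute m^d mod n with the Montgomery powering ladder."""
    r2 = pow(2, 2 * k, n)               # R^2 mod n, assumed precomputed
    r0 = mont_mul(1, r2, n, k)          # 1 in the Montgomery domain
    r1 = mont_mul(m % n, r2, n, k)      # m in the Montgomery domain
    for i in reversed(range(d.bit_length())):
        if (d >> i) & 1:
            r0 = mont_mul(r0, r1, n, k)
            r1 = mont_mul(r1, r1, n, k)
        else:
            r1 = mont_mul(r0, r1, n, k)
            r0 = mont_mul(r0, r0, n, k)
    return mont_mul(r0, 1, n, k)        # leave the Montgomery domain
```

The loop keeps the invariant that `r1` equals `r0` times the message, which is why both registers are updated in every iteration.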
Applying the Chinese remainder theorem (CRT) to RSA is a common speedup technique. Instead of one exponentiation it performs two exponentiations with half the parameter size. The results are then recombined according to Garner [11] (see algorithm 2). This technique is a high-level optimization which can be applied to many exponentiation algorithms, e.g., left to right (referred to as L2R-CRT). Exponentiation with operands of half the size gives a speedup of a factor of four, but as two exponentiations are required the overall speedup is a factor of two. As a further improvement, the two exponentiations can be carried out in parallel, thus allowing an overall theoretical performance gain of a factor of four. Our implementation benefits from the smaller operand size, but not from parallel execution. This choice allows the original data path to be reused and keeps the design small. Hence, CRT results in a theoretical speedup of a factor of two in our design. Of course there will be an overhead due to additional control flow.
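A minimal Python sketch of the CRT exponentiation with Garner's recombination is given below. It uses Python's built-in `pow` for the two half-size exponentiations instead of the co-processor, and assumes the usual definitions dp = d mod (p − 1), dq = d mod (q − 1) and Acrt = q^{-1} mod p; the function name `rsa_crt` is illustrative.

```python
def rsa_crt(m, p, q, dp, dq, a_crt):
    """Compute m^d mod (p*q) from two half-size exponentiations.

    dp = d mod (p-1), dq = d mod (q-1), a_crt = q^(-1) mod p.
    """
    sp = pow(m % p, dp, p)            # exponentiation modulo p
    sq = pow(m % q, dq, q)            # exponentiation modulo q
    h = (a_crt * (sp - sq)) % p       # Garner: correction term modulo p
    return sq + q * h                 # result lies in [0, p*q)
```

The recombination needs only one half-size multiplication, one full-size multiplication and a few additions or subtractions, which is the extra arithmetic besides the two exponentiations.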
The blinded fault resistant exponentiation (BFR) by Boscher et al. is a more advanced RSA variant, which offers resistance to SPA, DPA and fault attacks, see algorithm 3. For a detailed discussion of the countermeasures we refer the reader to [5]. The algorithm can be combined with CRT to improve performance, see algorithm 4. Both BFR variants require a 1024-bit random number and its modular inverse. We generate both numbers as suggested by the authors with our Montgomery multiplier from a small random number (this is included in our profiling). We assume for a real world scenario that the embedded platform has a source for random numbers. This could be another co-processor. We implement both variants, with and without CRT. One issue we would like to address is the additional domain changes besides the domain change at the beginning and at the end of an exponentiation. Especially the BFR-CRT algorithm makes extensive use of different domains when used with an underlying Montgomery multiplier. This must be addressed to avoid additional overhead. We present a Montgomery version of the BFR-CRT RSA algorithm in the appendix. For the rest of this paper we refer to the Montgomery optimized version as the BFR-CRT algorithm.
Algorithms 3 and 4 (recoverable fragments)
Input: m, dp, dq, Acrt
6: [...]
7: Return(r^{-1} · S1 mod n)
8: else
9: Return("Error")
10: end if
An overview of all algorithms summarizing their side-channel attack resistance along with their complexity in terms of the required number of multiplications for a 1024-bit exponentiation is provided in section IV, table I.
III. DESIGN
In this section, we first describe the platform on which we implement. Thereafter, we discuss the trade-off between flexibility, size, and performance for the interface. Subsequently, we present our design regarding hardware, software, interface, and memory management.
A. Specifications
The 8051 microcontroller [12] is widely used, low cost, and small, and thus a good choice for our co-design. The microcontroller usually runs at 12 MHz, offers 4 ports for data exchange, has 128 bytes of RAM, and supports up to 64 KB of external RAM. Data can be shared with dedicated hardware via the ports and via the external RAM. For transmitting a single byte, the ports are faster than the shared RAM. However, a handshake is required for each byte, because co-processor and microcontroller run at different clock speeds. The external memory can be shared partly or completely with the co-processor. Once the hardware knows the address for the memory access, consecutive bytes can be read or written. The shared memory is our preferred way to exchange data and also instructions.
We implement the software in C and Assembly. The small device C compiler (SDCC) [13] is used to compile and link the program. The software simulation is performed using the MCU8051IDE [14] . Our hardware is described in the GEZEL [15] language. Finally, the HW/SW co-design is simulated with the gplatform tool, which is part of the GEZEL tool chain.
B. Interface Trade-offs
Regarding size, adders, registers, and multiplexers are the main contributors to the size of the co-processor due to the long-number operands of RSA (1024 bits in our case). Thus, the co-processor used in this work uses the minimum number of registers while the design is kept flexible and fast. Recall that our design, regarding the interface, prioritizes flexibility over speed and size. As mentioned before, flexibility shall be achieved in a way that different RSA algorithms can be implemented without changing the hardware. The most time consuming operation in RSA is the exponentiation, which is in all the chosen RSA algorithms a sequence of modular long number multiplications. Thus, a general purpose modular multiplier is a good choice for a co-processor. Specifically, we use a Montgomery [16] based multiplier. This kind of multiplier has been implemented many times for different goals regarding size and speed and is not an optimization priority in this work.
Besides the Montgomery multiplication the co-processor is able to perform addition and subtraction as the Montgomery multiplication is based on these operations. The latter are made available to the software via the interface at the additional cost of multiplexers in hardware. The different RSA algorithms can now invoke the co-processor to perform Montgomery multiplication, addition, and subtraction as required for the different strategies of carrying out the exponentiation. Especially the more advanced algorithms benefit from the separately available addition and subtraction.
C. Implementation
Implementation and optimization of hardware, software and the interface is done with respect to the defined design criteria.
1) Interface and control flow between hardware and software: As explained before, our interface shares data and commands via the shared RAM, as shown in figure 1. We achieve asynchronous, parallel execution via a first in first out (FIFO) instruction pipeline in the shared memory. Some commands, e.g., load, are executed very fast by the co-processor, whereas others, e.g., multiplication, take much longer. This variation in time also holds for the software carrying out the high-level protocol. By queuing up commands in a pipeline, idle times can be avoided. The pipeline consists of several consecutive bytes in the external memory. The protocol only needs to make sure that the hardware waits once it has executed the latest instruction. In addition, a reset command is required in case the end of the pipeline is reached. The pipeline's work flow is as follows. The software fills the pipeline with commands, each one byte in size. Instructions are represented by non-zero values; in other words, a zero byte represents a wait instruction. The hardware reads the pipeline continuously and executes an instruction as soon as it appears in the pipeline. The hardware then continuously reads the next element of the pipeline, and so on. This way no handshakes are required even if hardware and software run at different frequencies. Exceptions occur when the end of the pipeline is reached or when the algorithm requires conditional execution based on an intermediate value. In these cases the software can reset the pipeline and synchronize execution with a special instruction. However, the reset and synchronize instruction is rarely required. Limiting the instruction size to one byte improves the throughput of instructions and helps to overcome the interface as a bottleneck. See fig. 2(a) for a list of instructions. Load and store are the only instructions that can access the shared memory. They consist of a 2-bit opcode and two 2-bit codes to select any of the registers in the co-processor. Further, a code defines which pointer shall be used to address the memory. For flexibility reasons, the arithmetic instructions can operate with any possible operand combination. Jumpback refers to the reset and synchronize instruction. SetL is used to change the mode of operation of the hardware to half the operand size, as required for the CRT optimization.
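The pipeline protocol can be modeled in a few lines of Python. The sketch below is a behavioral model only: the opcode value `JUMPBACK = 0xFF`, the class `Pipeline`, and the assumption that the hardware clears a slot after consuming it are all hypothetical choices made for illustration, since the text does not specify them. Software-side waiting is modeled by stepping the hardware until it idles.

```python
WAIT = 0x00      # a zero byte means "nothing to execute yet"
JUMPBACK = 0xFF  # hypothetical opcode value for reset-and-synchronize


class Pipeline:
    """Behavioral model of the one-byte instruction FIFO in shared memory."""

    def __init__(self, size):
        self.mem = bytearray(size)   # all slots start as WAIT
        self.sw = 0                  # next slot the software writes
        self.hw = 0                  # next slot the hardware reads
        self.executed = []           # log of commands the hardware ran

    def hw_step(self):
        """Hardware side: poll the current slot, execute if non-zero."""
        cmd = self.mem[self.hw]
        if cmd == WAIT:
            return False             # keep polling the same slot
        self.mem[self.hw] = WAIT     # assumption: consumed slots are cleared
        if cmd == JUMPBACK:
            self.hw = 0              # restart at the first slot
        else:
            self.executed.append(cmd)
            self.hw += 1
        return True

    def push(self, cmd):
        """Software side: queue a one-byte command without a handshake."""
        if not 0 < cmd < JUMPBACK:
            raise ValueError("command must be a non-zero byte below JUMPBACK")
        if self.sw == len(self.mem) - 1:
            # End of pipeline: the last slot is reserved for JUMPBACK.
            self.mem[self.sw] = JUMPBACK
            while self.hw_step():    # synchronize: wait until hardware idles
                pass
            self.sw = 0
        self.mem[self.sw] = cmd
        self.sw += 1
```

In this model the software only blocks when it reaches the end of the buffer, which mirrors the statement that the reset and synchronize instruction is rarely required.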
2) Pointers and data flow between software and hardware: Data is shared via the external memory, hence the hardware needs to know where to find it. As can be seen from the input lists of the algorithms, several precalculated values are required. Also, intermediate results occasionally need to be stored in the external memory. Limiting instructions to one byte makes it difficult to pass the corresponding pointers to the hardware, as a pointer to external memory is two bytes long in the 8051. Writing pointers into the pipeline would also cause a second problem. Hardware and software run asynchronously and in parallel, thus a pointer must not be changed once it is written to the pipeline. As a consequence, software designers would need to apply memory copies or force synchronizations of hardware and software frequently. We overcome these issues by implementing a pointer buffer. The pointer buffer is a block in the shared memory which can hold 16 pointers, see figure 2(b). A synchronization between hardware and software is thus only required after the 16 different pointers have been used and one or more pointers have to be overwritten in order to access new variables. The 16 different locations of the buffer can be encoded with four bits within an instruction. The starting location of the pointer buffer is shared with the co-processor during the initialization, just like the pipeline.
3) Software: Recall that the higher-level parts of the algorithms are carried out in software. We implement these in C. Smaller parts of lower-level functionality are, for the sake of size and complexity, not carried out by the co-processor. These are tasks like finding the first non-zero byte in the exponent, deciding whether or not an operand is zero, or determining whether or not the next bit of the exponent is a one. Some of these routines can be solved especially easily and efficiently by applying inline assembly. An example is determining whether the next bit is zero or one. This task is rather cumbersome in C, whereas rotating a byte through the carry is a single assembly instruction and gives direct access to the next bit. Profiling shows that currently the software scans an exponent and finishes filling the pipeline three times faster than the hardware can execute instructions. However, if a faster hardware multiplier were applied there is certainly room for further optimizations in software.
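To make the pointer buffer concrete, the following Python sketch packs a memory-access instruction into one byte and resolves its 4-bit pointer index through a 16-entry buffer. The bit layout (2-bit opcode, 2-bit register code, 4-bit pointer index), the opcode values, the little-endian pointer storage and all function names are assumptions for illustration; the actual encoding is the one listed in figure 2(a).

```python
import struct

# Hypothetical opcode values; both are non-zero in the top bits so that an
# encoded instruction can never be the zero byte that means "wait".
OP_LOAD, OP_STORE = 0b01, 0b10


def encode_mem_instr(opcode, reg, ptr_idx):
    """Pack opcode (2 bits), register (2 bits), pointer index (4 bits)."""
    if opcode not in (OP_LOAD, OP_STORE) or not 0 <= reg < 4 or not 0 <= ptr_idx < 16:
        raise ValueError("field out of range")
    return (opcode << 6) | (reg << 4) | ptr_idx


def decode_mem_instr(byte):
    """Unpack one instruction byte into (opcode, register, pointer index)."""
    return byte >> 6, (byte >> 4) & 0x3, byte & 0xF


def set_pointer(xram, buf_base, idx, addr):
    """Software side: store a 16-bit external-memory address in slot idx."""
    struct.pack_into("<H", xram, buf_base + 2 * idx, addr)


def get_pointer(xram, buf_base, idx):
    """Hardware side: fetch the address that slot idx refers to."""
    return struct.unpack_from("<H", xram, buf_base + 2 * idx)[0]
```

Because the instruction carries only the index, the software can prepare up to 16 operand addresses in advance and never has to place a two-byte pointer in the pipeline itself.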
4) Hardware: We have chosen to implement 1024-bit RSA, but our co-processor can easily be extended to support 2048-bit or bigger key sizes. The hardware of the co-processor is organized in five blocks, the interface, the long integer carry-save adder (CSA), the Montgomery multiplier, and a block to recombine the sum and carry variables of the CSA. The recombination block is a word-serial carry-ripple adder working with 32-bit words. The modular addition/subtraction is implemented by reusing the serial adder used for recombination of the CSA.
Only one long integer adder is used. The adder has been implemented in a carry-save fashion. The parallel structure of the adder allows the sum of two large integers to be calculated within one clock cycle. The adder accepts an initial carry such that it can also be used for two's complement subtractions. The multiplier calls the adder repeatedly in order to perform a multiplication according to algorithm 1. The final block is the controller which interfaces with the software. The controller reads and decodes the instructions and initiates the operations of the other blocks.
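The interplay of the carry-save adder and the word-serial recombination can be illustrated with a small Python model. This is a generic sketch of the two techniques, not a description of the actual data path; the function names and the 64-bit width used for the subtraction example are assumptions.

```python
def csa(x, y, z):
    """Carry-save step: compress three operands into a sum and a carry word.

    No carry propagates between bit positions, so all bits can be
    computed in parallel; s + c == x + y + z always holds.
    """
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c


def ripple_add(s, c, carry_in=0, word=32):
    """Word-serial carry-ripple addition of s and c, one word per step."""
    mask = (1 << word) - 1
    result, carry, shift = 0, carry_in, 0
    while (s >> shift) or (c >> shift) or carry:
        t = ((s >> shift) & mask) + ((c >> shift) & mask) + carry
        result |= (t & mask) << shift
        carry = t >> word
        shift += word
    return result


def sub_twos_complement(x, y, width=64):
    """x - y mod 2^width via inverted operand and an initial carry of one."""
    mask = (1 << width) - 1
    return ripple_add(x, ~y & mask, carry_in=1) & mask
```

The carry-save step is what makes the long addition fit into one clock cycle, while the slower ripple recombination is only needed when a result in ordinary binary form is required.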
The architecture of the co-processor is illustrated in figure 3 . The arithmetic components are about the same size as the components used for interfacing with the software, both about 13700 LUTs. 
IV. RESULTS
Flexibility is difficult to measure, especially as there are no other reported implementations with the goal of flexibility regarding the RSA exponentiation. We compare the algorithms with each other with special regard to their overhead due to data transfer and control flow over the interface. Further, we compare the complete RSA runtimes with other RSA implementations in hardware, software and HW/SW co-design. However, the overall runtime and size also depend on the Montgomery multiplier, which is not a primary optimization target of this work.
A. Overhead analysis
In figure 4, we show the performance of the different RSA algorithms. The cycle counts are simulated by the gplatform simulator [15]. Figure 4 consists of two plots. The bottom plot shows the overall executed clock cycles. As the overhead is rather small in relation to the multiplications, we additionally plot the bars excluding the cycles for multiplications (top plot) to enhance visibility. We separate each bar into the actions Multiplication, Domain Change, Handshake, and Interface. This partitioning is done by profiling each of these actions, except the interface. All profiled actions are then multiplied by their number of occurrences during the whole algorithm. The number of occurrences is derived from the C code by simply counting the calls. The interface overhead is defined as the remainder of the total cycles. It mainly consists of idle time of the hardware and of data transfers. Fig. 4 shows that the straightforward algorithms L2R and R2L have a negligible overhead as they only require a minimum of control flow and data transfer. As can be expected, the CRT enhanced implementations spend less time multiplying, while spending more time in other arithmetic and interfacing for the recombination. The overall execution time nevertheless benefits greatly from this. The more complex versions, like the blinded fault resistant exponentiation, perform more steps besides exponentiations to fulfill their security claims. Especially the extra domain changes for the Montgomery multiplications and the extra required synchronizations cause overhead. The BFR-CRT exponentiation mixes operands in four different domains, i.e., the p-domain, the q-domain, the n-domain and the normal domain. Transferring one 128-byte operand between HW and SW (128 cycles) is more than 8 times faster than one domain change of a 128-byte operand (1044 or 2088 cycles). Hence, domain changes deserve higher priority than load/store operations during optimization.
We take advantage of the fact that, when the Montgomery multiplier is given one value in a certain domain and the other value in the normal domain, the result is in the normal domain. We provide a Montgomery version of the blinded fault resistant exponentiation, which is optimized for domain changes, in Appendix A. The BFR-CRT offers the highest security level, but also causes the biggest overhead. Still, the performance of the BFR-CRT algorithm is close to the MPL, whereas the MPL only offers resistance to simple power analysis and the BFR-CRT additionally offers resistance to differential power analysis and fault attacks. The overhead of the BFR-CRT results in a 34% higher execution time compared to the straightforward approaches.
B. RSA comparison
In table I, we summarize overall speed along with the side-channel properties. The timings are simulated using the GEZEL cycle-accurate RTL simulation tool. As can be expected, the L2R and the R2L exponentiation have almost identical runtime. While L2R and R2L are square and multiply strategies, the Montgomery ladder follows a square and always multiply strategy and thus requires more time for multiplications. Hence, the Montgomery powering ladder runs around 25% slower but has the advantage of being SPA resistant. CRT is commonly used for speedup and applying it to the R2L algorithm does not give additional security but enhances performance by 42%. The BFR exponentiation also follows a square and always multiply approach. In fig. 4 this can be seen clearly as the time spent on multiplications is similar to the MPL. The BFR algorithm is resistant to many attacks (SPA, DPA, FA) but is also the slowest algorithm. However, applying CRT to it gives a speedup of 38% and results in similar performance to the L2R and R2L algorithms. Compared to the L2R-CRT exponentiation, the BFR-CRT exponentiation shows a bigger overhead.
C. Comparison with existing implementations
Comparison is difficult due to the various different platforms. We selected implementations from 8-bit platforms for software implementations and FPGAs for hardware and codesign implementations. Many papers measure performance results in time. As the underlying micro-controller or FPGA influences the performance greatly the authors are convinced that clock cycles are a better measure for comparison. However, for co-design and hardware implementations a long data path can limit the maximum frequency. Therefore we also include runtime and frequency in the performance table. Where necessary we calculate the clock cycles from time and frequency or vice versa. Note that a smartcard will most likely not run at high frequency and we assume that the 12 MHz of the 8051 are a good reference. Hence, especially the pure hardware implementations are less suitable for comparison. We will subsequently compare software implementations, co-design implementations, and hardware implementations to the result of this work following table II.
The implementation of Gura et al. [19] is among the fastest published software implementations. Further, the authors provide results for hardware acceleration by a tiny instruction set extension (ISE). The ISE gives a speedup of 38%. However, our hardware acceleration is still 32 times faster than the ISE. Liu et al. [20] present a state of the art software implementation with further speed improvements over Gura et al. In addition, the authors provide a version which applies countermeasures against SPA. However, it cannot fulfill the demand for DPA resistance and fault attack resistance. The fastest result of this work without countermeasures requires a factor of 40 fewer cycles. Compared to their SPA resistant version, our BFR-CRT implementation is a factor of 25 faster and offers higher security against side-channel attacks.
The co-design implementation of Sakiyama et al. [6] is suited for a direct comparison as it also uses the 8051 microcontroller with a co-processor for acceleration. The exponentiation strategy is a binary left to right exponentiation. It can be seen from the table that our solution with this algorithm is slightly faster even though one of our registers remains unused for this algorithm. Hence the CRT algorithm shows a performance advantage of around 25%. Comparing with BFR-CRT we still achieve similar speed as [6] but apply countermeasures to side-channel and fault attacks. The co-design of Šimka et al. [7] implements different levels of parallelism. We have chosen the fastest result and the best time area product for comparison. The fastest result of this work is a factor of two faster than the fastest result of [7] and a factor of 8 faster than the best time area product even though we do not apply parallelism. The more secure BFR-CRT algorithm is slightly faster than the fastest version and a factor of 3.7 faster than the best time area product.
Table I: Performance overview for one 1024-bit exponentiation
The hardware implementation of Wang et al. [18] aims for high performance, especially high frequency. Their paper underlines the influence of hardware on execution time and also area by synthesizing their implementation for multiple different FPGAs. As expected, the clock cycle counts barely differ between the platforms. Comparing the cycle count, we achieve similar speed with the CRT implementation even though we carry out the high-level part of the protocol in software.
Mentens et al. [17] present a state of the art Montgomery multiplier for FPGAs. It is a pipelined architecture applying the 8-bit multiplication units supplied by the FPGA. Two exponentiation strategies are implemented for their multiplier, a straightforward L2R and the faster k-ary method. For an equal exponentiation strategy their work is a factor of 6 faster regarding the cycle count.
V. CONCLUSION
We presented a flexible architecture with a low-overhead interface to accelerate algorithms based on modular long number multiplication. It provides efficient control and data flow between hardware and software and allows both parts to perform their tasks in parallel. Due to these properties we made a fair comparison of RSA algorithms with different countermeasures to side-channel attacks. The comparison demonstrates that algorithms with additional security suffer in performance but can compete with straightforward approaches. Especially the BFR-CRT offers a very high security level with only 34% overhead compared to the straightforward approach. It shows similar performance to the widely used Montgomery powering ladder, which offers only SPA resistance. Existing devices that apply a HW Montgomery multiplier with a flexible interface can be upgraded to a higher security level if the Montgomery multiplier provides the required number of registers. E.g., Montgomery ladder algorithms use 3 registers and can thus be changed to a BFR variant or, if smaller word sizes are supported by the multiplier, to a BFR-CRT variant. The latter variant would come with only a small performance loss.
APPENDIX
The blinded fault resistant exponentiation with CRT frequently changes domains. As the commonly used Montgomery multiplication would suffer from this, a version of the BFR-CRT exponentiation strategy has been created which is optimized for domain changes. Implementation results can be found in section IV.
We refer to values in the n-domain with a bar, values in the p-domain with a tilde, and values in the q-domain with a hat. A single domain change requires a full multiplication. Further, a domain change can also be achieved by multiplying values which are in different domains. For example, multiplying a value in the n-domain with a value in the normal domain results in a value in the normal domain.
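The mixed-domain property follows directly from the definition of the Montgomery product: with one operand x̄ = x · R mod n and the other operand y in the normal domain, the multiplier returns x · R · y · R^{-1} = x · y mod n. The Python sketch below checks this with an illustrative bit-serial multiplier; `mont_mul` and `k` (with R = 2^k) are hypothetical names.

```python
def mont_mul(a, b, n, k):
    """Bit-serial Montgomery product a*b*2^(-k) mod n (n odd, a, b < n < 2^k)."""
    acc = 0
    for i in range(k):
        if (a >> i) & 1:
            acc += b
        acc = (acc + n) >> 1 if acc & 1 else acc >> 1
    return acc - n if acc >= n else acc


def mixed_product(x_bar, y, n, k):
    """Multiply an n-domain value with a normal-domain value.

    The single factor R carried by x_bar cancels against the R^(-1)
    introduced by the multiplier, so the result is x*y mod n in the
    normal domain and no separate domain change is needed.
    """
    return mont_mul(x_bar, y, n, k)
```

One multiplication therefore does the work of a multiplication plus a domain change, which is what an algorithm optimized for domain changes can exploit.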
Algorithm 5 BFR-CRT for Montgomery multiplication
Input: q, A; n-domain: r̄inv; p-domain: m̃, r̃, r̃inv; q-domain: m̂, r̂, r̂inv
Output: m^d mod n or "Error"
