Abstract-The energy cost of asymmetric cryptography, a vital component of modern secure communications, inhibits its widespread adoption within ultra-low energy regimes such as Implantable Medical Devices (IMDs), Wireless Sensor Networks (WSNs), and Radio Frequency Identification tags (RFIDs). Consequently, a gamut of hardware/software acceleration techniques exists to alleviate this energy burden. In this paper, we explore this design space, estimating the energy consumption for three levels of acceleration across the commercial security spectrum. First, we examine an efficient baseline architecture centered around a pipelined RISC processor. We then add simple yet beneficial instruction set extensions to our microarchitecture and evaluate the improvement in energy per operation compared to the baseline. Finally, we introduce a novel, dedicated accelerator to our microarchitecture and measure the energy per operation against the baseline and the ISA extensions. For ISA extensions, we show a 1.28 to 1.41 factor improvement in energy efficiency over the baseline, while for full acceleration we demonstrate a 4.36 to 6.45 factor improvement.
I. INTRODUCTION
Encryption in the ultra-low energy domain is an important and growing challenge. Example applications include Implantable Medical Devices (IMDs) [1], [2], Wireless Sensor Networks (WSNs) [3], and Radio Frequency Identification (RFID) tags [4], [5]. In this class of applications, the energy cost of each operation is paramount to the device's utility. For example, in a typical IMD, each extra Joule expended in computation reduces the life of the device, and each surgical replacement of the device endangers the life of the patient. Security is also of critical importance in this application. Reexamining the case of IMDs, unauthorized access to an implanted cardiac defibrillator's programming interface poses an unambiguous threat to the patient's health and privacy. Despite the obvious need for security in this domain, relatively few designs have incorporated encryption; among these, most employ symmetric (shared-key) encryption techniques [1]. More secure schemes for communication exist that involve asymmetric cryptography. However, the high computational cost of asymmetric cryptography has put these schemes out of reach for ultra-low energy applications. In this paper we examine microarchitectures to facilitate the use of asymmetric cryptography for ultra-low energy applications, dramatically reducing the energy required per key exchange operation, with the goal of placing asymmetric cryptography within the energy and power envelopes of IMDs, WSNs, and RFIDs.
978-1-4799-3606-9/14/$31.00 ©2014 IEEE
Asymmetric cryptography, also known as public key cryptography, has become an essential component in modern, secure communications. Unlike its symmetric counterpart, asymmetric cryptography requires separate keys for encryption and decryption, allowing it to solve a host of security challenges not possible with symmetric cryptography alone. Uses for asymmetric cryptography range from session key establishment for secure communications to digital signatures for message authenticity and non-repudiation. While symmetric cryptography is based on data shifts and permutations, asymmetric cryptography is built upon a foundation of mathematically hard problems. As a result, the computational requirements for asymmetric cryptography are far greater than those of symmetric cryptography [6].
Employing asymmetric cryptographic capabilities on ultra-low energy devices can be especially challenging [3]-[5]. Wander et al. found that, in the WSN domain, even weak asymmetric cryptography (160-bit ECC, equivalent to 1024-bit RSA) consumes approximately 72% of the energy allotted for communication handshaking. Moreover, they assume that only 5% to 10% of a WSN's energy budget is available for handshakes [7]. For RFID tags, it is difficult to quantify the energy budget for encryption; however, because most tags are passive energy harvesters, the budget is significantly less than that of a WSN node. Implantable Medical Devices (IMDs) pose additional and potentially life-threatening security challenges. Like WSN nodes, IMDs are battery powered devices, which must securely communicate with a basestation [2]. Public-key encryption would enable the use of traditional SSL handshaking between the IMD and the basestation but would cause a considerable degradation in battery life.
In this ultra-low energy domain, a spectrum of hardware/software acceleration techniques exists, in which reconfigurability is traded for increased energy efficiency as the amount of hardware acceleration is increased. Figure 1 depicts this trade-off with compiled software executing on a power-conscious processor on one side and a fully dedicated cryptographic processor on the other. The more interesting research lies in the middle, where some degree of reconfigurability is maintained while the energy consumed per operation is much less than that of pure software implementations. This area is precisely the portion of the spectrum this paper attempts to capture.
In this paper we first present, for comparison, a baseline architecture, representing the left-most side of Figure 1, consisting of a low-power RISC processor and a minimal memory layout, typical of an embedded microcontroller. This baseline design is evaluated in terms of energy per cryptographic operation. The cryptographic operation used for evaluation is a signature followed by a verification, as defined in the Elliptic Curve Digital Signature Algorithm (ECDSA) [8], [9]. Moving to the right within Figure 1, some simple yet effective instruction set extensions are added to our baseline architecture and evaluated in terms of energy per operation. Finally, a microcoded accelerator, designed for finite field arithmetic, is added to the architecture and similarly evaluated. The contributions of this paper are as follows:
• Detailed power, energy and performance analysis of ultra-low energy asymmetric cryptography for three levels of hardware acceleration within the same technology node, using the same experimental techniques.
• Design space exploration across a range of Elliptic Curve Cryptography (ECC) key-sizes that includes 384-bit, providing insight into current and future secure data exchange for embedded systems.
• Development of a novel, microcoded, state-machine-based, finite field arithmetic acceleration unit which maintains reconfigurability (via microcode programming) while dramatically decreasing the energy per digital signature.
II. BACKGROUND
In this section we refresh the reader's understanding of the relevant background topics for this study. We first discuss the mathematics that underpin all asymmetric cryptosystems. We then review energy and power in CMOS devices.
A. Underlying Mathematics
Asymmetric cryptography is based on the one-way function, a mathematical function that has a computationally feasible forward operation and a reverse operation presumed to be computationally infeasible. One-way functions for asymmetric cryptography are generally built using finite field operations as building blocks. Modular arithmetic, including modular addition, subtraction, inversion, and multiplication, constitutes computation over a class of finite fields referred to as prime fields and denoted here as GF(p). It should be noted that prime fields are especially important and commonly used for asymmetric cryptography, though other fields may be used as well.
First generation public-key cryptosystems such as RSA, Diffie-Hellman, and the Digital Signature Algorithm (DSA) utilize modular exponentiation (i.e., $y = g^x \bmod p$) as the one-way function [9]-[11]. The brute-force method for computing modular exponentiation is to multiply g by itself x times, but far more efficient techniques exist, such as the suite of repeated square-and-multiply algorithms. Each square or multiply in modular exponentiation is an operation performed over a finite field. Assuming a 4096-bit RSA algorithm, on the order of 1.5 * 4096 field operations, each of size 4096 bits, must be performed for each modular exponentiation. The reverse operation, computing x given y, g, and p, is referred to as the Discrete Logarithm Problem (DLP) and is considered intractable as the size of the modulus increases. Methods considerably more efficient than brute-force, however, have been found to compute the DLP [6]. Thus, very large integer keys must be used to ensure security with traditional public-key cryptosystems based upon modular exponentiation.
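The square-and-multiply idea mentioned above can be sketched as follows. This is a minimal illustration, not the implementation evaluated in this paper; the function name `mod_exp` is hypothetical, and Python's arbitrary-precision integers stand in for the multi-precision field arithmetic.

```python
def mod_exp(g: int, x: int, p: int) -> int:
    """Left-to-right binary square-and-multiply: returns g**x mod p."""
    if x < 0 or p <= 0:
        raise ValueError("expects x >= 0 and p > 0")
    result = 1 % p
    for bit in bin(x)[2:]:                  # scan exponent bits, most significant first
        result = (result * result) % p      # one modular squaring per exponent bit
        if bit == "1":
            result = (result * g) % p       # one modular multiply per set bit
    return result
```

For an n-bit exponent with roughly half of its bits set, the loop performs n squarings and about n/2 multiplications, which is where the estimate of about 1.5 * n field operations comes from.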
The successor to modular exponentiation-based schemes is Elliptic Curve Cryptography (ECC), which utilizes scalar point multiplication over elliptic curves [12]. Scalar point multiplication involves repeated addition-and-doubling of points on an elliptic curve defined over a finite field and can serve as a replacement for modular exponentiation in many cryptographic protocols. As with modular exponentiation, the reverse operation, known as the Elliptic Curve Discrete Logarithm Problem (ECDLP), is considered intractable.
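To make the addition-and-doubling structure concrete, the sketch below performs double-and-add scalar point multiplication on a toy curve. Everything here is an assumption chosen for illustration: the curve y^2 = x^3 + 2x + 3 over GF(97), the affine coordinate formulas, and the function names. The field is far too small for real use, and the code requires Python 3.8 or later for modular inversion via `pow`.

```python
P_MOD, A, B = 97, 2, 3   # toy curve y^2 = x^3 + A*x + B over GF(97), illustration only

def on_curve(pt):
    """True if pt is the point at infinity (None) or satisfies the curve equation."""
    if pt is None:
        return True
    x, y = pt
    return (y * y - (x * x * x + A * x + B)) % P_MOD == 0

def point_add(p1, p2):
    """Affine point addition; None represents the point at infinity."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None                                   # P + (-P) = infinity
    if p1 == p2:                                      # doubling: tangent slope
        lam = (3 * x1 * x1 + A) * pow((2 * y1) % P_MOD, -1, P_MOD) % P_MOD
    else:                                             # addition: chord slope
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)

def scalar_mul(k, pt):
    """Left-to-right double-and-add: returns k*pt for k >= 0."""
    result = None
    for bit in bin(k)[2:]:
        result = point_add(result, result)            # double for every scalar bit
        if bit == "1":
            result = point_add(result, pt)            # add for every set bit
    return result
```

Each call to `point_add` hides several field multiplications and one inversion, which is why counting field operations for ECC is less direct than for modular exponentiation.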
Determining the number of finite field operations for ECC is not as straightforward as it is for RSA because each ECC add or double encompasses potentially dozens of finite field operations. Given the same key size, there is an order of magnitude more field operations for a typical ECC scalar point multiplication compared to an RSA modular exponentiation, but the advantage of elliptic curves over modular exponentiation for asymmetric cryptography is that the ECDLP is considered to be computationally harder than the DLP. Consequently, the size of integers used for ECC is much smaller than that of modular exponentiation-based schemes of equivalent security. For this reason, ECC is substantially more energy efficient than modular exponentiation schemes for the same level of security and is the only asymmetric cryptosystem evaluated in this study [7], [13]. Given existing computational capabilities, integer computation in the range of 192 bits to 384 bits maintains adequate security for ECC. To provide similar levels of security, RSA would need 1024-bit to 15360-bit computations [14].
B. Energy Consumption in Digital Circuits
Understanding how energy is consumed in CMOS logic is key to creating energy efficient designs. The general equation for energy is given by

$$\mathit{Energy} = \mathit{Power} \times \Delta\mathit{Time} \qquad (1)$$

such that Power is the average computation power, and ΔTime is the time per operation. While ΔTime is dependent upon the computation time, Power is dependent upon the CMOS implementation and usage.
CMOS circuits dissipate power in three different ways. First, there is static power dissipation, which can be described by the formula below:

$$P_{static} = V \cdot I_{leak} \qquad (2)$$

where $V$ is the source voltage and $I_{leak}$ is the source-to-drain current when the transistor is turned off, referred to as leakage current. The second type of power dissipation is switching power, given by the following formula:

$$P_{switch} = \alpha \cdot C \cdot V^{2} \cdot f \qquad (3)$$

$C$ is the capacitance the transistors must drive and is made up of wire and gate capacitance. The clock frequency, $f$, and the switching activity factor, $\alpha$, capture the rate at which the transistors switch. The third component of power is short circuit power and is given by the following formula:

$$P_{sc} = V \cdot I_{sc} \qquad (4)$$

$I_{sc}$ is the short circuit current which exists between the N-type and P-type transistors during a logic state transition [15].
In computing, we can reduce energy per operation by either reducing the power consumed in the computation logic or by reducing the amount of time required per operation. Often, a small increase in power can be traded for a significant reduction in execution time such that there is an overall benefit in energy conservation [16] . Conversely, an increase in execution time might be traded for a significant reduction in power as seen with Dynamic Voltage Frequency Scaling (DVFS) [17] .
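The trade-off described above can be illustrated with a small calculation. The power and time figures below are hypothetical values chosen only to show the arithmetic; they are not measurements from this study.

```python
def energy_joules(power_watts: float, time_seconds: float) -> float:
    """Energy per operation as average power multiplied by time per operation."""
    return power_watts * time_seconds

# Hypothetical design points (illustrative numbers, not measured results).
baseline_energy = energy_joules(1.0e-3, 2.0)      # 1.0 mW for 2.0 s
accelerated_energy = energy_joules(1.2e-3, 0.5)   # 20% more power, 4x shorter run time

# Factor improvement in energy per operation of the accelerated design point.
improvement = baseline_energy / accelerated_energy
```

In this made-up case the accelerated design draws more power yet spends less energy per operation, because the reduction in execution time outweighs the increase in power.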
III. RELATED WORK
Researchers have dedicated much effort to achieving significant acceleration using hardware in FPGA and ASIC designs; however, only a few publications seem to investigate the energy consumption aspect of public key cryptography for embedded devices. In order for public key cryptography to be viable in energy-constrained applications, a better understanding of the energy cost associated with asymmetric encryption in both hardware and software is necessary.
Wander et al. compared the energy cost of 1024-bit RSA with that of 160-bit ECC to show that 160-bit ECC significantly reduces energy consumption when executed on an 8-bit Atmel ATmega128L microprocessor [7]. The results provide a very compelling argument for ECC, showing that, based on an assumed battery life, the device using ECC could execute 4.2 times the number of key exchange operations. While their work looked at the energy cost for asymmetric cryptography on the far left side of the range shown in Figure 1, our work examines its cost for two additional points on the spectrum.
Also on the far left of the spectrum, Potlapally et al. investigated the energy requirements of OpenSSL on an Intel SA-1110 StrongARM processor [13]. To do so, they devised a LabVIEW-based testbed that measures the power consumption of a handheld device with the SA-1110 processor in real time. Their experimental results motivate further research by showing that for IKE data transfers, asymmetric cryptography consumes greater than 90% of the total energy spent on cryptographic processing. This equates to 56% of the total energy expended during the data transfer. Additionally, they show that 163-bit ECC requires less energy than 1024-bit RSA when client authentication is utilized.
Keller et al. examined the public-key energy consumption for FPGAs [18]. First, the design of an entire asymmetric cryptographic processor is explained. Then, the design is implemented on a Xilinx Spartan 3E FPGA and characterized in terms of its energy consumption. The processor is capable of utilizing binary or prime finite fields. For prime field mathematics, the authors used 192-bit integers, while for binary mathematics, 163-bit polynomials were used. For energy consumption characterization, the authors kept the bit lengths the same but made various algorithmic changes. They found that the power consumption of the FPGA remained quite constant throughout their experimentation, and thus, the fastest system configuration was also the most energy efficient. In the design by Keller et al., the field size was fixed at synthesis time, placing it on the far right of the spectrum of Figure 1. By contrast, the accelerator presented here is run-time configurable for up to 384-bit ECC. Furthermore, our work evaluates the energy cost for ASIC technology as opposed to FPGA logic, which presents a significantly different power performance profile.
Goodman et al. compared public key cryptographic operations on a domain-specific reconfigurable cryptographic processor (DSRCP) with previously reported FPGA implementations and a software-only implementation on a StrongARM [19]. The DSRCP was implemented in a 0.25 µm process technology, and the energy consumption numbers were true measurements. The authors report orders of magnitude lower energy consumption for the DSRCP compared to software and FPGA implementations. For public-key cryptographic algorithms, reconfigurability of the DSRCP is possible, while the energy consumed by the DSRCP is half that of previously reported non-reconfigurable hardware solutions. Because the DSRCP can only perform public-key encryption, it lies on the right side of the diagram in Figure 1. Our work investigates more reconfigurable points to the left on the diagram.
For symmetric encryption, Wu et al. show a 2.25x performance improvement over pure SW with CryptoManiac, which requires 1/100th of the area of an Alpha 21264 [20]. Although the authors did not investigate energy, we acknowledge that this design would yield a significant reduction in energy per symmetric cryptographic operation. It should be noted that this work is complementary to ours because symmetric and asymmetric cryptography are used cooperatively.
IV. ALGORITHMIC IMPLEMENTATION DETAILS

A. ECDSA
In this study we examine the energy cost for the Elliptic Curve Digital Signature Algorithm (ECDSA), which is a variant of the Digital Signature Algorithm (DSA) that utilizes elliptic curve scalar point multiplication in place of modular exponentiation [8]. We chose the ECDSA as our benchmark because it is a standardized elliptic curve-based algorithm found in many protocol implementations, including OpenSSL [14].
The ECDSA defines an operation for signing a message and another operation for verifying the signature of a message. Our study examines the energy cost of both in order to understand the cost of an SSL handshake. A signature operation computes a single scalar point multiplication (i.e., $X = kP$), while a verify operation computes a twin scalar point multiplication (i.e., $X = u_1 P + u_2 Q$). For our study, we employed optimized windowing techniques for scalar point multiplication, in which case the cost of a twin scalar point multiplication is not much more than that of a single scalar point multiplication [21]. Mixed Jacobian-affine coordinates were used for point addition, while Jacobian coordinates were used for point doubling [12].
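The reason a twin multiplication can cost little more than a single one is that both scalars can share one chain of doublings. The sketch below shows the simplest form of this idea (simultaneous multiplication, often called Shamir's trick) rather than the optimized windowing techniques used in this study. The toy curve over GF(97), the affine formulas, and all function names are assumptions for illustration; Python 3.8 or later is assumed for `pow(x, -1, m)`.

```python
P_MOD, A, B = 97, 2, 3   # toy curve y^2 = x^3 + A*x + B over GF(97), illustration only

def point_add(p1, p2):
    """Affine point addition; None represents the point at infinity."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None
    if p1 == p2:
        lam = (3 * x1 * x1 + A) * pow((2 * y1) % P_MOD, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)

def scalar_mul(k, pt):
    """Single scalar multiplication k*pt by double-and-add."""
    result = None
    for bit in bin(k)[2:]:
        result = point_add(result, result)
        if bit == "1":
            result = point_add(result, pt)
    return result

def twin_scalar_mul(u1, p, u2, q):
    """Computes u1*p + u2*q with one shared doubling chain."""
    table = {(1, 0): p, (0, 1): q, (1, 1): point_add(p, q)}   # P + Q precomputed once
    result = None
    for i in range(max(u1.bit_length(), u2.bit_length()) - 1, -1, -1):
        result = point_add(result, result)                    # one doubling serves both scalars
        bits = ((u1 >> i) & 1, (u2 >> i) & 1)
        if bits != (0, 0):
            result = point_add(result, table[bits])           # at most one addition per bit
    return result
```

Compared with two independent multiplications followed by an addition, the number of doublings is halved, which is the effect the windowed variants exploit more aggressively.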
B. Multi-precision multiply and reduction
Because asymmetric cryptography involves computation of integers typically much larger than the width of the machine with which they are computed, multi-precision routines are necessary to perform the finite field arithmetic essential for ECDSA. Of the multi-precision routines, inversion and multiplication have the highest computational complexities; however, software acceleration techniques, such as the use of three-dimensional coordinate systems, reduce the number of required inversions [22]. In terms of energy, multiplication is the most costly multi-precision routine. Therefore, we will briefly review the specific multi-precision multiplication algorithms used in this study. For coverage of the addition and subtraction algorithms, consult Brown et al. [23].
Multiplication: Multiplication can be broadly divided into two categories: product scanning and operand scanning. Operand scanning is the traditional "school-book" technique, also known as "pencil-and-paper" multiplication. When implemented in software, operand scanning requires a nested for loop with the inner-loop iterating over the multiplicand and the outer-loop iterating over the multiplier. Within the inner loop, the primary arithmetic computation is given by

$$(u, v) = a_j \times b_i + p_{i+j} + u \qquad (5)$$

assuming P = A * B. In other words, operand scanning requires a succession of multiply-add operations.
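A minimal sketch of operand scanning over 32-bit words follows. The little-endian word layout, the helper names `to_words` and `from_words`, and the function name are assumptions for illustration; the study's routines were written in C++.

```python
W = 32                       # assumed machine word width in bits
MASK = (1 << W) - 1

def to_words(value, n):
    """Splits a non-negative integer into n little-endian W-bit words."""
    return [(value >> (W * i)) & MASK for i in range(n)]

def from_words(words):
    """Recombines little-endian W-bit words into an integer."""
    return sum(w << (W * i) for i, w in enumerate(words))

def mul_operand_scanning(a, b):
    """School-book multiplication of word arrays a and b; returns len(a)+len(b) words."""
    n, m = len(a), len(b)
    p = [0] * (n + m)
    for i in range(m):                        # outer loop over multiplier words b[i]
        u = 0                                 # carry word between inner iterations
        for j in range(n):                    # inner loop over multiplicand words a[j]
            uv = a[j] * b[i] + p[i + j] + u   # multiply-add; always fits in 2*W bits
            u, v = uv >> W, uv & MASK
            p[i + j] = v                      # partial result written back each iteration
        p[i + n] = u
    return p
```

Note that every inner iteration reads and writes a result word, which is the memory traffic that product scanning avoids.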
Product scanning, like operand scanning, encompasses a nested-loop structure; however, it iterates over the result array in the outer-loop and accumulates the product terms within the inner-loop. For product scanning, the inner-loop computation is given by

$$(t, u, v) = (t, u, v) + a_j \times b_{i-j} \qquad (6)$$

such that (t, u, v) is the accumulator register set. In other words, product scanning requires a succession of multiply-accumulate operations. Operand scanning and product scanning require the same number of multiplications; however, when a multiply-accumulate instruction is available, product scanning requires fewer adds and stores to memory. If a multiply-accumulate instruction does not exist in the target architecture, the multiply-accumulate operation must be emulated with multiplies and adds and uses additional registers, thereby diminishing the overall benefit. For our baseline architecture, we found operand scanning to perform marginally better than product scanning. For that reason, we used product scanning only in the case of instruction set extensions.
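Product scanning can be sketched in the same style. The three-word accumulator is modeled with the variables `t`, `u`, and `v`; the word layout, helper names, and the restriction to equal-length operands are assumptions for illustration.

```python
W = 32                       # assumed machine word width in bits
MASK = (1 << W) - 1

def to_words(value, n):
    """Splits a non-negative integer into n little-endian W-bit words."""
    return [(value >> (W * i)) & MASK for i in range(n)]

def from_words(words):
    """Recombines little-endian W-bit words into an integer."""
    return sum(w << (W * i) for i, w in enumerate(words))

def mul_product_scanning(a, b):
    """Column-wise multiplication of equal-length word arrays; returns 2*len(a) words."""
    n = len(a)
    if len(b) != n:
        raise ValueError("this sketch assumes operands of equal length")
    p = [0] * (2 * n)
    t = u = v = 0                                     # accumulator register set (t, u, v)
    for k in range(2 * n - 1):                        # outer loop over result words
        for i in range(max(0, k - n + 1), min(k, n - 1) + 1):
            acc = ((t << (2 * W)) | (u << W) | v) + a[i] * b[k - i]   # multiply-accumulate
            t, u, v = acc >> (2 * W), (acc >> W) & MASK, acc & MASK
        p[k] = v                                      # one store per result word
        t, u, v = 0, t, u                             # shift accumulator right by one word
    p[2 * n - 1] = v
    return p
```

Each result word is stored exactly once, after its whole column has been accumulated, which is why a hardware multiply-accumulate instruction favors this ordering.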
Reduction: A number of techniques exist for reducing the result of the multiplication (i.e., the modulo operation). Assuming the use of generalized Mersenne primes selected by the National Institute of Standards and Technology (NIST), software routines can take advantage of modular congruency in order to reduce a multiplication result using substitutions, additions, and subtractions [23]. Because each key size requires a unique reduction algorithm, the NIST reduction techniques are not recommended for hardware implementation. Rather, the preferred method of reduction is Montgomery reduction [24]. Koç et al. provide a comprehensive examination of Montgomery multiplication, in which the Coarsely Integrated Operand Scanning (CIOS) technique stands out amongst the rest [25].
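A word-level sketch of CIOS Montgomery multiplication is given below, following the usual textbook structure of the algorithm. It is not the microcode used in this work. The 32-bit word size, the helper names, and the use of Python big integers for the final conditional subtraction are assumptions made for brevity; Python 3.8 or later is assumed for `pow(x, -1, m)`.

```python
W = 32                       # assumed machine word width in bits
MASK = (1 << W) - 1

def to_words(value, n):
    """Splits a non-negative integer into n little-endian W-bit words."""
    return [(value >> (W * i)) & MASK for i in range(n)]

def from_words(words):
    """Recombines little-endian W-bit words into an integer."""
    return sum(w << (W * i) for i, w in enumerate(words))

def mont_n0_inv(n):
    """Returns -n^(-1) mod 2^W, computed from the least significant word of odd n."""
    return (-pow(n[0], -1, 1 << W)) % (1 << W)

def mont_mul_cios(a, b, n, n0_inv):
    """Returns a*b*R^(-1) mod n as words, with R = 2^(W*len(n)) and a, b < n."""
    s = len(n)
    t = [0] * (s + 2)
    for i in range(s):
        carry = 0
        for j in range(s):                            # multiplication step: t += a * b[i]
            cs = t[j] + a[j] * b[i] + carry
            t[j], carry = cs & MASK, cs >> W
        cs = t[s] + carry
        t[s], t[s + 1] = cs & MASK, cs >> W
        m = (t[0] * n0_inv) & MASK                    # makes the low word of t + m*n zero
        carry = (t[0] + m * n[0]) >> W
        for j in range(1, s):                         # reduction step, shifting down one word
            cs = t[j] + m * n[j] + carry
            t[j - 1], carry = cs & MASK, cs >> W
        cs = t[s] + carry
        t[s - 1], carry = cs & MASK, cs >> W
        t[s] = t[s + 1] + carry
    result, modulus = from_words(t[: s + 1]), from_words(n)
    if result >= modulus:                             # single final conditional subtraction
        result -= modulus
    return to_words(result, s)
```

Because the same loop structure works for any odd modulus and any word count, one routine covers every key size, in contrast to the per-prime NIST reduction routines.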
C. Software implementation
For our baseline architecture, we implemented the various multiplication techniques in C++ and evaluated their performance with a 384-bit ECDSA operation. The results showed operand scanning with NIST fast reduction to perform the best with our given HW/SW architecture. We assumed power would remain fairly constant across the various techniques, and therefore selected operand scanning with NIST fast reduction for our baseline software suite.
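To illustrate what NIST fast reduction looks like, the sketch below reduces a product modulo the P-192 prime, p = 2^192 - 2^64 - 1, using only word substitutions and additions followed by a few subtractions. P-192 is chosen here because it is the shortest case; the 384-bit case used for the evaluation above follows the same principle with a different substitution pattern. The function name and the use of Python integers instead of word arrays are assumptions for illustration.

```python
P192 = 2**192 - 2**64 - 1     # NIST P-192 generalized Mersenne prime

def reduce_p192(c):
    """Reduces c (0 <= c < P192^2) modulo P192 using 64-bit word substitutions."""
    if not 0 <= c < P192 * P192:
        raise ValueError("input must be a product of two reduced field elements")
    m64 = (1 << 64) - 1
    c0, c1, c2, c3, c4, c5 = [(c >> (64 * i)) & m64 for i in range(6)]
    # Since 2^192 = 2^64 + 1 (mod P192), the upper words fold onto the lower ones.
    s1 = (c2 << 128) | (c1 << 64) | c0
    s2 = (c3 << 64) | c3
    s3 = (c4 << 128) | (c4 << 64)
    s4 = (c5 << 128) | (c5 << 64) | c5
    r = s1 + s2 + s3 + s4
    while r >= P192:              # at most a handful of subtractions remain
        r -= P192
    return r
```

The substitution pattern is specific to this prime, which is why each key size needs its own routine.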
The instruction set extensions (discussed in Section V-B) were specifically designed to allow computation with an accumulator, so we compared product scanning with NIST fast reduction to the Finely Integrated Product Scanning (FIPS) using these enhancements [25]. We found that product scanning with NIST fast reduction outperforms FIPS. Thus, our second microarchitecture uses product scanning with NIST fast reduction for multiplication. For our fully-accelerated microarchitecture (discussed in Section V-C), the CIOS algorithm was implemented in microcode for multiplication.
D. Software build/run-time environment
We used crosstool-ng 1.14.0 to compile our software build environment, which includes the GNU Compiler Collection (gcc) 4.4.6 and Binutils 2.23. The executable binaries used for our evaluation were compiled with -O2 optimization and statically linked to Newlib. Unless stated otherwise, the algorithms mentioned here were developed in C++. For the instruction set extensions in Section V-B and co-processor instructions in Section V-C, we modified the mips-opc.c source file to include these supplementary instructions and recompiled Binutils.
The run-time environment for our study was a bare-metal (no OS) environment representative of a low-power, embedded microcontroller. Instructions and initialization data are read directly out of ROM. A minimal amount of RAM is supplied for stack, heap, and miscellaneous data sections.
V. EVALUATED MICROARCHITECTURES
We now describe in detail the microarchitectures evaluated in this paper.
A. The Baseline Microarchitecture
The baseline architecture we developed, depicted in Figure 2, consists of a RISC processor with 128KB of ROM and 8KB of RAM. This memory configuration was chosen based on the needs of our ECDSA software. The RISC processor, from here on referred to as "Pete," is a classic, five-stage, pipelined processor without cache or a Memory Management Unit (MMU). Pete executes a subset¹ of the MIPS-I Instruction Set Architecture (ISA) [26].
Statically scheduled multiply: One particularly unique characteristic of the MIPS ISA is the Hi/Lo register set used for storing multiplication and division results. The use of these registers allows the multiply/divide hardware to lie outside of the integer pipeline, as shown in Figure 2, and therefore operate in parallel with the integer pipeline. For those unfamiliar with this concept, consider the assembly code below:

    mult $t0, $t1    #initiate t0*t1
    #other independent instructions
    #may be placed here
    ...
    mflo $t2         #store lower 32-bit result in t2
    mfhi $t3         #store upper 32-bit result in t3
The mult instruction initiates the multiplication of t0 by t1, while the mflo and mfhi instructions store the lower and upper parts of the 64-bit result, respectively. Therefore, instructions independent of the multiply can be statically scheduled between the mult and mflo/mfhi instructions. This feature is especially useful for hiding the cost of loop maintenance for tightly-nested arithmetic loops. Consequently, the multi-precision integer algorithms required for asymmetric cryptography particularly benefit from this light instruction-level parallelism.
Fast, parallel multiplication found on many high performance RISC cores is costly in terms of area and power [27]. To alleviate the cost of Pete's 32-bit multiplier, we designed a multi-cycle multiplication unit using only a single half-word parallel multiplication block. After examining the assembly output for our multi-precision integer routines, it became clear to us that the compiler effectively schedules instructions to take advantage of the instruction-level parallelism that this architecture can provide. Consequently, we were able to increase the multiplication latency to four clock cycles without significantly affecting the execution time of the multi-precision multiplication routines.

¹ The MIPS® unaligned load and store instructions, as well as floating point instructions and those related to cache and memory management, are not included in Pete's feature set.
Karatsuba multiplier implementation: To further reduce dynamic power, we based our multi-cycle multiplication unit on Karatsuba's divide-and-conquer technique, described by Großschädl et al. [28].

$$P = A_H B_H \cdot 2^{32} + \left( A_H B_H + A_L B_L + \left[ (A_H - A_L)(B_L - B_H) \right] \right) \cdot 2^{16} + A_L B_L \qquad (7)$$

Equation 7 expresses Karatsuba multiplication mathematically, such that $P$ is the product and $A_H$, $A_L$, $B_H$, $B_L$ represent the input operands, $A$ and $B$, split into high and low parts.
The principal advantage of Karatsuba multiplication is that only three half-word multiplications are needed, as opposed to four with operand or product scanning methods. It should be noted that the term enclosed by square brackets in (7) can be less than zero, so Karatsuba multiplication introduces signed arithmetic within an unsigned computation. If the multiplication unit is expected to handle signed as well as unsigned multiplication, which was the case for our work, then this will not necessitate an exorbitant amount of extra logic when compared to other techniques. The primary arithmetic components of our Karatsuba multiplier include a 17-bit by 17-bit signed parallel multiplication block, a four-port 49-bit adder, and two 16-bit subtraction units.
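The three-multiplication structure can be checked with a short sketch. It uses one common subtractive formulation of Karatsuba for a 32-bit by 32-bit product built from 16-bit halves; the function name is hypothetical, and the code models only the arithmetic, not the multi-cycle datapath described above.

```python
def karatsuba_mul32(a: int, b: int) -> int:
    """32x32-bit unsigned multiply built from three half-word multiplications."""
    if not (0 <= a < 2**32 and 0 <= b < 2**32):
        raise ValueError("operands must be unsigned 32-bit values")
    ah, al = a >> 16, a & 0xFFFF
    bh, bl = b >> 16, b & 0xFFFF
    high = ah * bh                     # half-word multiplication 1
    low = al * bl                      # half-word multiplication 2
    mid = (ah - al) * (bl - bh)        # multiplication 3: signed 17-bit operands, may be negative
    # high + low + mid equals ah*bl + al*bh, so the middle term is never negative overall.
    return (high << 32) + ((high + low + mid) << 16) + low
```

The two differences need one more bit than the unsigned halves to hold their sign, which is consistent with a 17-bit by 17-bit signed multiplication block and two 16-bit subtraction units.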
B. ISA Extensions
Instruction set extensions are special purpose instructions built into an existing ISA in order to enhance the execution of particular algorithms. For many applications, including DSP, communications, and cryptography, these special purpose instructions have shown considerable speedup with very little additional overhead. We consider instruction set extensions a "middle-of-the-spectrum" acceleration technique and therefore feel they warrant consideration in our comparison study. Großschädl et al. extensively explored the use of instruction set extensions for public-key cryptography on various RISC platforms including MIPS and SPARC V8 [28], [29]. Although their research covers both GF(p) (prime finite fields) and GF(2^m) (binary finite fields), we will only focus on GF(p) in this study. In the future, we will consider GF(2^m) computations as well.
Großschädl et al. recommend four supplementary instructions for acceleration of all variations of product scanning multiplication (i.e., Comba, FIPS, etc.). These instruction set extensions are summarized in Table I. One thing to note is the expansion of the Hi/Lo register set to include a third 32-bit register referred to as the OvFlo register. Those familiar with the MIPS ISA might notice that the MADDU instruction is actually available in later versions of the MIPS ISA. The difference here is support for higher precision accumulate operations necessary for product scanning multiplication. The M2ADDU instruction is an optimization specifically for squaring, while the ADDAU instruction improves the performance of the FIPS Montgomery multiplication algorithm and potentially the NIST reduction algorithms. The SHA instruction is needed for all variations of the product scanning algorithm and facilitates access to the OvFlo register [29].

TABLE I: Instruction set extensions for public-key cryptography. Adapted from the work of Großschädl et al. [29].
The suggested ISA extensions needed only a minimal amount of modification to our baseline microarchitecture. Aside from extra decode logic in the main pipeline, most of the modifications were concentrated within the Karatsuba multiplication unit. For example, the four-port adder was widened to 50 bits, and extra internal carry bits were added. The multiplexing logic was modified to support extra data paths from the result registers (for accumulate) and the operand registers (for the ADDAU instruction). Additionally, result shifting and stores into the OvFlo register were added. Figure 3 depicts Pete's multi-cycle multiply-accumulate unit with the ISA extension modifications highlighted. It should be noted that the multiplication block remained untouched.
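A behavioral sketch of the extended accumulator may help in following the roles of the four instructions. The semantics below are assumptions inferred from the instruction names and the descriptions in this section (a 96-bit accumulator formed by OvFlo, Hi, and Lo; a 32-bit right shift for SHA); Table I and [29] remain the authoritative definitions. The class, method, and helper names are hypothetical.

```python
MASK32 = (1 << 32) - 1
MASK96 = (1 << 96) - 1

class Accumulator:
    """Hypothetical model of the (OvFlo, Hi, Lo) register set as one 96-bit value."""
    def __init__(self):
        self.value = 0
    def maddu(self, rs, rt):      # assumed: accumulate rs * rt
        self.value = (self.value + rs * rt) & MASK96
    def m2addu(self, rs, rt):     # assumed: accumulate 2 * rs * rt (squaring cross terms)
        self.value = (self.value + 2 * rs * rt) & MASK96
    def addau(self, rs, rt):      # assumed: accumulate rs + rt
        self.value = (self.value + rs + rt) & MASK96
    def sha(self):                # assumed: shift accumulator right by one 32-bit word
        self.value >>= 32
    def mflo(self):               # read the least significant word
        return self.value & MASK32

def square_words(a):
    """Product-scanning squaring of little-endian 32-bit words using the model above."""
    n, acc, p = len(a), Accumulator(), []
    for k in range(2 * n - 1):
        i, j = max(0, k - n + 1), min(k, n - 1)
        while i < j:                       # each cross term a[i]*a[j] is counted twice
            acc.m2addu(a[i], a[j])
            i, j = i + 1, j - 1
        if i == j:                         # diagonal term appears once
            acc.maddu(a[i], a[i])
        p.append(acc.mflo())
        acc.sha()
    p.append(acc.mflo())
    return p
```

Under these assumptions, squaring needs roughly half as many multiply-accumulate operations as a general product of the same size, which is the motivation for a dedicated doubling instruction.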
C. Fully Accelerated Microarchitecture
Continuing towards the right of the spectrum shown in Figure 1, we augmented our microarchitecture with an accelerator designed specifically for GF(p) arithmetic. Figure 4 depicts the top-level diagram of our final microarchitecture with the accelerator, referred to as "Monte," on the left, the memory in the center, and Pete on the right. Similar to work described by Koschuch et al. [30], Pete and Monte utilize a shared memory interface in order to reduce any bottlenecks that might be created with a bus interface. Hence, the 8KB of RAM found in our baseline architecture was extended to a true dual-port memory to which both Pete and Monte can read/write.
Co-processor interface: To coordinate communication between Pete and Monte, we implemented a portion of the co-processor interface defined in the MIPS architecture; specifically, we modified Pete to include Co-Processor 2 instructions for the command and control of Monte. These coprocessor instructions are listed in Table II. The first instruction, ctc2, allows Pete to initialize the control registers within Monte. As will be discussed later, these control registers allow run-time configuration of the algorithms executing within Monte. The second instruction, cop2Sync, facilitates synchronization between Pete and Monte, typical of any parallel processing system. Instructions cop2ldA, cop2ldB, and cop2ldN initiate Direct Memory Access (DMA) transfers from shared memory to operand buffers within Monte. The start address in shared memory of A, B, and N is contained within Pete's General Purpose Register (GPR) rt. Instructions cop2mul, cop2add, and cop2sub initiate modular multiply, add, and subtract, respectively. The result of the above computation instructions is copied back into memory by the cop2st instruction, which initiates a DMA transfer from the result buffers within Monte out to shared memory. It should be noted that the above instructions are multi-cycle instructions with latencies dependent on the size of the finite field.
Monte instructions: The instructions described in Table II ...

    ...
    6   cop2st $s0     #must wait until mul done
    7   #instructions below do not depend on previous
    8   cop2ldA $t0    #can run ahead of store!
    9   cop2ldB $t1    #same here
    10  cop2add        #A+B mod N
    11  cop2st $t3     #must wait until add is done
    12  cop2ldA $t3    #must be forwarded during store
    13  cop2ldB $s0    #can run ahead of store
    14  cop2sub        #A-B mod N
The loading of operands and storing of results are overlapped with computation via a double buffering scheme. Similarly, operand data is buffered separately from result data to increase buffer bandwidth and avoid unnecessary stalls in the arithmetic logic, while at the same time not demanding more ports per buffer. To reduce the number of reads from shared memory, we have included forwarding paths from the result buffer to the operand buffer. Consider the snippet of code above to understand how Monte reorders instruction execution.
At line 2, the load instruction will be immediately dispatched to the DMA and a transfer will be started during the next clock cycle. Meanwhile, the next instructions will be queued because the instruction at line 3 is not able to dispatch until the current DMA transfer is complete. After instruction 4 dispatches, a DMA transfer will be started and instruction 5 will dispatch to the FFAU. Instruction 5 will not issue (i.e., start execution) until the current DMA transfer has completed. Once instruction 4 finishes, Monte will swap operand buffers, and instruction 5 will begin executing. At the same time, instruction 6 will dispatch to the DMA, where it will wait in the store reservation register until instruction 5 completes. Note that the DMA functional unit contains a reservation register for stores only; loads are initiated upon dispatch, so a reservation register is not necessary for them. In the meantime, instructions 8 and 9 will be processed while instruction 5 continues to execute. The multiply is an expensive operation, so instruction 10 will be held up in the instruction queue until the multiplication completes. When instruction 5 finally finishes, the result buffer will be swapped, and instruction 10 will be dispatched. On the next cycle, instruction 10 will begin because its operands have already been loaded, and instruction 6 will begin storing the multiplication result out to memory. Once instruction 6 has completed, the store instruction at line 11 can dispatch into the reservation register, where it will wait until the add completes.
In the meantime, instruction 12 will dispatch. Instruction 12, however, causes a Read-After-Write (RAW) hazard with instruction 11. Instead of executing instruction 12, a forwarding bit in the DMA unit will be asserted, and instruction 12 will be discarded. Instruction 13 will then dispatch and begin a transfer on the next clock cycle because it does not pose a RAW hazard. Once instructions 10 and 13 complete, the DMA will begin storing the add result out to shared memory while, at the same time, copying the data into operand A. Instruction 14 will dispatch but cannot start until the store has completed the forwarding operation.
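The routing decisions in the walkthrough above can be illustrated with a small Python sketch. It is only a behavioral approximation under stated assumptions, not a model of Monte's hardware: the function name `route` and the tags it returns are hypothetical, the RAW check compares register names rather than resolved addresses, and only the most recently dispatched store is tracked. The input is the surviving portion of the listing (lines 8 through 14).

```python
# Opcode groups taken from the text; everything else here is illustrative.
LOADS = {"cop2ldA", "cop2ldB", "cop2ldN"}
COMPUTE = {"cop2mul", "cop2add", "cop2sub"}
STORE = "cop2st"


def route(program):
    """Tag each (opcode, register) pair with where it is sent on dispatch.

    Assumption: a load conflicts with a store when both name the same
    address register, and only the latest store is remembered.
    """
    tags = []
    pending_store_reg = None
    for opcode, reg in program:
        if opcode == STORE:
            # Stores wait in the DMA reservation register for their result.
            pending_store_reg = reg
            tags.append("reserve")
        elif opcode in LOADS:
            if reg == pending_store_reg:
                # RAW hazard: drop the load and forward from the result buffer.
                tags.append("forward")
            else:
                # Loads start a DMA transfer as soon as they dispatch.
                tags.append("dma")
        elif opcode in COMPUTE:
            tags.append("ffau")
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return tags


listing = [
    ("cop2ldA", "$t0"),  # line 8
    ("cop2ldB", "$t1"),  # line 9
    ("cop2add", None),   # line 10
    ("cop2st", "$t3"),   # line 11
    ("cop2ldA", "$t3"),  # line 12
    ("cop2ldB", "$s0"),  # line 13
    ("cop2sub", None),   # line 14
]
tags = route(listing)
```

Under these assumptions only line 12 is tagged for forwarding, which matches the behavior described for the store at line 11.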
Finite Field Arithmetic Unit: For accelerated finite field arithmetic, we designed a microcoded FFAU. A zoomed-in view of the FFAU in Figure 5 reveals that the major components of our accelerator include an arithmetic core, multiplexing logic, address logic, and a control unit. The 32-bit arithmetic core is a flexible, 2-stage pipelined multiply-add unit, which is capable of performing various combinations of adds and multiplies depending on input control bits. Flip-flops within the arithmetic core store intermediate carries to allow for efficient pipelining of the back-to-back multiply-adds required by the multi-precision arithmetic. The address logic is nothing more than a few index registers, which generate the operand and result addresses in parallel with the computation. The multiplexing logic provides the FFAU with enough flexibility to compute the semi-complex CIOS Montgomery multiplication algorithm.
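For readers unfamiliar with CIOS (Coarsely Integrated Operand Scanning) Montgomery multiplication, the following Python sketch shows the standard textbook form of the algorithm on 32-bit words. It is not the FFAU microcode; the function and helper names are hypothetical, and the test modulus is the NIST P-192 prime, chosen here only as a convenient 192-bit example. It requires Python 3.8 or later for the modular inverse via `pow`.

```python
W = 32                 # word size, matching the 32-bit arithmetic core
MASK = (1 << W) - 1


def to_words(x, s):
    """Split integer x into s little-endian W-bit words."""
    return [(x >> (W * i)) & MASK for i in range(s)]


def from_words(words):
    """Reassemble little-endian W-bit words into an integer."""
    return sum(w << (W * i) for i, w in enumerate(words))


def cios_montmul(a, b, n, s):
    """Return a * b * R^-1 mod n with R = 2^(W*s); n odd, a and b below n."""
    aw, bw, nw = to_words(a, s), to_words(b, s), to_words(n, s)
    n0_inv = (-pow(nw[0], -1, 1 << W)) & MASK  # -n^-1 mod 2^W, precomputed once
    t = [0] * (s + 2)
    for i in range(s):
        # Multiplication step: t += a * b[i], one multiply-add per word.
        carry = 0
        for j in range(s):
            acc = t[j] + aw[j] * bw[i] + carry
            t[j], carry = acc & MASK, acc >> W
        acc = t[s] + carry
        t[s], t[s + 1] = acc & MASK, acc >> W
        # Reduction step: add m * n so the low word is zero, then shift down.
        m = (t[0] * n0_inv) & MASK
        carry = (t[0] + m * nw[0]) >> W
        for j in range(1, s):
            acc = t[j] + m * nw[j] + carry
            t[j - 1], carry = acc & MASK, acc >> W
        acc = t[s] + carry
        t[s - 1] = acc & MASK
        t[s] = t[s + 1] + (acc >> W)
    result = from_words(t[: s + 1])
    return result - n if result >= n else result  # final conditional subtract


p192 = 2**192 - 2**64 - 1
x = 0x0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF
y = p192 - 2
product = cios_montmul(x, y, p192, 6)
```

The inner loops consist entirely of multiply-add operations with carry propagation, which is the access pattern the pipelined multiply-add core and its carry flip-flops are described as supporting.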
The control unit contains a 64-entry microcode table, along with built-in hardware for nested loop structures and other conditional branches. In an attempt to balance the trade-off between performance and reconfigurability, the control unit contains a set of control registers, programmable by the ctc2 instruction. Precomputed algorithm parameters as well as the multi-precision integer width must be preloaded into Monte prior to use. A return address register has been included to allow subroutine calls (leaf functions only).
VI. EVALUATION
A. Methodology
We developed fully synthesizable Verilog models for the baseline processor ("Pete"), the ISA-extended processor, and the Finite-Field Accelerator ("Monte"). Front-end synthesis was utilized to construct the arithmetic components, including the 17-bit by 17-bit multiplication blocks within Pete and the 32-bit multiply-add unit within Monte. Post-synthesis power estimations for core logic on a 45 nm technology node with a 100 MHz clock were performed using Synopsys PrimeTime, a timing and power analysis tool for CMOS logic [15], [31].
To estimate memory power, we used counters to keep track of the number of reads and writes to and from the memories embedded within the testbench, and we used Cacti to extract estimates of energy per read/write and leakage power [32]. Unfortunately, no equivalent tool for estimating ROM power exists. As a conservative estimate, ROM dynamic power was assumed to be equivalent to that of a comparably sized RAM, while ROM static power was assumed to be zero.
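The accounting described above can be summarized in a few lines of Python. This is a sketch of how the pieces plausibly combine, assuming energy is the sum of core energy (average power times execution time), memory dynamic energy (access counts times energy per access), and memory leakage energy. The function name, parameter names, and all numeric values below are placeholders, not measurements from this study.

```python
def operation_energy(core_power_w, cycles, clock_hz,
                     reads, writes, e_read_j, e_write_j, mem_leak_w):
    """Estimate energy in joules for one operation.

    core_power_w : average core power from post-synthesis analysis
    cycles       : cycle count for the operation
    reads/writes : memory access counts from testbench counters
    e_read_j     : energy per read (for example, from a memory model)
    e_write_j    : energy per write
    mem_leak_w   : memory leakage power (use 0.0 for a ROM under the
                   zero-static-power assumption stated in the text)
    """
    seconds = cycles / clock_hz
    core = core_power_w * seconds
    mem_dynamic = reads * e_read_j + writes * e_write_j
    mem_static = mem_leak_w * seconds
    return core + mem_dynamic + mem_static


# Placeholder inputs chosen only to make the arithmetic easy to follow.
example = operation_energy(
    core_power_w=1e-3, cycles=1_000_000, clock_hz=100e6,
    reads=1000, writes=500, e_read_j=1e-12, e_write_j=2e-12,
    mem_leak_w=1e-6,
)
```

A ROM would be handled by calling the same function with `writes=0` and `mem_leak_w=0.0`, reflecting the conservative assumption above.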
Each cryptographic operation requires millions of clock cycles and is arduous to simulate post-synthesis. For power estimations using the techniques discussed above, we simulated a portion of the algorithm representative of the entire algorithm. To measure execution times, we emulated each microarchitecture on a Xilinx Virtex-5 FPGA.

... Figure 1. For ISA extensions, we observe a 1.28 to 1.41 factor improvement in energy efficiency over baseline, while for full acceleration, we observe a 4.36 to 6.45 factor improvement. Furthermore, we observe that the energy consumed increases quite rapidly as the key size is increased. Examination of the data reveals that the increase is substantially greater than quadratic for the pure software configuration, while for the microarchitecture with ISA extensions, the increase is closer to quadratic. The effect of key size is much more gradual for the energy consumed by the fully accelerated microarchitecture, coming in just slightly less than quadratic. Thus, special-purpose hardware becomes much more attractive as greater security requirements are demanded.
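One way to make statements such as "greater than quadratic" concrete is to estimate a scaling exponent from two measurements, assuming energy follows a power law in key size. The sketch below uses synthetic inputs only; the function name is hypothetical and no values from the figures are reproduced.

```python
import math


def scaling_exponent(key_bits_1, energy_1, key_bits_2, energy_2):
    """Return p such that energy is proportional to key_bits ** p.

    Assumes a pure power law between the two sample points.
    """
    return math.log(energy_2 / energy_1) / math.log(key_bits_2 / key_bits_1)


# Synthetic check: doubling the key size while quadrupling the energy
# corresponds to exactly quadratic growth.
quadratic = scaling_exponent(192, 1.0, 384, 4.0)
```

An exponent above 2 would correspond to the behavior reported for pure software, and an exponent slightly below 2 to the fully accelerated case.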
B. Analysis
Energy breakdown: Figure 7 displays the energy consumed per 192-bit and 384-bit operation, broken down into subcomponents, vs. microarchitecture. The results show that most of the energy consumed by the baseline and ISA-extended microarchitectures is spent in the ROM. In these microarchitectures, an instruction must be read nearly every cycle; therefore, the ROM is kept very active. We found that this is a common theme amongst low-power embedded processors [33]. For the Pete and Monte combination, Monte consumes the majority of the energy when considering the 384-bit computation but is nearly tied with instruction reads for the 192-bit computation. From this, we ascertain that as key size is increased, the workload shifts from Pete to Monte. Intuitively, this makes sense. The portions of the algorithm that have a high computational complexity have been pushed off to Monte, so as key size grows, the relative amount of work shifts towards Monte. We also note that the ROM energy is much lower when Monte is in use. Monte's microcode RAM produces most of the instructions when Monte is in use, so the ROM's activity factor is dramatically lower.

Another interesting observation that can be made in Figure 7 is that the energy consumed in the shared memory decreases as the level of acceleration increases. This is partially due to reduced execution time decreasing the amount of energy spent on leakage power, but also due to the fact that each acceleration technique aims to reduce accesses to memory. For example, product scanning used with the proposed instruction set extensions requires fewer stores than operand scanning. Moreover, Monte utilizes smaller, internal buffers and data forwarding to reduce accesses to shared memory.
Power Consumption: In terms of power consumption, we found that all three microarchitectures consume power at approximately the same rate. ISA extensions cost an additional 1.8% average power compared to baseline, while the Pete and Monte duo consumes 10.5% less power compared to baseline. Despite the extra hardware cost of Monte and the subsequent static power increase, the reduction in ROM instruction reads leads to a slight power reduction overall. When comparing static to dynamic power, we observed that power gating idle components during an ECDSA computation would not be worthwhile because, at best, it would eliminate only a fraction of the already relatively small static power.
Energy-Delay Product: Energy-Delay Product (EDP) is a useful metric that takes into account savings in energy as well as reduction in latency. To further quantify the benefit of each of the evaluated microarchitectures, we graphed EDP vs. key size in Figure 8. Relating back to the diagram in Figure 1, the EDP spreads the spectrum in such a way that the gaps between points on the X-axis increase when moving from left to right. For reference, we have included Table III, which shows the latency per cryptographic operation for various system configurations. A 100 MHz clock was assumed. The combined Signature and Verify latency closely models an SSL handshake on the client side.
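As a brief illustration of why EDP widens the gaps between configurations, the Python sketch below computes EDP from energy and latency and shows that an energy improvement and a speedup multiply. The function names and the input values are hypothetical and are not taken from Figure 8 or Table III.

```python
def edp(energy_j, latency_s):
    """Energy-delay product in joule-seconds."""
    return energy_j * latency_s


def edp_improvement(energy_factor, speedup):
    """EDP improvement when energy drops by energy_factor and
    latency drops by speedup (both expressed as ratios above 1)."""
    return energy_factor * speedup


# Placeholder values: a design that is 4x more energy efficient and
# 5x faster has a 20x lower EDP than the reference design.
reference = edp(2.0, 3.0)
improved = edp(2.0 / 4.0, 3.0 / 5.0)
```

Because the two factors multiply, a configuration that reduces both energy and execution time separates from the baseline more sharply under EDP than under energy alone.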
C. Double Buffer Evaluation
To quantify the energy savings of Monte's instruction reordering scheme, we estimated energy consumption for 384-bit ECDSA with double buffering removed. The results demonstrate that overlapping data movement with computation amounts to a 13.5% improvement in energy consumption. The energy savings come from less idle time for Pete and Monte in addition to a reduction in the number of reads to shared memory. For the 192-bit key size, we measured a 9.4% reduction in energy due to double buffering. Therefore, Monte with double buffering scales better with larger key sizes. This is explained by the increasing cost of data movement as the key size grows.
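The effect of overlapping data movement with computation can be sketched with an idealized cycle-count model. This is an assumption-laden illustration, not Monte's actual timing: it assumes a single DMA engine that, while operation i computes, first stores the result of operation i-1 and then loads the operands of operation i+1, and it ignores data dependencies and forwarding. All names and cycle counts are hypothetical.

```python
def serialized_cycles(ops):
    """Cycles when each operation loads, computes, then stores in sequence."""
    return sum(load + compute + store for load, compute, store in ops)


def double_buffered_cycles(ops):
    """Cycles when DMA traffic overlaps the neighboring computation.

    ops is a list of (load, compute, store) cycle counts per operation.
    """
    total = ops[0][0]  # the first load cannot be hidden
    for i, (_, compute, _) in enumerate(ops):
        next_load = ops[i + 1][0] if i + 1 < len(ops) else 0
        prev_store = ops[i - 1][2] if i > 0 else 0
        # The slower of the FFAU and the DMA sets the length of this step.
        total += max(compute, next_load + prev_store)
    total += ops[-1][2]  # the last store cannot be hidden
    return total


# Three identical placeholder operations: 4-cycle load, 10-cycle compute,
# 2-cycle store.
ops = [(4, 10, 2)] * 3
serial = serialized_cycles(ops)
overlapped = double_buffered_cycles(ops)
```

In this simple model the benefit grows with the share of time spent on data movement, which is consistent in direction with the larger savings reported for the larger key size.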
D. Evaluation with Instruction Cache
Due to the significant contribution instruction reads make to the overall energy per operation, we felt further investigation was warranted. Thus, we modeled our three systems with an ideal (no misses whatsoever) 4 KB direct-mapped instruction cache with a 16-byte line size. Although we acknowledge this model is unrealistic, it will help guide the direction of our future work by revealing the range of benefits available with caching for reduced energy. As before, we used Cacti to estimate the energy of our cache. Figure 9 shows our results across various key sizes. We observe that the ideal instruction cache brings the energy consumption of both microarchitectures without Monte down considerably, leading us to believe caching could improve the overall energy efficiency when an accelerator is not available.

Investigating further, Figure 10 compares the energy per operation consumed by the baseline microarchitecture, modeled with an ideal instruction cache, to the microarchitecture that includes Monte, modeled with and without an ideal instruction cache. The results show that even when the baseline is paired with an ideal instruction cache, the accelerated microarchitecture performs an ECC sign and verify operation much more efficiently. Also, we observe that the instruction cache has a negligible effect on the accelerated microarchitecture, which is due to the fact that Monte significantly reduces the amount of instruction fetching.
E. Baseline Validation
To validate the energy efficiency of our baseline microarchitecture, we measured Pete against a similarly configured Microblaze processor (i.e., 5-stage pipeline, no cache, no MMU, full 32-bit by 32-bit multiplier, binary divider) on the Xilinx Virtex-5 platform [34]. The synthesis results reveal that Pete requires 34.3% more LUT-flip-flop pairs (i.e., more FPGA fabric); however, Pete requires 75.0% fewer Digital Signal Processing (DSP) blocks compared to Microblaze. We attribute the difference in resource consumption to our Karatsuba multiplier. Multi-cycle multiplication performs more addition and requires control logic, all of which utilize LUTs, while parallel multiplication maps well to the DSP hardware blocks on the Virtex-5. This trade-off is a win in the ASIC realm, which is the target technology for this study. In terms of performance, Pete outperforms Microblaze by 17.7% for a 384-bit ECDSA signature followed by a verify. Note that this is in spite of a longer-latency multiplication unit, which demonstrates an advantage of a separated multiplication unit over ISAs without it.
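For reference, a one-level Karatsuba decomposition of a 32-bit by 32-bit product can be sketched as follows. This Python fragment illustrates the general technique only; how Pete schedules the partial products across cycles is not specified here, and the function name is hypothetical. Splitting the operands into 16-bit halves makes the middle term a product of two sums of at most 17 bits each, which is consistent with the 17-bit by 17-bit multiplication blocks mentioned in the Methodology section.

```python
MASK16 = 0xFFFF


def karatsuba_32x32(a, b):
    """Multiply two unsigned 32-bit integers using three smaller products."""
    a_hi, a_lo = a >> 16, a & MASK16
    b_hi, b_lo = b >> 16, b & MASK16
    low = a_lo * b_lo                      # 16 x 16 bits
    high = a_hi * b_hi                     # 16 x 16 bits
    # Each sum fits in 17 bits, so this is a 17 x 17 bit product.
    cross = (a_hi + a_lo) * (b_hi + b_lo) - low - high
    return (high << 32) + (cross << 16) + low


full = karatsuba_32x32(0xFFFFFFFF, 0xFFFFFFFF)
```

Compared with the four partial products of the schoolbook method, this form trades one multiplication for extra additions and subtractions, which matches the observation above that the multi-cycle multiplier performs more addition.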
To validate the efficiency of our multiplier, we synthesized Pete for a 45 nm technology library with various multiplier configurations and measured the power of each using the methods explained in the Methodology section. Compared to Pete with a traditional operand-scanning, multi-cycle multiplier with the same latency, our measurements showed a 4.69% decrease in dynamic power and a 3.47% increase in static power. Because dynamic power dominates, Karatsuba's technique yielded an average power savings of 3.52%. Compared to Pete with a parallel pipelined multiplier, as found in many modern 32-bit RISC cores, our Karatsuba multi-cycle multiplier demonstrated a 10.6% and 28.4% improvement in dynamic and static power, respectively. This equates to a 13.4% power savings overall. Further investigation is necessary to determine how much energy savings this yields.
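The overall savings quoted above are weighted averages of the dynamic and static changes, with the weights given by each component's share of the comparison design's total power. Assuming that simple weighting holds, the implied dynamic share can be back-solved from the reported percentages, as in the Python sketch below. The function name is hypothetical, and because the inputs are rounded, the results are approximate.

```python
def implied_dynamic_share(dyn_saving, static_saving, total_saving):
    """Solve f * dyn + (1 - f) * static = total for f.

    Savings are fractions of the comparison design's power; a negative
    value denotes an increase. f is the dynamic share of total power.
    """
    return (total_saving - static_saving) / (dyn_saving - static_saving)


# Versus the operand-scanning multiplier: 4.69% dynamic decrease,
# 3.47% static increase, 3.52% overall savings.
share_vs_operand_scanning = implied_dynamic_share(0.0469, -0.0347, 0.0352)

# Versus the parallel pipelined multiplier: 10.6% dynamic and 28.4%
# static improvement, 13.4% overall savings.
share_vs_parallel = implied_dynamic_share(0.106, 0.284, 0.134)
```

Both cases imply that dynamic power accounts for roughly 84% to 86% of total power in the comparison designs, in line with the statement that dynamic power dominates.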
VII. CONCLUSIONS
In conclusion, we have provided a thorough analysis of the design space for ultra-low energy asymmetric cryptography. We began by evaluating the energy per asymmetric cryptographic operation (sign + verify) on an efficient baseline architecture centered around a pipelined RISC processor. We then included simple, yet beneficial instruction set extensions to our microarchitecture and evaluated the improvement in terms of energy per operation compared to baseline. Finally, we introduced a novel, microcoded, state machine-based, finite field arithmetic accelerator to our microarchitecture and measured the energy per operation against the baseline and the ISA extensions.
Our analysis showed that the energy benefit of hardware acceleration increases substantially as the required level of security is increased. We also demonstrated that, depending on the energy cost of instruction reads, the accelerated microarchitecture can reduce power as well as execution time, which amplifies the advantages of hardware acceleration when considering an energy-delay product.
