Abstract: While public-key cryptography is essential for secure communications, the energy cost of even the most efficient algorithms based on Elliptic Curve Cryptography (ECC) is prohibitive on many ultra-low energy devices such as sensor network nodes and identification tags. Although an abundance of hardware acceleration techniques for ECC have been proposed in the literature, little research has focused on understanding the energy benefits of these techniques. Therefore, we evaluate the energy cost of ECC on several different hardware/software configurations across a range of security levels. Our work comprehensively explores implementations of both GF(p) and GF(2^m) ECC, demonstrating that GF(2^m) provides a 1.31 to 2.11 factor improvement in energy efficiency over GF(p) on an extended RISC processor. We also show that including a 4KB instruction cache in our system can reduce the energy cost of ECC by as much as 30%. Furthermore, our GF(2^m) coprocessor achieves a 2.8 to 3.61 factor improvement in energy efficiency compared to instruction set extensions and significantly outperforms prior work.
I. INTRODUCTION
Since its introduction in 1976, public-key cryptography has been an essential component for secure communications [1]. Unfortunately, due to its high computational complexity, public-key cryptography is particularly challenging to implement on resource-constrained embedded systems. Compared to legacy cryptosystems, such as RSA, Elliptic Curve Cryptography (ECC) has been identified as a computationally efficient class of public-key algorithms [2]. However, even ECC is a computational burden on many ultra-low energy devices found in sensor networks and identification tags [3], [4]. Consequently, a number of studies have proposed techniques for accelerating ECC in hardware. Although much of this work has focused on the performance of ECC in embedded systems, only a few studies have investigated its energy efficiency [3], [5], [6], [7], [8], [9]. Of those few, none has provided a comprehensive evaluation of both prime fields, GF(p), and binary fields, GF(2^m), including up to 521- and 571-bit implementations.
In this study, we compare the energy benefits of various levels of hardware acceleration for ECC. The main objective of hardware acceleration is to drastically improve the computational efficiency of an algorithm's implementation, while maintaining a necessary degree of reconfigurability. To accomplish this, the computationally intense portions of the algorithm are mapped into hardware, while the reconfigurable portions remain in software. For ECC, the finite-field arithmetic, i.e., GF(p) or GF(2^m), comprises the majority of the computational complexity and is, therefore, a good candidate for acceleration. Depending on how much of the field math is implemented in hardware, there are varying degrees of ECC acceleration, each of which has a different energy requirement.
Our prior work compares the energy cost of GF(p) ECC for three hardware/software configurations [10]. We started with a baseline system that consists of a small RISC processor, referred to as "Pete," and a minimal memory configuration. We then extended Pete with prime-field Instruction Set Extensions (ISEs) proposed by Großschädl et al. [11]. Finally, we introduced a reconfigurable, prime-field accelerator, referred to as "Monte." Our goal in this work is to provide further insight into the energy cost of both GF(p) and GF(2^m) ECC. To do so, we augment Pete with GF(2^m) ISEs and compare the energy cost of binary-field ECC with and without ISEs. Then, we design a non-configurable, binary-field coprocessor and evaluate the energy improvement.
In addition, we evaluate the energy benefit of an instruction cache when performing ECC. In prior work, Pete did not include an instruction cache; however, instruction fetch from the ROM was shown to be a significant energy cost during an ECDSA signature and verification operation [10]. Thus, we develop a parameterizable instruction cache and integrate it into Pete. For our comparison of GF(2^m), we use the five binary fields recommended by the National Institute of Standards and Technology (NIST). Moreover, we extend our prior work to include the NIST 224- and 521-bit prime fields. The contributions of this work are summarized as follows:
• Development of an improved GF(2^m) coprocessor, which provides a 2x speedup vs. similarly configured prior work [12]
• Energy and performance evaluation across a range of ECC key sizes, including GF(p) 521-bit and GF(2^m) 571-bit
• Evaluation of the energy benefit of an instruction cache in the context of ECC
II. ALGORITHMS AND SOFTWARE

A. ECDSA
We use the Elliptic Curve Digital Signature Algorithm (ECDSA) as our benchmark because it is standardized and used in protocols such as OpenSSL [2], [13]. ECDSA defines a signature and a verification operation, both of which we examine in order to understand the cost of an SSL handshake. Figure 1 depicts the computational hierarchy for ECDSA with finite-field arithmetic at the foundation. Finite-field arithmetic is essentially addition, subtraction, multiplication, and inversion on a finite set of elements. In terms of clock cycles per operation, field inversion is the most costly, with multiplication coming in second. The number of field inversions required is kept to a minimum, however, making multiplication the most costly operation overall. Significantly, when we accelerate ECC, the finite-field arithmetic is the portion of the algorithm that gets mapped into hardware, while the rest remains in software and is, consequently, reconfigurable. Next in the computational hierarchy are the point addition and doubling algorithms that perform mathematical operations on an elliptic curve over a finite field. For ECC, the underlying field can be either prime or binary, i.e., GF(p) or GF(2^m), respectively. Both field types have endorsements by NIST. Mathematically speaking, the points on the curve, together with a point at infinity (i.e., the identity element), form an Abelian group under point addition. Although an elliptic curve is described in two dimensions with the Weierstraß equation, practical implementations use a three-dimensional coordinate system to avoid costly field inversions. For our GF(p) implementations, we use mixed Jacobian-Affine coordinates, while for GF(2^m), we use mixed López-Dahab-Affine coordinates. These coordinate systems are optimal in that they require the fewest field operations for their respective curves [14].
For field inversion on our ISE architectures, we use the extended Euclidean algorithm, while our coprocessor-extended architectures use Fermat's little theorem.
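As a minimal sketch of Fermat-based inversion, the following Python fragment computes a^(2^m − 2) in the NIST 163-bit binary field using only field multiplications and squarings. The function names, the bit-serial multiplier, and the choice of the 163-bit reduction polynomial are illustrative assumptions; the fragment does not describe the actual instruction sequence issued to the coprocessor.

```python
# Assumed field: NIST 163-bit binary field, f(x) = x^163 + x^7 + x^6 + x^3 + 1
M = 163
F = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1


def gf2m_mul(a, b, m=M, f=F):
    """Bit-serial multiplication of two binary polynomials modulo f(x)."""
    c = 0
    while b:
        if b & 1:
            c ^= a            # carry-less addition is a bitwise XOR
        b >>= 1
        a <<= 1
        if (a >> m) & 1:      # degree reached m, so subtract (XOR) f(x)
            a ^= f
    return c


def gf2m_inv_fermat(a, m=M, f=F):
    """Inverse of a nonzero element as a^(2^m - 2).

    The exponent 2^m - 2 equals 2 + 4 + ... + 2^(m-1), so the result is
    the product of a^(2^i) for i = 1 .. m-1.
    """
    if a == 0:
        raise ZeroDivisionError("zero has no multiplicative inverse")
    result = 1
    power = gf2m_mul(a, a, m, f)               # a^2
    for _ in range(m - 1):
        result = gf2m_mul(result, power, m, f)
        power = gf2m_mul(power, power, m, f)   # next a^(2^i)
    return result


a = 0x1234567
a_inv = gf2m_inv_fermat(a)
```

Because this form of inversion reuses the multiplier and the squaring unit, it needs no dedicated inversion hardware, which is one plausible reason to prefer it on a coprocessor.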
Continuing up the hierarchy, we have the scalar point multiplication algorithms. An ECDSA signature requires a single scalar point multiplication (X = kP), while a verification requires a twin scalar point multiplication (X = u_1 P + u_2 Q). For a single scalar point multiplication, we use a sliding-window algorithm that uses two pre-computed points (3P and 5P) and takes advantage of the fact that point subtraction is only marginally more costly than addition. For the twin scalar point multiplication, we use an algorithm that precomputes P − Q and P + Q and then simultaneously scans both multipliers (u_1 and u_2). In such a case, the cost of a twin scalar point multiplication is less than two single scalar point multiplications [15]. We evaluated Montgomery scalar point multiplication for use with our binary-field coprocessor but found the algorithm to be more costly in terms of performance and energy compared to the sliding-window algorithm [16].
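The simultaneous scan of both multipliers can be sketched as follows. This is a simplified, unsigned variant that precomputes only P + Q, whereas the algorithm described above also uses P − Q. To keep the fragment self-contained, it substitutes integers modulo a small number for elliptic-curve points; the function and parameter names are hypothetical.

```python
def twin_scalar_mul(u1, P, u2, Q, add, double, identity):
    """Compute u1*P + u2*Q with one shared double-and-add pass."""
    table = {(1, 0): P, (0, 1): Q, (1, 1): add(P, Q)}   # precomputation
    R = identity
    nbits = max(u1.bit_length(), u2.bit_length())
    for i in range(nbits - 1, -1, -1):
        R = double(R)                                    # one doubling per bit
        bits = ((u1 >> i) & 1, (u2 >> i) & 1)
        if bits != (0, 0):
            R = add(R, table[bits])                      # at most one addition per bit
    return R


# Stand-in group (assumption): integers modulo N under addition.
N = 1009
group_add = lambda a, b: (a + b) % N
group_double = lambda a: (2 * a) % N

X = twin_scalar_mul(123, 5, 456, 7, group_add, group_double, 0)
```

The shared pass performs one doubling per bit position rather than one per bit of each multiplier, which is where the saving over two independent scalar multiplications comes from.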
Encompassed within an ECDSA signature and verification are also mathematical operations performed modulo the prime group order of the curve. These operations are done in addition to the scalar point multiplications to complete either a signature or a verification operation. For inversion modulo the group order, we implement the extended Euclidean algorithm on Pete for all hardware/software configurations.
B. GF(p) and GF(2^m) Multiplication
Various algorithms exist for finite-field multiplication. For each architecture, we selected the most energy-efficient multiplication algorithm. For the baseline architecture, we use operand-scanning multiplication with NIST fast reduction. For the ISE architecture, we use product-scanning multiplication with NIST fast reduction. For full GF(p) acceleration, Monte uses Montgomery multiplication, which interleaves the reduction [10]. For full GF(2^m) acceleration, Billie, the binary-field coprocessor described in Section III-D, uses a hardware implementation of fast reduction interleaved into the multiplication.
C. Software build/run-time environment
We used crosstool-NG 1.18.0 to compile our build environment, which includes the GNU Compiler Collection (gcc) 4.7.2 and Binutils 2.23. The executable binaries used for our evaluation were compiled and statically linked to newlib. Unless stated otherwise, the algorithms mentioned here were developed in C++. For the instruction set extensions in Section III-B and coprocessor instructions in Section III-D, we modified the mips-opc.c source file to include these supplementary instructions and recompiled Binutils.
The run-time environment for our study was a bare-metal (i.e., no OS) environment representative of a low-power, embedded microcontroller. Instructions and initialization data are read directly out of ROM. A minimal amount of RAM is supplied for stack, heap, and miscellaneous data sections.
III. HARDWARE
A. Baseline
Pete is a five-stage pipelined RISC processor without a memory management unit. For our baseline, we chose a minimal memory configuration representative of a low-power embedded system with 256KB of program ROM, 16KB of RAM and no cache.
B. Instruction Set Extensions
ISEs are simply additional instructions that enhance the execution of specific algorithms. Großschädl et al. evaluated the performance gain of ISEs for both GF(p) and GF(2^m) ECC on various RISC platforms [11], [17]. Our prior work investigated the energy benefit of only GF(p) ISEs, whereas here we compare both GF(p) and GF(2^m).
For accelerating ECC, Großschädl et al. recommend six instructions (shown in Table I) [17]. These instructions use three 32-bit registers as an accumulator, allowing for more efficient product-scanning multiplication. The first three instructions are specifically for GF(p), while the last two are for GF(2^m). Binary-field arithmetic is carry-less computation, i.e., add is simply a bitwise XOR. Because most instruction sets include an XOR instruction and carry-less add does not require a reduction operation, binary-field addition in software is much faster than its prime counterpart. Unfortunately, the same is not true for multiplication, because most instruction sets do not include support for a carry-less multiply. Consequently, the binary-field multiplication must be inefficiently emulated with shift and XOR operations, rendering software-only implementations of binary-field ECC impractical for most embedded processors. In such a case, ISEs can provide a dramatic improvement. mulgf2 is a 32-bit by 32-bit carry-less multiply, i.e., the binary-field equivalent of the mul instruction in MIPS. Note that this operation is represented with ⊗. maddgf2 is a carry-less multiply-accumulate instruction, i.e., the binary-field equivalent of the maddu instruction in Table I. Here, we use ⊕ to mean carry-less add. sha enables access to the contents of the overflow register and provides an arithmetic shift necessary for multiplication and squaring.
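To illustrate why software emulation of binary-field multiplication is slow, the sketch below models a 32-bit by 32-bit carry-less multiply with shifts and XORs and then uses it in a product-scanning polynomial multiplication. The function names are hypothetical. The Python model only mirrors the arithmetic that the text attributes to mulgf2 and maddgf2; it is not the ISE implementation, and it omits the field reduction step.

```python
MASK32 = 0xFFFFFFFF


def clmul32(a, b):
    """32x32 -> 64-bit carry-less product, emulated with shift and XOR."""
    assert 0 <= a <= MASK32 and 0 <= b <= MASK32
    r = 0
    for i in range(32):          # one conditional XOR per multiplier bit
        if (b >> i) & 1:
            r ^= a << i
    return r


def poly_mul_product_scanning(A, B):
    """Multiply two binary polynomials stored as equal-length lists of
    32-bit words (least significant word first); returns 2n words."""
    n = len(A)
    assert len(B) == n
    out = []
    acc = 0                      # accumulator for one column of partial products
    for k in range(2 * n - 1):
        for i in range(max(0, k - n + 1), min(k, n - 1) + 1):
            acc ^= clmul32(A[i], B[k - i])   # multiply-accumulate without carries
        out.append(acc & MASK32)             # emit the finished result word
        acc >>= 32                           # keep the upper part for the next column
    out.append(acc & MASK32)
    return out


product = poly_mul_product_scanning([0xFFFFFFFF, 0x1], [0x2, 0x0])
```

In this model every word-level multiply costs up to 32 shift/XOR steps, whereas a dedicated carry-less multiply instruction would replace the whole inner loop of `clmul32` with a single operation.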
To implement the binary-field ISEs, we had to modify Pete's instruction decode unit and the Karatsuba multiply-accumulate unit. The modifications to the Karatsuba multiply-accumulate unit are highlighted in Figure 2. The most notable change is the inclusion of a 16-bit by 16-bit carry-less multiplication unit. Rather than overcomplicating the design with a signed multiplication block that also supports carry-less multiply, we chose to multiplex between the two multiplication units depending on the computation mode. For the four-port addition unit, we designed a dual-mode adder that supports normal addition and carry-less addition. We used a similar design for the 16-bit subtraction units at the top of Figure 2. Fortunately, no other modifications to the datapath were required. For the top-level FSM, we added a control signal that selects the correct computation mode.
C. Instruction Cache
Previously, we made the observation that instruction fetch from the program ROM is a significant energy cost [10]. Three factors contribute to the relatively high energy cost of instruction fetch. First, RISC processors, like Pete, fetch an instruction from memory on almost every clock cycle, causing a large number of reads from program ROM. Second, the energy cost per read of a memory depends on the size of memory, so larger memories consume more energy. Finally, compared to the other memory components in the system, the program ROM is the largest by far.
To reduce the energy effects of instruction fetch, we implemented a simple direct-mapped cache in Verilog and integrated it into our embedded system. The cache lines in our design hold four 32-bit words each (16 bytes wide), but the number of cache lines is parameterizable, allowing us to experiment with different cache sizes. We added an instruction cache and expanded the program ROM port to 128 bits, which allows an entire cache line to be filled at once. In a System on a Chip (SoC), the program ROM is fabricated on the same silicon die as the processor logic. This makes wider ports to memory far less expensive in terms of energy compared to off-chip memory. The primary advantage of a 128-bit program ROM port in our system is a decrease in the miss penalty, which ultimately decreases the energy wasted while Pete is waiting for the correct cache block. To reduce the number of wires and further reduce the cost of ROM access, we made the ROM single-ported.
The changes to the ROM interface and the inclusion of an instruction cache require a slightly more complicated memory system. The data bus from Pete still needs 32-bit access to the program memory. Accordingly, we added data and instruction buffers to transition from a 128-bit memory port to a 32-bit bus. Furthermore, we had to include arbitration in our ROM controller in order to multiplex the single port. This means that the data and instruction buses, as well as the instruction cache, must contend for access to the program ROM. Although this presents a structural hazard in our system, it has no noticeable impact on performance once the software system has been initialized.
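The behavior of a direct-mapped instruction cache with 16-byte lines can be sketched with a small hit/miss model. The class and attribute names are hypothetical, and the model tracks only tags, not instruction data or timing, so it is a rough illustration rather than a description of the Verilog design.

```python
class DirectMappedICache:
    """Tag-only model of a direct-mapped cache with 16-byte lines."""

    LINE_BYTES = 16   # four 32-bit words per line

    def __init__(self, size_bytes=4096):
        self.num_lines = size_bytes // self.LINE_BYTES   # parameterizable size
        self.tags = [None] * self.num_lines
        self.hits = 0
        self.misses = 0

    def fetch(self, addr):
        """Record one instruction fetch; return True on a hit."""
        line_addr = addr // self.LINE_BYTES
        index = line_addr % self.num_lines    # which cache line to check
        tag = line_addr // self.num_lines     # identifies the ROM block held there
        if self.tags[index] == tag:
            self.hits += 1
            return True
        # Miss: the whole line is refilled by one 128-bit ROM read.
        self.tags[index] = tag
        self.misses += 1
        return False


# A 64-instruction loop (256 bytes) executed ten times.
cache = DirectMappedICache(4096)
for _ in range(10):
    for pc in range(0, 256, 4):
        cache.fetch(pc)
```

In this example only the 16 first-touch line fills reach the program ROM; the remaining 624 fetches are served by the much smaller cache, which is the effect the energy savings depend on.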
D. Binary-field Accelerator
As previously discussed, binary fields, i.e., GF(2^m), are advantageous in that addition does not require carry propagation. Thus, custom hardware implementations can perform addition over the entire length of a field element in a single cycle, without lengthening the critical path. This lends itself to computationally efficient digit-serial multiplication with field-specific reduction [18]. Furthermore, binary-field squaring can be performed simply with a handful of XOR gates when the binary field is fixed [14]. Therefore, we designed and evaluated a GF(2^m) accelerator for further energy efficiency. Figure 3 shows the top-level diagram of "Billie," the binary accelerator, with Pete.
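The reason squaring is so cheap in a fixed binary field is that it is a linear operation: the input bits are spread apart with zeros in between, and the result is reduced by the fixed polynomial. The sketch below models this in Python for the NIST 163-bit field; the function names and the generic reduction loop are assumptions made for illustration, whereas hardware for a fixed field would hard-wire the same XOR network.

```python
# Assumed field: NIST 163-bit binary field, f(x) = x^163 + x^7 + x^6 + x^3 + 1
M = 163
F = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1


def gf2m_reduce(c, m=M, f=F):
    """Reduce a binary polynomial modulo f(x), highest bit first."""
    for i in range(c.bit_length() - 1, m - 1, -1):
        if (c >> i) & 1:
            c ^= f << (i - m)     # cancel bit i using a shifted copy of f(x)
    return c


def gf2m_square(a, m=M, f=F):
    """Square a field element: bit i of a moves to bit 2*i, then reduce."""
    spread = 0
    i = 0
    while a >> i:
        if (a >> i) & 1:
            spread |= 1 << (2 * i)
        i += 1
    return gf2m_reduce(spread, m, f)


a = 0x123456789ABCDEF
b = 0x0FEDCBA987654321
sq_sum = gf2m_square(a ^ b)
```

Since no partial products are formed, each output bit is just the XOR of a few fixed input bits, which matches the observation that a handful of XOR gates suffice when the field is fixed.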
1) Coprocessor Instructions:
Prior research has suggested that the communication between the main processor and the binary-field accelerator can significantly limit performance [12]. Thus, Billie utilizes the MIPS coprocessor interface for instructions and control data to reduce this potential bottleneck. In such cases, Pete fetches binary-field instructions and feeds them directly to Billie at a high rate. Similar to the configuration with Monte, Pete and Billie share the dual-port RAM to eliminate inefficiencies caused by processor-to-accelerator data transfers. Table II lists the instructions added to Pete in support of the binary-field coprocessor.
Billie is a load-store architecture, so cop2ld and cop2st are used to move data to and from her 16-entry register file. Specifically, cop2ld loads a multi-precision field element from memory, starting at the address referenced by the rt GPR, into the Billie Register (BR) specified by fs.¹ Conversely, cop2st stores a field element from the fs BR into memory, starting at the address referenced by the rt GPR. Continuing the load-store concept, the binary-field arithmetic instructions pull input data from and write results back into Billie's register file. For multiplication and addition, cop2mul and cop2add follow the three-operand instruction format where fs and ft are the input operands, and fd is the result operand. Because squaring is a unary operation, the cop2sqr instruction only requires a two-operand format where fs is the input operand, and fd is the result operand. Finally, cop2sync provides synchronization between Pete and Billie, typical of parallel processing systems.
¹ The General Purpose Registers (GPRs) are part of Pete's register file.
2) Microarchitecture
other NIST binary fields in order to investigate scalability. A view of the microarchitecture is illustrated in Figure 3. From a high level, our design for Billie takes a similar approach to the original IBM 360 floating point unit [19]. Notable features include an instruction queue, register file, load/store unit and separate functional units for multiplication, squaring and addition.
Coprocessor instructions fetched by Pete are first buffered in Billie's four-entry instruction queue. This avoids stalling Pete while the longer-latency binary-field instructions execute. Note that we varied the depth of the queue and found no significant improvement beyond four. When an instruction is at the head of the queue, the logic decodes it and checks for structural and data hazards. A structural hazard exists when the appropriate functional unit is currently busy, while a data hazard exists when the input operands have not yet been stored in the register file. If no hazards exist, the operands will be read from the register file, and the instruction will dispatch to the corresponding functional unit. On the next clock cycle, the instruction will begin executing and, once complete, will remain in the functional unit until the result has been written back into the register file. In this architecture, reads from the register file are prioritized over writes. Thus, write-back of the result will occur when an instruction is not being dispatched.
To reduce structural hazards, the register file has two read/write ports (dual-port). The data paths between the register file and the functional units are 163 bits wide. Multiple functional units require write access to the register file, so the port multiplexor must also perform arbitration. We chose a simple scheme in which each functional unit is statically assigned a port for writing into the register file. For instance, the multiplier and squaring unit share one port, while the adder and load/store unit share the other. If both functional units assigned to the same port are ready to write during a given clock cycle, the arbiter will allow one to write and stall the operation of the other. For simplicity, the priorities of each functional unit are fixed. In our design, the multiplier and adder have priority over the squaring and load/store units, respectively.
The register file contains sixteen 163-bit registers to accommodate all intermediate computations for a scalar-point multiplication. Because we use sliding-window algorithms that leverage some precomputation, we require twice the number of registers as compared to Guo et al. [12]. However, as shown in Section IV, the extra registers yield a significant performance advantage and have the potential to save energy. The load/store unit is responsible for transferring binary-field elements between Billie's register file and shared memory. The interface to shared memory is 32 bits wide, while the interface to the register file is a field width (i.e., 163 bits for this particular configuration). Thus, the load/store unit serves as a buffer between these two mismatched ports and is analogous to Monte's DMA unit.
3) GF(2^m) Arithmetic Units: For GF(p) computation, the propagation of arithmetic carries from the least significant bit position to the most within multiplication and addition typically becomes the clock-rate limiting critical path. From an implementation standpoint, full field-width GF(p) computation is impractical, especially when considering the larger NIST fields. Thus, field elements are broken up into smaller words, and computation proceeds at that granularity (i.e., multi-precision). For GF(2^m) computation, carry propagation does not exist, so full field-width addition is possible and advantageous. Compared to multi-precision computation, full field-width GF(2^m) arithmetic requires less complex logic and scales more easily to increasing field widths. The hardware scalability is a consequence of data-level parallelism afforded by the carry-less computation.
The arithmetic units within Billie take advantage of this parallelism by performing addition over an entire m-bit binary polynomial in a single clock cycle. Because addition is fast, we employ digit-serial multiplication that iterates over the multiplier, shifting and adding the multiplicand into an accumulator accordingly [20]. Specifically, Algorithm 1 describes the multiplication operation in detail, where a(x) is the multiplicand, b(x) is the multiplier, c(x) is the accumulator, and D is the digit width. D has a maximum width of 156, 159, 271, 322, or 561 bits, corresponding to the five NIST binary fields [18]. As shown, Step 1 zeros out the accumulator. Initially, Step 3 multiplies the least significant digit of the multiplier (B_0) by the multiplicand and adds the result to the accumulator. Concurrently, Step 4 shifts the multiplicand D bits to the left and reduces the result modulo f(x), the irreducible polynomial. Note that this algorithm integrates the polynomial reduction into the multiplication. Steps 3 and 4 repeat with the next significant digit of the multiplier until the multiplication is complete. The final step reduces the (m+D−1)-bit result to m bits with f(x).
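The multiplication steps described above can be sketched in Python as follows. This is a software model under stated assumptions: the helper names (`clmul`, `gf2m_reduce`, `digit_serial_mul`) are hypothetical, the field is fixed to the NIST 163-bit polynomial for the example, and the loop executes Steps 3 and 4 sequentially even though the hardware performs them concurrently.

```python
# Assumed field: NIST 163-bit binary field, f(x) = x^163 + x^7 + x^6 + x^3 + 1
M = 163
F = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1


def clmul(a, b):
    """Carry-less product of two binary polynomials (no reduction)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def gf2m_reduce(c, m=M, f=F):
    """Reduce a binary polynomial modulo f(x), highest bit first."""
    for i in range(c.bit_length() - 1, m - 1, -1):
        if (c >> i) & 1:
            c ^= f << (i - m)
    return c


def digit_serial_mul(a, b, D, m=M, f=F):
    """Multiply a(x) by b(x) modulo f(x), consuming D bits of b per step."""
    c = 0                                    # Step 1: clear the accumulator
    digit_mask = (1 << D) - 1
    for i in range(-(-m // D)):              # ceil(m / D) digits of the multiplier
        digit = (b >> (i * D)) & digit_mask  # B_i, least significant digit first
        c ^= clmul(digit, a)                 # Step 3: accumulate B_i * a(x)
        a = gf2m_reduce(a << D, m, f)        # Step 4: a(x) <- a(x) * x^D mod f(x)
    return gf2m_reduce(c, m, f)              # final reduction to m bits


a = 0x123456789ABCDEF
b = 0x0FEDCBA987654321
results = [digit_serial_mul(a, b, D) for D in (1, 3, 4, 8)]
```

The digit width D trades area for latency: the loop runs ceil(m / D) times, so a wider digit finishes in fewer iterations at the cost of a larger partial-product network, while the result is the same for every D.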
IV. EVALUATION
A. Methodology
We evaluated the energy required for an ECDSA signature and verification with various hardware/software configurations assuming a 45nm technology node. To estimate the energy consumed in the processor logic, we synthesized our HDL code for a TSMC cell library with Synopsys Design Compiler. Then, we simulated the post-synthesis logic and used PrimeTime to estimate the power based on the simulated logic transitions [21], [22]. For the memories, we used HP Cacti to estimate the energy per read/write as well as the static power [23]. Then, we simulated our HDL with Verilator to determine the run time in clock cycles of each operation, in addition to the number of reads and writes to and from each memory [24].
B. GF(p) and GF(2^m) Energy
We thoroughly cover the design space by evaluating six hardware/software configurations across the five NIST-recommended prime fields and binary fields. The systems we evaluate are listed below:
1) Baseline RISC processor (Pete) without any additional hardware support for ECC.
2) Pete with GF(p) ISEs.
3) Pete with instruction cache and GF(p) ISEs.
4) Pete with GF(2^m) ISEs in addition to GF(p) ISEs.
5) Pete with reconfigurable GF(p) accelerator (Monte).
6) Pete with non-configurable GF(2^m) accelerator (Billie).
Figure 4 summarizes our results by plotting the energy per operation (a signature and a verification) for GF(p) and GF(2^m) of equivalent security.² Note that we do not show the results for GF(2^m) on our baseline here because the energy cost is too high (7700 μJ for 571-bit). Also, notice that the energy consumption for GF(p) 521-bit actually goes off the chart (1800 μJ), demonstrating the need for hardware assistance at the larger field sizes. For GF(p) ISEs, we observe a 1.32 to 1.48 factor improvement in energy efficiency over baseline, and if we add a 4KB instruction cache, we see an additional improvement of up to 30%. Overall, this equates to a 1.67 to 2.08 factor improvement in energy efficiency. Note that for ECDSA on Pete, we found that 4KB is the energy-optimal instruction cache size. Comparing GF(2^m) and GF(p) ISEs, we see that GF(2^m) is 1.31 to 2.11 times more energy efficient. For full acceleration of GF(p) with Monte, we observe a 3.93 to 4.75 factor improvement over GF(p) ISEs and 5.17 to 6.34 over baseline. Finally, for full GF(2^m) acceleration with Billie, we observe a 1.94 factor improvement over Monte for 163-bit. However, as we move out to larger field sizes, the energy cost for Billie converges with that of Monte. Thus, our binary-field accelerator is not scaling well past 163-bit.
Figure 5 shows the energy consumption per operation broken down into subcomponents across the security spectrum for our ISE architectures. On the left is Pete with GF(p) ISEs, while in the middle is Pete with GF(p) ISEs and a 4KB instruction cache. On the right is Pete with GF(2^m) ISEs. In each of these runs, Pete's power remains fairly constant, only slightly decreasing with the use of GF(2^m).
Therefore, most of the energy variation with Pete is due to changes in execution time. For the ISE architectures without an instruction cache, the program ROM makes up nearly half of the energy. When we add an instruction cache, we end up trading some additional energy in the "Uncore" region for a drastic reduction in energy for the program ROM. Note that the Uncore region here refers to the memory interconnection logic and includes the instruction cache. Also, Pete's energy consumption increases slightly when using an instruction cache. This is primarily due to energy being consumed during idle cycles while the instruction cache fetches from the program ROM.
² Equivalent security is defined per NIST SP 800-57.
For the RAM, the dynamic energy is decreased as the number of accesses is reduced with GF(2^m). Likewise, the static energy slightly decreases with reduced execution time. One interesting thing to note is the lower computational complexity with GF(2^m) support. At the smallest key size, the binary field is smaller than the prime field, and the binary ISE configuration consumes 52.5% less energy than its prime counterpart. Moving to greater security, the field sizes cross over such that the binary field is larger (e.g., 283 compared to 256). However, in this case, the binary-field implementation still consumes 46.5% less energy. At the largest key size, the binary field is considerably larger than its prime counterpart, and consequently, this configuration yields the lowest improvement (23.8% less energy). For all other field sizes, the GF(2^m) extended design requires the least amount of energy. Thus, coupling GF(2^m) support with an instruction cache would yield the lowest energy consumption without a coprocessor. For future work, we intend to evaluate this data point. Figure 6 shows an energy breakdown for the architectures with Monte and Billie. As the level of hardware acceleration increases, the energy consumed by the RAM decreases. Each level of hardware acceleration serves to reduce the access to RAM, decreasing dynamic energy. Likewise, each technique reduces the execution time, which decreases static energy. As illustrated in Figure 6, Billie reduces the RAM energy even further by keeping the entire scalar point multiplication within her register file. The majority of the RAM accesses when Billie is used are from Pete while performing the additional computations modulo the group order.
The program ROM with full acceleration is an insignificant energy consumer because Pete is idle during the majority of a scalar point multiplication; hence, an instruction cache would not be beneficial. It is interesting to note that when Monte is used, Pete is still consuming most of the energy, despite being idle for the majority of the computation. There are two reasons for this: First, we are not using clock or power gating techniques because Pete is still fetching instructions for Monte, so Pete's clock network remains active. Second, Monte has less logic overhead than Pete, and the size is fixed, regardless of the field being used. Billie on the other hand, is the primary consumer of energy when used. Unfortunately, we were unable to effectively model Billie's register file with Cacti due to its non-standard access width. Thus, we had to synthesize the register file with flip-flops, which makes for an inefficient implementation. Furthermore, Billie is designed to scale in hardware with the field size and is consequently much larger than Pete. For example, the 163-bit implementation requires 45% more area than Pete, while the 571-bit implementation requires five times the area of Pete. However, when ECDSA is accelerated with Billie, on average only 43% of the execution time is spent on scalar point multiplication. The rest of the time, Billie is idle, wasting energy, while Pete performs the additional protocol operations that do not map to Billie. Specifically, inversion modulo the group order does not map to Billie, but is fairly computationally intense. Therefore, we believe that the architectures with Billie could benefit from clock gating. To model best-case clock gating in our system, we assume that Billie's dynamic power does not contribute to the overall energy consumption while Billie is idle. Although we recognize that this is an idealistic assumption, it reveals the upper bound of energy improvement. 
Figure 7 shows our results with the clock-gated system on the right. We can see that clock gating Billie when not in use accounts for a 22.0% to 32.2% reduction in energy consumption. Also, the energy benefit of clock gating goes up with increased security levels. This is due to the fact that Billie's dynamic power increases with field size, but the relative time spent performing a scalar point multiply remains fairly constant.
C. GF(p) and GF(2^m) Performance
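The best-case clock-gating assumption can be written as a simple energy model: static power is paid for the whole run, while dynamic power is paid only while the accelerator is active. The function name and all numeric inputs below are hypothetical placeholders chosen for illustration, except for the 43% active fraction quoted in the text; they are not measured values from this study.

```python
def accelerator_energy(p_dynamic_w, p_static_w, t_total_s, active_fraction, clock_gated):
    """Energy in joules under an idealized clock-gating model.

    Without gating, dynamic power is consumed for the entire run.
    With ideal gating, it is consumed only while the unit is active.
    """
    t_dynamic = t_total_s * active_fraction if clock_gated else t_total_s
    return p_dynamic_w * t_dynamic + p_static_w * t_total_s


# Placeholder inputs (assumptions): 2 mW dynamic, 0.5 mW static, 10 ms run.
ungated = accelerator_energy(2e-3, 0.5e-3, 10e-3, 0.43, clock_gated=False)
gated = accelerator_energy(2e-3, 0.5e-3, 10e-3, 0.43, clock_gated=True)
saving = 1.0 - gated / ungated
```

Under this model the saving for the accelerator alone grows with the share of dynamic power in its total power, which is consistent with the observation that the benefit increases with field size.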
To demonstrate the computational efficiency of Billie, Figure 8 shows the execution time of a scalar point multiplication versus the digit size of the multiplier. Note that our energy estimates discussed previously assumed a 3-bit digit size for the GF(2^m) multiplication unit because it was shown to be energy-optimal in prior work [18]. For comparison, we graph prior work by Guo et al. that attempts to eliminate control bottlenecks by integrating an 8-bit microprocessor into their GF(2^m) accelerator [12]. We plot points of prior work that were specifically noted to be energy optimal and for which we have an equivalent implementation. Note that we graph results for the sliding-window algorithm as well as the Montgomery scalar point multiplication. In all cases, our Montgomery algorithm implementation outperforms prior work. The coprocessor interface improves efficiency by queuing Billie instructions and allowing Pete to continue fetching instructions in parallel with the field math computation. Unlike Guo et al., each functional unit within Billie has its own set of operand registers such that field operations can run in parallel and complete out of order. Furthermore, Billie's register file is dual-ported in order to avoid bottlenecks associated with operand fetch and write back. Additionally, our sliding-window algorithm implementation outperforms both Montgomery implementations by a significant margin. We feel this comparison to prior work is fair because the functional units in our work are similarly designed [18].
The increased performance of the sliding-window algorithm is responsible for some of the energy efficiency gain in our work. Increased performance leads directly to a shorter run time and a shorter run time leads to a lower amount of energy lost due to static power. The register file in Billie also allows flexibility in algorithm design. Individually, both the sliding-window algorithm used for single-point multiplication (signature) and the twin-point multiplication (verification) fit in the storage space of Billie, precomputed points included.
V. CONCLUSION
In this paper, we evaluated several implementations of GF(p) and GF(2^m) ECC in terms of energy efficiency. We began by comparing GF(p) and GF(2^m) ISEs on a RISC processor, showing that GF(2^m) provides a 1.31 to 2.11 factor improvement in energy efficiency over GF(p). Then, we included a 4KB instruction cache in our system and showed that it can reduce the energy cost of ECC by up to 30%. Next, we evaluated our custom GF(2^m) coprocessor, demonstrating a 2.8 to 3.61 factor improvement in energy efficiency compared to ISEs and greater than 2x improvement in performance compared to similarly configured GF(2^m) accelerators. Finally, we estimated the benefit of clock gating Billie, showing a further energy reduction of up to 32.2% for 571-bit.
Our evaluation has revealed some interesting avenues for future work. First, we discovered that over half of Billie's energy is being consumed in the synthesized register file. Thus, we would like to evaluate the energy consumption of Billie with a register file implemented in more efficient memory (SRAM) technology, rather than flip-flops. To do so, we plan on modeling the register file with HSPICE, using a 45nm transistor model [25]. Second, we found that when accelerating GF(2^m), the arithmetic operations modulo the group order (inversion specifically) become the limiting factor, because they do not map to the accelerator. In computer design terminology, Amdahl's law strikes again [26]. Therefore, we plan on investigating the energy impact of various methods for accelerating modular inversion. Finally, we found that our binary-field accelerator does not scale well in terms of energy efficiency. This is primarily due to the increase in power consumption as the field size increases. As a result, we plan on experimenting with divide-and-conquer algorithms in software that would facilitate larger field size computation on a smaller variant of Billie.
